
Projects
Developed the SYCL/DPC++ backend for llama.cpp, ported from the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation (see the port sketch after this entry).
Co-worked with the community maintainers and responded to Intel-related issues.
Published blog: Run LLMs on Intel GPUs Using llama.cpp <medium.com/intel-analytic…>
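For illustration only: a minimal sketch of the CUDA-to-SYCL mapping such a port involves. This is a toy SAXPY kernel, not llama.cpp backend code; the CUDA launch in the comment is assumed for comparison.

    #include <sycl/sycl.hpp>

    // CUDA equivalent: saxpy<<<grid, block>>>(n, a, x, y);
    // with i = blockIdx.x * blockDim.x + threadIdx.x.
    // In SYCL the same per-element work maps onto a parallel_for over a
    // range, which is the core of a CUDA-to-SYCL port.
    int main() {
      sycl::queue q{sycl::gpu_selector_v};
      constexpr size_t n = 1 << 20;
      float a = 2.0f;
      float *x = sycl::malloc_shared<float>(n, q);
      float *y = sycl::malloc_shared<float>(n, q);
      for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      // One work-item per element, like one CUDA thread per element.
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        y[i] = a * x[i] + y[i];
      }).wait();

      sycl::free(x, q);
      sycl::free(y, q);
    }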
Extended Hugging Face Transformers APIs for Transformer-based models to enable collaboration with the ecosystem.
Hand-wrote highly optimized x86 assembly kernels for Intel hardware, targeting advanced compression algorithms, especially for LLMs (illustrative sketch below).
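Illustrative sketch only, written with AVX2/FMA intrinsics rather than raw assembly for readability; the function name dot_s8_f32 is hypothetical. It shows the fused dequantize-and-accumulate pattern such compression kernels implement:

    #include <immintrin.h>  // AVX2 + FMA
    #include <cstdint>

    // Dot product of fp32 activations with int8 weights, dequantized on the
    // fly with a single scale. Hand-tuned assembly would also pipeline loads
    // and use multiple accumulators; this shows only the core pattern.
    float dot_s8_f32(const float *act, const int8_t *w, float scale, int n) {
      __m256 acc = _mm256_setzero_ps();
      int i = 0;
      for (; i + 8 <= n; i += 8) {
        __m128i w8  = _mm_loadl_epi64((const __m128i *)(w + i)); // 8 x int8
        __m256i w32 = _mm256_cvtepi8_epi32(w8);                  // widen to int32
        __m256  wf  = _mm256_mul_ps(_mm256_cvtepi32_ps(w32),
                                    _mm256_set1_ps(scale));      // dequantize
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(act + i), wf, acc);
      }
      // Horizontal sum of the 8 accumulator lanes.
      __m128 lo = _mm256_castps256_ps128(acc);
      __m128 hi = _mm256_extractf128_ps(acc, 1);
      __m128 s  = _mm_add_ps(lo, hi);
      s = _mm_hadd_ps(s, s);
      s = _mm_hadd_ps(s, s);
      float r = _mm_cvtss_f32(s);
      for (; i < n; ++i) r += act[i] * (float)w[i] * scale; // scalar tail
      return r;
    }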
Developed GPU kernels for Intel client GPUs using ESIMD ("Explicit SIMD"), the SYCL extension for efficient low-level programming (see the kernel sketch after the blog links below).
Worked on weight-only-quantization (WOQ) optimization for Intel client GPUs, enabled on Windows and Linux, achieving geomean >2x performance gains compared with the plain FP16 implementation.
Blogs:
Llama 2 support on MTL <intel.com/content/www/us…>
Llama 3 day-0 support on MTL iGPU <intel.com/content/www/us…>
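A minimal ESIMD sketch, assuming shared-USM buffers; it shows the explicit-SIMD style (simd vectors with block loads/stores) that such kernels use, not the production WOQ code:

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/esimd.hpp>

    namespace esimd = sycl::ext::intel::esimd;

    int main() {
      constexpr int VL = 16;       // SIMD vector length per work-item
      constexpr size_t N = 1024;   // assume N % VL == 0
      sycl::queue q{sycl::gpu_selector_v};
      float *a = sycl::malloc_shared<float>(N, q);
      float *b = sycl::malloc_shared<float>(N, q);
      float *c = sycl::malloc_shared<float>(N, q);
      for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // Each work-item explicitly owns VL contiguous elements, unlike the
      // one-element-per-work-item style of plain SYCL.
      q.parallel_for(sycl::range<1>{N / VL},
                     [=](sycl::id<1> i) SYCL_ESIMD_KERNEL {
        size_t off = i[0] * VL;
        esimd::simd<float, VL> va = esimd::block_load<float, VL>(a + off);
        esimd::simd<float, VL> vb = esimd::block_load<float, VL>(b + off);
        esimd::block_store(c + off, va + vb);
      }).wait();

      sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    }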
Side Projects
Python package for SOTA low-bit LLM quantization; worked on the ONNX Runtime backend, which was eventually integrated by the ONNX community (quantization sketch below).
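A minimal sketch of round-to-nearest (RTN) symmetric int4 group quantization, the basic building block behind weight-only and other low-bit schemes; the names quantize_group_int4 and QuantizedGroup are hypothetical, not the package's API:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // One group of weights sharing a single scale; int4 values are stored in
    // int8 slots here for clarity (real code packs two per byte).
    struct QuantizedGroup {
      std::vector<int8_t> q;
      float scale;
    };

    // q = clamp(round(w / scale), -8, 7), with scale = max|w| / 7.
    QuantizedGroup quantize_group_int4(const float *w, int group_size) {
      float amax = 0.f;
      for (int i = 0; i < group_size; ++i)
        amax = std::max(amax, std::fabs(w[i]));
      float scale = amax > 0.f ? amax / 7.f : 1.f;
      QuantizedGroup g{std::vector<int8_t>(group_size), scale};
      for (int i = 0; i < group_size; ++i) {
        int v = (int)std::lround(w[i] / scale);
        g.q[i] = (int8_t)std::clamp(v, -8, 7);  // int4 range [-8, 7]
      }
      return g;
    }

    // At compute time the kernel dequantizes on the fly: w ≈ q * scale.
    inline float dequantize(const QuantizedGroup &g, int i) {
      return g.q[i] * g.scale;
    }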
Writing
HW-aware sparsity patterns
HW-aware cost model to predict the performance of quantization recipes