airMeng

AI Framework Engineer in Shanghai

Projects

Ongoing

Developed the SYCL/DPC++ backend for llama.cpp, adapted from the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation (a usage sketch follows this entry).

Collaborate with the community maintainers and respond to Intel-related issues.

Published blog: Run LLMs on Intel GPUs Using llama.cpp <medium.com/intel-analytic…>
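
A minimal usage sketch of running a model on an Intel GPU through this backend, assuming llama-cpp-python has been built against llama.cpp's SYCL backend; the model path and prompt are placeholders:

    # Sketch only: assumes llama-cpp-python was built with llama.cpp's SYCL backend,
    # so offloaded layers execute on an Intel GPU. Model path and prompt are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder GGUF model
        n_gpu_layers=-1,  # offload all layers to the GPU backend
        n_ctx=2048,       # context window
    )

    out = llm("Explain SYCL in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])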

Ongoing

Extending Hugging Face Transformers APIs for Transformer-based models to collaborate with the broader ecosystem.

Highly optimized, hand-written x86 assembly kernels for Intel hardware, targeting advanced compression algorithms, especially for LLMs.

Develop GPU kernels for Intel client GPUs using the ESIMD ("Explicit SIMD") SYCL extension for efficient low-level programming.

2024

Working on weight-only quantization optimization for Intel client GPUs, enabled on Windows & Linux, achieving geomean >2x performance gains compared with the standard FP16 implementation (a quantization sketch follows this entry).

Blogs:

Llama2 support on MTL <intel.com/content/www/us…>

Llama3 day0 support on MTL iGPU <intel.com/content/www/us…>
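
Illustrative only: a minimal NumPy sketch of group-wise symmetric INT4 weight-only quantization, the general technique behind the work above; the group size and function names are assumptions, not the shipped GPU kernels.

    # Sketch of group-wise symmetric INT4 weight-only quantization:
    # weights are stored in 4 bits with per-group scales, activations stay in floating point.
    # Group size and names are illustrative, not the production GPU kernels.
    import numpy as np

    def quantize_woq_int4(w, group_size=32):
        """Quantize a (rows, cols) float weight matrix to INT4 with per-group scales."""
        rows, cols = w.shape
        groups = w.reshape(rows, cols // group_size, group_size)
        scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # INT4 range [-8, 7]
        scales = np.maximum(scales, 1e-8)                          # avoid division by zero
        q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
        return q, scales

    def woq_matmul(x, q, scales):
        """Dequantize on the fly and compute y = x @ W^T."""
        w = (q.astype(np.float32) * scales).reshape(q.shape[0], -1)
        return x @ w.T

    w = np.random.randn(128, 128).astype(np.float32)
    x = np.random.randn(4, 128).astype(np.float32)
    q, s = quantize_woq_int4(w)
    print("max abs error vs FP32 matmul:", np.abs(woq_matmul(x, q, s) - x @ w.T).max())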

Side Projects

Ongoing

Python package for SOTA low-bit LLM quantization; worked on the ONNX Runtime backend, which was ultimately integrated by the ONNX community.
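
A minimal sketch of the nibble packing commonly used to store two INT4 weights per byte in low-bit ONNX Runtime kernels; the layout and helper names here are illustrative assumptions, not the package's actual format:

    # Sketch: pack/unpack signed INT4 values two-per-uint8 byte, the storage style
    # commonly used for low-bit weights. Layout and names are illustrative only.
    import numpy as np

    def pack_int4(q):
        """Pack an even-length int8 array of values in [-8, 7] into uint8 nibbles."""
        u = (q.astype(np.int16) & 0xF).astype(np.uint8)      # two's-complement nibbles
        return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)   # low nibble first

    def unpack_int4(packed):
        """Inverse of pack_int4: recover the signed int4 values."""
        lo = (packed & 0xF).astype(np.int8)
        hi = ((packed >> 4) & 0xF).astype(np.int8)
        q = np.empty(packed.size * 2, dtype=np.int8)
        q[0::2], q[1::2] = lo, hi
        return np.where(q > 7, q - 16, q).astype(np.int8)    # restore sign

    q = np.array([-8, -1, 0, 3, 7, -4], dtype=np.int8)
    assert np.array_equal(unpack_int4(pack_int4(q)), q)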

Writing

2022
Method and apparatus for accelerating deep learning inference based on HW-aware sparsity pattern, US Patent

HW-aware sparsity patterns

patents.google.com/patent/WO20231…

2022
Methods and apparatus to perform artificial intelligence-based sparse computation based on hybrid pattern and dynamic encoding, US Patent
2021
Method and apparatus for optimizing inference of deep neural networks, US Patent

HW-aware cost model to predict performance for quantization recipes

patents.google.com/patent/WO20230…

Awards

2023
2023 Intel China Employee of the Year (EOY)

Work Experience

2020 — Now
AI Framework Engineer at Intel
Shanghai