
Projects
Developed the SYCL/DPC++ backend for llama.cpp, ported from the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation (see the port sketch after this entry).
Co-worked with the community maintainers and responded to Intel-related issues.
Published blog: Run LLMs on Intel GPUs Using llama.cpp <medium.com/intel-analytic…>
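For illustration only: a minimal sketch of the CUDA-to-SYCL mapping such a port involves. This is a toy SAXPY kernel, not llama.cpp backend code; the CUDA launch in the comment is assumed for comparison.

    #include <sycl/sycl.hpp>

    // CUDA equivalent: saxpy<<<grid, block>>>(n, a, x, y);
    // with i = blockIdx.x * blockDim.x + threadIdx.x.
    // In SYCL the same per-element work maps onto a parallel_for over a
    // range, which is the core of a CUDA-to-SYCL port.
    int main() {
      sycl::queue q{sycl::gpu_selector_v};
      constexpr size_t n = 1 << 20;
      float a = 2.0f;
      float *x = sycl::malloc_shared<float>(n, q);
      float *y = sycl::malloc_shared<float>(n, q);
      for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      // One work-item per element, like one CUDA thread per element.
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        y[i] = a * x[i] + y[i];
      }).wait();

      sycl::free(x, q);
      sycl::free(y, q);
    }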
Extended Hugging Face Transformers APIs for Transformer-based models to enable collaboration with the ecosystem.
Hand-wrote highly optimized x86 assembly kernels for Intel hardware, targeting advanced compression algorithms, especially for LLMs (illustrative sketch below).
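Illustrative sketch only, written with AVX2/FMA intrinsics rather than raw assembly for readability; the function name dot_s8_f32 is hypothetical. It shows the fused dequantize-and-accumulate pattern such compression kernels implement:

    #include <immintrin.h>  // AVX2 + FMA
    #include <cstdint>

    // Dot product of fp32 activations with int8 weights, dequantized on the
    // fly with a single scale. Hand-tuned assembly would also pipeline loads
    // and use multiple accumulators; this shows only the core pattern.
    float dot_s8_f32(const float *act, const int8_t *w, float scale, int n) {
      __m256 acc = _mm256_setzero_ps();
      int i = 0;
      for (; i + 8 <= n; i += 8) {
        __m128i w8  = _mm_loadl_epi64((const __m128i *)(w + i)); // 8 x int8
        __m256i w32 = _mm256_cvtepi8_epi32(w8);                  // widen to int32
        __m256  wf  = _mm256_mul_ps(_mm256_cvtepi32_ps(w32),
                                    _mm256_set1_ps(scale));      // dequantize
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(act + i), wf, acc);
      }
      // Horizontal sum of the 8 accumulator lanes.
      __m128 lo = _mm256_castps256_ps128(acc);
      __m128 hi = _mm256_extractf128_ps(acc, 1);
      __m128 s  = _mm_add_ps(lo, hi);
      s = _mm_hadd_ps(s, s);
      s = _mm_hadd_ps(s, s);
      float r = _mm_cvtss_f32(s);
      for (; i < n; ++i) r += act[i] * (float)w[i] * scale; // scalar tail
      return r;
    }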
Developed GPU kernels for Intel client GPUs using ESIMD ("Explicit SIMD"), the SYCL extension for efficient low-level programming (see the kernel sketch after the blog links below).
Worked on weight-only-quantization (WOQ) optimization for Intel client GPUs, enabled on Windows and Linux, achieving geomean >2x performance gains compared with the plain FP16 implementation.
Blogs:
Llama 2 support on MTL <intel.com/content/www/us…>
Llama 3 day-0 support on MTL iGPU <intel.com/content/www/us…>
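A minimal ESIMD sketch, assuming shared-USM buffers; it shows the explicit-SIMD style (simd vectors with block loads/stores) that such kernels use, not the production WOQ code:

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/esimd.hpp>

    namespace esimd = sycl::ext::intel::esimd;

    int main() {
      constexpr int VL = 16;       // SIMD vector length per work-item
      constexpr size_t N = 1024;   // assume N % VL == 0
      sycl::queue q{sycl::gpu_selector_v};
      float *a = sycl::malloc_shared<float>(N, q);
      float *b = sycl::malloc_shared<float>(N, q);
      float *c = sycl::malloc_shared<float>(N, q);
      for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // Each work-item explicitly owns VL contiguous elements, unlike the
      // one-element-per-work-item style of plain SYCL.
      q.parallel_for(sycl::range<1>{N / VL},
                     [=](sycl::id<1> i) SYCL_ESIMD_KERNEL {
        size_t off = i[0] * VL;
        esimd::simd<float, VL> va = esimd::block_load<float, VL>(a + off);
        esimd::simd<float, VL> vb = esimd::block_load<float, VL>(b + off);
        esimd::block_store(c + off, va + vb);
      }).wait();

      sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    }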
Side Projects
Python package for SOTA low-bit LLM quantization; worked on the ONNX Runtime backend, which was eventually integrated by the ONNX community (quantization sketch below).
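A minimal sketch of round-to-nearest (RTN) symmetric int4 group quantization, the basic building block behind weight-only and other low-bit schemes; the names quantize_group_int4 and QuantizedGroup are hypothetical, not the package's API:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // One group of weights sharing a single scale; int4 values are stored in
    // int8 slots here for clarity (real code packs two per byte).
    struct QuantizedGroup {
      std::vector<int8_t> q;
      float scale;
    };

    // q = clamp(round(w / scale), -8, 7), with scale = max|w| / 7.
    QuantizedGroup quantize_group_int4(const float *w, int group_size) {
      float amax = 0.f;
      for (int i = 0; i < group_size; ++i)
        amax = std::max(amax, std::fabs(w[i]));
      float scale = amax > 0.f ? amax / 7.f : 1.f;
      QuantizedGroup g{std::vector<int8_t>(group_size), scale};
      for (int i = 0; i < group_size; ++i) {
        int v = (int)std::lround(w[i] / scale);
        g.q[i] = (int8_t)std::clamp(v, -8, 7);  // int4 range [-8, 7]
      }
      return g;
    }

    // At compute time the kernel dequantizes on the fly: w ≈ q * scale.
    inline float dequantize(const QuantizedGroup &g, int i) {
      return g.q[i] * g.scale;
    }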
Writing
HW-aware sparsity patterns
HW-aware cost model to predict the performance of quantization recipes