
Welcome to Liangang's GitHub

Liangang is an AI framework engineer at Intel, currently working on LLM inference optimization.

My Contributions on GitHub

Tensor Parallel for LLM

Implemented tensor parallelism from scratch and used a shared-memory-based all-reduce to speed it up.
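A minimal single-process sketch of the idea, with illustrative shapes rather than the actual implementation: each "rank" holds a shard of a row-parallel linear layer and computes a partial output, and the all-reduce (modeled here as a plain sum, which a shared-memory all-reduce performs across processes in the real code) combines the partials.

```python
# Single-process model of tensor parallelism for one row-parallel linear layer.
# The "all-reduce" is a sum over per-rank partial results; in practice this sum
# runs across processes (e.g. via a shared-memory all-reduce). Shapes and names
# are assumptions for illustration only.
import torch

world_size = 2                      # number of tensor-parallel ranks
x = torch.randn(4, 8)               # [batch, hidden]
w = torch.randn(8, 8)               # full weight of the linear layer

# Each rank holds a slice of the weight along the reduction (input) dimension.
w_shards = torch.chunk(w, world_size, dim=0)
x_shards = torch.chunk(x, world_size, dim=1)

# Every rank computes a partial output with its local shard ...
partials = [x_s @ w_s for x_s, w_s in zip(x_shards, w_shards)]

# ... and an all-reduce (here: a plain sum) produces the full output.
y_tp = sum(partials)

assert torch.allclose(y_tp, x @ w, atol=1e-5)
```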

PagedAttention

This kernel enables flash decoding on top of the paged KV cache, and it has been adopted in the vLLM repository.
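The sketch below shows the access pattern the kernel fuses: a per-sequence block table maps logical token positions to fixed-size physical blocks, and a single-token decoding query attends to the gathered context. The names and shapes here are assumptions, not the kernel's actual interface.

```python
# Illustrative attention over a paged KV cache (naive reference, not the kernel).
import torch

block_size, num_blocks, head_dim = 4, 8, 16
k_cache = torch.randn(num_blocks, block_size, head_dim)
v_cache = torch.randn(num_blocks, block_size, head_dim)

block_table = torch.tensor([5, 2, 7])   # physical blocks owned by one sequence
context_len = 10                        # tokens currently stored for it

# Gather the logical KV sequence from the physical blocks.
k = k_cache[block_table].reshape(-1, head_dim)[:context_len]
v = v_cache[block_table].reshape(-1, head_dim)[:context_len]

# Single-token decoding step: one query attends to the gathered context.
q = torch.randn(1, head_dim)
scores = (q @ k.T) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 16])
```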

Flash Attention Kernel for Chunked Prefill/Prefix Cache/Speculative Decoding

For chunked prefill, prefix caching, and speculative decoding, part of the key/value token states is already cached and the query length of the current step is greater than 1. This kernel enables the flash attention algorithm for that case; the API is similar to flash_attn_varlen in the flash_attn repo. With this kernel, chunked prefill can bring a 15% performance gain.
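A naive PyTorch reference for the math this kernel fuses, under assumed shapes: the new query chunk (length > 1) attends to the cached keys/values plus the newly added ones, with the causal mask offset by the number of cached tokens.

```python
# Reference computation for attention with a partially cached KV and query length > 1.
import torch

head_dim, cached_len, new_len = 16, 12, 4

k_cached = torch.randn(cached_len, head_dim)
v_cached = torch.randn(cached_len, head_dim)
q_new = torch.randn(new_len, head_dim)        # query length > 1
k_new = torch.randn(new_len, head_dim)
v_new = torch.randn(new_len, head_dim)

k = torch.cat([k_cached, k_new])              # [cached_len + new_len, head_dim]
v = torch.cat([v_cached, v_new])

# Query i of the new chunk sits at global position cached_len + i, so it may
# attend to every key whose position is <= cached_len + i (causal mask offset
# by the cache length).
q_pos = cached_len + torch.arange(new_len)[:, None]
k_pos = torch.arange(cached_len + new_len)[None, :]
mask = k_pos <= q_pos

scores = (q_new @ k.T) / head_dim ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v       # [new_len, head_dim]
```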

Indirect Access KV Cache

Indirect Access KV Cache (IAKV) is a solution similar to PagedAttention, used to reduce the memory overhead caused by the KV cache. First, IAKV pre-allocates buffers (separate buffers for key and value) to store all key/value hidden states and the beam index information; the data layout is shown in the left figure below (beam_width=4 in this case), and the key (value) token state at every timestep is stored in this pre-allocated buffer. Second, the beam index history, shown in the right figure below, determines which beam each timestep should use; this information yields an offset into the KV cache buffer, which eliminates the reorder_cache and concat overheads.

Figures: data layout of the KV cache buffer (left) and the beam index history (right).
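The sketch below illustrates the indirect-access idea with assumed shapes and names: key states are written once into a pre-allocated [max_len, beam_width, head_dim] buffer, and the beam index history is traced backwards to compute gather offsets, so no reorder_cache or concat is needed.

```python
# Sketch of indirect KV cache access driven by the beam index history.
import torch

max_len, beam_width, head_dim, cur_len = 8, 4, 16, 5
key_buffer = torch.randn(max_len, beam_width, head_dim)   # pre-allocated once

# beam_parents[t, b] = which beam at step t-1 produced beam b at step t.
beam_parents = torch.randint(0, beam_width, (cur_len, beam_width))

# Trace the history backwards: for each current beam, find the buffer slot that
# holds its token at every earlier timestep.
offsets = torch.empty(cur_len, beam_width, dtype=torch.long)
beam = torch.arange(beam_width)
for t in range(cur_len - 1, -1, -1):
    offsets[t] = beam
    beam = beam_parents[t, beam]

# Gather each beam's key sequence directly from the shared buffer,
# with no reordering or concatenation of the cache itself.
keys = key_buffer[torch.arange(cur_len)[:, None], offsets]  # [cur_len, beam_width, head_dim]
```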

Rotary Position Embedding

Supports multiple LLM models, e.g., LLaMA/GPT-NeoX/Falcon/GPT-J 6B/CodeGen/ChatGLM...
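A minimal sketch of rotary position embedding for one attention head, following the common rotate-half convention; the exact pairing layout and base differ between model families, so treat these details as assumptions.

```python
# Rotary position embedding (RoPE) applied to a [seq_len, head_dim] tensor.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    seq_len, head_dim = x.shape                                        # head_dim must be even
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]          # [seq, 1]
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = pos * inv_freq                                            # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(6, 16)          # queries (or keys) for one head
q_rot = apply_rope(q)
print(q_rot.shape)              # torch.Size([6, 16])
```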

More contributions can be found here.

My Publications and Talks

The AI Software Ecosystem on Xeon Processors (基于至强处理器的AI软件生态)

A Novel Scale-Out Training Solution for Deep Learning Recommender Systems
