Liangang is an AI framework engineer at Intel and is currently working on LLM inference optimization.
Implemented tensor parallelism from scratch and used a shared-memory-based all-reduce to speed it up.
This kernel enables flash decoding on top of the paged KV cache and has been adopted in the vLLM repository.
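The following is a minimal sketch (not the actual Intel kernel) of the shared-memory all-reduce idea behind this tensor-parallel implementation: each rank publishes its partial result into a slot of a shared buffer, waits at a barrier, and then reduces all slots locally. All names (`worker`, `WORLD_SIZE`, etc.) are illustrative.

```python
import numpy as np
from multiprocessing import Process, Barrier, shared_memory

WORLD_SIZE = 4
DIM = 8

def worker(rank, shm_name, barrier):
    shm = shared_memory.SharedMemory(name=shm_name)
    slots = np.ndarray((WORLD_SIZE, DIM), dtype=np.float32, buffer=shm.buf)

    # Each rank produces a partial result (e.g. a partial matmul output in
    # tensor parallelism); here it is faked with a constant vector.
    partial = np.full(DIM, rank + 1, dtype=np.float32)

    slots[rank] = partial          # 1. publish the partial result
    barrier.wait()                 # 2. wait until every rank has written
    reduced = slots.sum(axis=0)    # 3. reduce locally from shared memory
    print(f"rank {rank}: {reduced[:4]}")
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=WORLD_SIZE * DIM * 4)
    barrier = Barrier(WORLD_SIZE)
    procs = [Process(target=worker, args=(r, shm.name, barrier))
             for r in range(WORLD_SIZE)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```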
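For context, the sketch below shows the paged-KV-cache indexing that such a flash-decoding kernel consumes: a block table maps a sequence's logical token blocks to physical blocks in a pre-allocated cache. The shapes and names (`block_table`, `gather_keys`) are illustrative, not vLLM's actual API.

```python
import torch

num_blocks, block_size, num_heads, head_dim = 16, 4, 2, 8
key_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)

# One sequence of 10 cached tokens spread over 3 physical blocks.
block_table = torch.tensor([5, 0, 9])   # logical block i -> physical block
seq_len = 10

def gather_keys(key_cache, block_table, seq_len):
    """Reassemble the contiguous key tensor for one sequence."""
    blocks = key_cache[block_table]            # (3, block_size, heads, dim)
    keys = blocks.reshape(-1, num_heads, head_dim)
    return keys[:seq_len]                      # drop padding in the last block

print(gather_keys(key_cache, block_table, seq_len).shape)  # torch.Size([10, 2, 8])
```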
Flash Attention Kernel for Chunked Prefill/Prefix Cache/Speculative Decoding
For chunked prefill, prefix cache, and speculative decoding, part of the key/value token states is already cached and the query length of the step is greater than 1. This kernel enables the flash attention algorithm for that case, with an API similar to flash_attn_varlen_func in the flash_attn repo. With this kernel, chunked prefill can bring a 15% performance gain.
Indirect Access KV cache (IAKV) is a solution similar to PagedAttention that reduces the memory overhead caused by the KV cache. First, IAKV pre-allocates buffers (separate buffers for key and value) to store all key/value hidden states together with the beam index information; the data layout is shown in the left figure below (beam_width=4 in this case), and the key (value) token state at every timestep is stored in this pre-allocated buffer. Second, the beam index history shown in the right figure below determines which beam should be used at each timestep, and this information yields an offset into the KV cache buffer, so the reorder_cache and concat overheads are eliminated.
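Below is a minimal reference in plain PyTorch (not the optimized kernel) of the attention pattern this kernel covers: a new query chunk attends to all previously cached tokens plus the new tokens, with a causal mask applied only within the new chunk. Variable names and shapes are illustrative.

```python
import math
import torch

head_dim, cached_len, new_len = 32, 48, 16
q       = torch.randn(new_len, head_dim)      # new query chunk (length > 1)
k_cache = torch.randn(cached_len, head_dim)   # previously cached keys
v_cache = torch.randn(cached_len, head_dim)
k_new   = torch.randn(new_len, head_dim)
v_new   = torch.randn(new_len, head_dim)

k = torch.cat([k_cache, k_new])               # (cached + new, dim)
v = torch.cat([v_cache, v_new])

scores = q @ k.T / math.sqrt(head_dim)        # (new, cached + new)

# Causal mask: query i may see every cached token and new tokens j <= i.
pos_q = torch.arange(new_len).unsqueeze(1) + cached_len
pos_k = torch.arange(cached_len + new_len).unsqueeze(0)
scores = scores.masked_fill(pos_k > pos_q, float("-inf"))

out = torch.softmax(scores, dim=-1) @ v       # (new, dim)
print(out.shape)
```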
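A minimal sketch of the IAKV idea follows: the large key/value buffers are pre-allocated and never physically reordered; only a small integer offset table derived from the beam index history is updated after each beam-search step, and attention gathers each beam's history through it. All names (`decode_step`, `offsets`) are illustrative, not the actual implementation.

```python
import torch

max_len, beam_width, head_dim = 8, 4, 16
key_cache = torch.zeros(max_len, beam_width, head_dim)     # pre-allocated buffer
offsets = torch.zeros(max_len, beam_width, dtype=torch.long)

def decode_step(t, new_keys, parent_beams):
    """Write this step's key states and update the offset table instead of
    calling reorder_cache on the whole key_cache tensor."""
    # Reorder only the tiny offset table for all previous timesteps.
    offsets[:t] = offsets[:t, parent_beams]
    # The new token of beam b goes into slot (t, b).
    key_cache[t] = new_keys
    offsets[t] = torch.arange(beam_width)

# Simulate 3 decoding steps with some beam re-ranking in between.
decode_step(0, torch.randn(beam_width, head_dim), torch.arange(beam_width))
decode_step(1, torch.randn(beam_width, head_dim), torch.tensor([1, 1, 0, 3]))
decode_step(2, torch.randn(beam_width, head_dim), torch.tensor([2, 0, 0, 1]))

# Attention gathers each beam's logical history directly, with no concat.
t_idx = torch.arange(3).unsqueeze(1)                       # (steps, 1)
keys_per_beam = key_cache[t_idx, offsets[:3]]              # (steps, beam, dim)
print(keys_per_beam.shape)
```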
Support multiple LLM models, e.g., Llama/GPT-NeoX/Falcon/GPT-J 6B/CodeGen/ChatGLM...
More contributions can be found here
A Novel Scale-Out Training Solution for Deep Learning Recommender Systems