Support kv cache int8/int4 for triton backend #1644

yuguo-Jack · 2024-10-12T04:53:18Z

Motivation

Support int8/int4 KV cache quantization for the Triton backend.

Modifications

To enable the int8 KV cache (c8), pass:
--kv-cache-dtype int8
To enable the int4 KV cache (c4), pass:
--kv-cache-dtype int4 --kvint4-groupsize 32
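
For readers unfamiliar with grouped int4 storage, here is a minimal pure-Python sketch of what a group size of 32 could mean for one head vector. It is not the PR's Triton kernel: the symmetric [-7, 7] range, the low-nibble-first packing order, and the function names are assumptions made for illustration only.

```python
from typing import List, Tuple

GROUP_SIZE = 32  # mirrors the "--kvint4-groupsize 32" example; other sizes are possible


def quantize_int4_groups(
    values: List[float], group_size: int = GROUP_SIZE
) -> Tuple[bytes, List[float]]:
    """Symmetric per-group int4 quantization with two 4-bit codes per byte.

    Assumptions (not taken from the PR): codes are clamped to [-7, 7] and the
    even-indexed element goes into the low nibble of each byte.
    """
    if group_size % 2 != 0 or len(values) % group_size != 0:
        raise ValueError("length must be a multiple of an even group size")
    packed = bytearray()
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 7.0 if amax > 0 else 1.0  # one scale per group
        scales.append(scale)
        codes = [max(-7, min(7, round(v / scale))) for v in group]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            # Keep the two's-complement low 4 bits of each code.
            packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed), scales


def dequantize_int4_groups(
    packed: bytes, scales: List[float], group_size: int = GROUP_SIZE
) -> List[float]:
    """Inverse of quantize_int4_groups (up to rounding error)."""

    def to_signed(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    values: List[float] = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group_size]  # byte i holds elements 2i and 2i+1
        values.append(to_signed(byte & 0xF) * scale)
        values.append(to_signed(byte >> 4) * scale)
    return values
```

With this packing, a vector of `head_dim` values occupies `head_dim // 2` bytes plus one scale per group, so the rounding error of each element is at most half of its group's scale.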

merrymercy

Thanks for the contribution! The changes look good, and I have a few comments.

python/sglang/srt/mem_cache/memory_pool.py

test/srt/test_triton_int4kv_attention_kernels.py

merrymercy · 2024-10-16T07:54:21Z

Hi @yuguo-Jack, can you add an end-to-end accuracy test, similar to this one?

sglang/test/srt/test_triton_attn_backend.py

Line 31 in a5114b6

def test_mmlu(self):

liangan1 · 2024-10-16T05:13:54Z

python/sglang/srt/mem_cache/memory_pool.py

+            self.k_buffer = [
+                torch.empty(
+                    (size + 1, head_num, head_dim // 2), dtype=torch.int8, device="cuda"
+                )


Besides 'cuda', 'xpu' is also supported in the main branch. Could you use device=device, as the original code does? The same applies to the other places in this PR that hardcode the device.
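
A small sketch of the suggestion above, written without torch so that it runs anywhere: the device is threaded through as a parameter instead of being hardcoded. `KVBufferSpec` and `kv_buffer_spec` are hypothetical names, and the int8 shape is an assumption; only the int4 shape `(size + 1, head_num, head_dim // 2)` comes from the quoted diff.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KVBufferSpec:
    """Hypothetical description of one per-layer KV buffer allocation."""

    shape: Tuple[int, int, int]
    dtype_name: str
    device: str


def kv_buffer_spec(
    size: int, head_num: int, head_dim: int, kv_cache_dtype: str, device: str
) -> KVBufferSpec:
    """Build the allocation arguments without hardcoding the device."""
    if kv_cache_dtype == "int4":
        if head_dim % 2 != 0:
            raise ValueError("int4 packing needs an even head_dim")
        last_dim = head_dim // 2  # two 4-bit codes per int8 element, as in the diff
    elif kv_cache_dtype == "int8":
        last_dim = head_dim  # assumption: one int8 element per value
    else:
        raise ValueError(f"unsupported kv_cache_dtype: {kv_cache_dtype}")
    # The caller passes `device` through, e.g. "cuda" or "xpu".
    return KVBufferSpec((size + 1, head_num, last_dim), "int8", device)


# A caller with torch available could then allocate with something like:
#   torch.empty(spec.shape, dtype=torch.int8, device=spec.device)
```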

python/sglang/srt/mem_cache/memory_pool.py

merrymercy · 2024-10-23T06:28:49Z

@yuguo-Jack Can you follow up on this? This is a high-priority item, and we would like to merge it as soon as possible once an accuracy unit test is added.

HaiShaw · 2024-10-28T09:35:24Z

python/sglang/srt/server_args.py

@@ -273,6 +281,7 @@ def add_cli_args(parser: argparse.ArgumentParser):
                "gptq_marlin",
                "awq_marlin",
                "bitsandbytes",
+                "compressed-tensors",


Does this feature only come with compressed-tensors?
Can we decouple this a bit and add torchao's INT4/INT8 support too?

The a8w8 linear layer in vLLM can be reused.

merrymercy · 2024-10-30T01:04:09Z

@yuguo-Jack Can you resolve the conflicts and add some correctness tests?

Kernel-level unit tests. Make sure to compare the kernel against a reference implementation, which can be a PyTorch implementation or a Triton implementation in fp16.
End-to-end unit tests. Test the MMLU score.

sglang/test/srt/test_triton_attention_backend.py

Line 31 in 5e00dde

def test_mmlu(self):
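
The kernel-level check requested above can be sketched in plain Python: run single-query decode attention once on full-precision keys and values, once on int8-quantized copies, and assert that the outputs stay close. This is only an illustration of the comparison pattern. The per-token symmetric int8 scheme, the function names, and the 0.05 tolerance are assumptions, and a real test would call the Triton kernel instead of the second `decode_attention` call.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def quantize_int8(vec: Vector) -> Tuple[List[int], float]:
    """Symmetric per-vector int8 quantization (assumed scheme)."""
    amax = max(abs(v) for v in vec)
    scale = amax / 127.0 if amax > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def dequantize_int8(codes: List[int], scale: float) -> Vector:
    return [c * scale for c in codes]


def decode_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Reference softmax attention for a single query vector."""
    sm_scale = 1.0 / math.sqrt(len(q))
    scores = [sm_scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def max_abs_error(seq_len: int = 8, head_dim: int = 16, seed: int = 0) -> float:
    """Largest elementwise gap between the reference and quantized-KV outputs."""
    rng = random.Random(seed)
    q = [rng.uniform(-1, 1) for _ in range(head_dim)]
    keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]
    values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]

    reference = decode_attention(q, keys, values)
    keys_q = [dequantize_int8(*quantize_int8(k)) for k in keys]
    values_q = [dequantize_int8(*quantize_int8(v)) for v in values]
    quantized = decode_attention(q, keys_q, values_q)
    return max(abs(a - b) for a, b in zip(reference, quantized))


if __name__ == "__main__":
    assert max_abs_error() < 0.05  # tolerance is an illustrative choice
```

An int4 variant would follow the same pattern with a looser tolerance, since 4-bit codes carry a larger rounding error.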

merrymercy · 2024-11-08T08:25:45Z

Let me know when the tests are added. I will review it again.

yuguo-Jack added 4 commits October 12, 2024 11:47

surpport kv cache int8/int4 for triton backend

c885e1f

fix code style

6bb1035

Merge branch 'main' into fork

f7ccfe1

fix

9ef8a84

merrymercy requested changes Oct 12, 2024

merrymercy added the high priority label Oct 12, 2024

yuguo-Jack added 5 commits October 14, 2024 10:24

fix mem_pool

e5e46c2

Merge branch 'main' of https://github.com/sgl-project/sglang into fork

3cbee66

fix code style

42c27ef

Merge branch 'main' into fork

fe6d8c8

Merge branch 'main' into fork

8418289

liangan1 reviewed Oct 16, 2024

zhyncs requested review from Ying1123, zhyncs, hnyls2002, ispobock and ByronHsu as code owners October 24, 2024 22:10

HaiShaw suggested changes Oct 28, 2024

merrymercy force-pushed the main branch from 55311eb to 2134f08 November 2, 2024 01:26

yuguo-Jack added 3 commits November 6, 2024 16:02

Merge branch 'main' of https://github.com/sgl-project/sglang into fork

6b9067b

Merge branch 'fork' of https://github.com/yuguo-Jack/sglang into fork

af88ae6

Merge branch 'main' of https://github.com/sgl-project/sglang into fork

8cf4a33
