
Conversation

@agi-scaler (Contributor) commented Sep 24, 2025

The original implementation in sglang can be found at https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/attention/triton_ops/decode_attention.py#L89-L94.

In this PR, the xai_temperature_len feature is added to the following functions (an illustrative sketch of the scaling follows the list):

  • ref_ragged_paged_attention_fused
  • ref_ragged_paged_attention
  • ragged_paged_attention
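
For context, xai_temperature_len applies a position-dependent rescaling of the attention logits once the query position passes the threshold. The log-ratio form below is only a minimal JAX sketch of that idea, not a verbatim port of the linked sglang kernel; the function and variable names are illustrative.

```python
import jax.numpy as jnp

def xai_temperature_scale(q_positions, xai_temperature_len):
    """Illustrative per-position logit scale (assumed log-ratio form).

    Positions at or below xai_temperature_len keep a scale of 1.0; beyond
    that, logits are rescaled by log(pos) / log(xai_temperature_len).
    """
    pos = q_positions.astype(jnp.float32)
    scale = jnp.log(pos) / jnp.log(float(xai_temperature_len))
    return jnp.where(pos > xai_temperature_len, scale, 1.0)

# Rough usage inside a reference kernel such as ref_ragged_paged_attention:
#   logits = jnp.einsum("qhd,khd->hqk", q, k) * sm_scale
#   logits *= xai_temperature_scale(q_pos, xai_temperature_len)[None, :, None]
```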

Correctness verification:

  • Decode attention with varying sequence lengths: the TPU vs GPU numerical difference is within 6e-2 (screenshot attached).
```bash
# Produce TPU qkvo:
python3 python/sgl_jax/test/test_flashattention_dump.py -k test_gqa_prefill_accuracy_page_size_1_temperature_dump

# Compare with GPU qkvo:
python3 test_sgl_baseline.py -k test_extend_attention_dump
```
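
To compare the dumps offline, a small script along these lines works (paths are hypothetical; the file names follow the pattern visible in the results below):

```python
import numpy as np

def max_abs_diff(tpu_path, gpu_path):
    # Load the qkvo dumps from both backends and report the worst-case deviation.
    tpu = np.load(tpu_path).astype(np.float32)
    gpu = np.load(gpu_path).astype(np.float32)
    return float(np.max(np.abs(tpu - gpu)))

# Hypothetical directories; the dump names match those in the logs below.
print(max_abs_diff("tpu_dumps/prefill_32_128_8_1_128_tempNone.npy",
                   "gpu_dumps/prefill_32_128_8_1_128_tempNone.npy"))
```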

Even without this PR, the result difference with temperature disabled can be as high as 3.4688:

Testing prefill_32_128_8_1_128_tempNone.npy
qkv_np.shape (128, 8, 128) (128, 8, 128) (1, 32, 128) False False False False False
Diff tensor([-0.0010,  0.0010,  0.0015,  ...,  0.0015,  0.0007,  0.0000],
       device='cuda:0', dtype=torch.bfloat16) tensor(**0.0063**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_3_20_tempNone.npy
qkv_np.shape (20, 8, 128) (20, 8, 128) (3, 32, 128) False False False False False
Diff tensor([-0.0273,  0.0386, -0.1289,  ...,  0.0684, -0.0016,  0.0391],
       device='cuda:0', dtype=torch.bfloat16) tensor(**1.0156**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_64_64_tempNone.npy
qkv_np.shape (64, 8, 128) (64, 8, 128) (64, 32, 128) False False False False False
Diff tensor([-0.3555,  1.1875, -0.3477,  ...,  0.0957,  0.0972,  0.1709],
       device='cuda:0', dtype=torch.bfloat16) tensor(**3.1719**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_20_20_tempNone.npy
qkv_np.shape (20, 8, 128) (20, 8, 128) (20, 32, 128) False False False False False
Diff tensor([-0.3848,  1.4375, -0.1680,  ...,  0.1387,  0.0261,  0.2852],
       device='cuda:0', dtype=torch.bfloat16) tensor(**3.4688**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_125_125_tempNone.npy
qkv_np.shape (125, 8, 128) (125, 8, 128) (125, 32, 128) False False False False False
Diff tensor([-0.5859,  1.2656, -0.4746,  ...,  0.0796,  0.0586,  0.1050],
       device='cuda:0', dtype=torch.bfloat16) tensor(**3.0469**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_123_522_tempNone.npy
qkv_np.shape (522, 8, 128) (522, 8, 128) (123, 32, 128) False False False False False
Diff tensor([-0.0039,  0.0703,  0.0415,  ...,  0.0181,  0.0278,  0.0210],
       device='cuda:0', dtype=torch.bfloat16) tensor(**0.4082**, device='cuda:0', dtype=torch.bfloat16)
Testing prefill_32_128_8_1_511_tempNone.npy
qkv_np.shape (511, 8, 128) (511, 8, 128) (1, 32, 128) False False False False False
Diff tensor([ 0.0000e+00,  0.0000e+00, -5.4932e-04,  ...,  4.8828e-04,
        -1.8311e-04,  6.1035e-05], device='cuda:0', dtype=torch.bfloat16) tensor(**0.0020**, device='cuda:0', dtype=torch.bfloat16)

When temperature is enabled, the numerical difference stays at the same level.

Performance benchmark: the performance difference compared to the main branch (without temperature) is negligible.


Raw benchmark data:
attn-benchmark.xlsx


@jimoosciuc requested a review from Iamleos on September 24, 2025 at 12:07.
@Iamleos (Collaborator) commented Sep 24, 2025

Please add some tests for page_size 1 with temperature attention enabled.

@Iamleos (Collaborator) commented Sep 24, 2025

Please attach flash attention kernel benchmark results; refer to benchmark/kernels/flash_attention/bench_flashattention.py.
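
For example (assuming the script runs with its default arguments):

```bash
python3 benchmark/kernels/flash_attention/bench_flashattention.py
```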

@agi-scaler (Contributor, Author) commented

@Iamleos Benchmark data has been added to the PR description.

@agi-scaler (Contributor, Author) commented

A page_size 1 test has also been added.

Iamleos previously approved these changes Sep 26, 2025

@Iamleos (Collaborator) left a comment

/LGTM

```python
q_batch = jnp.stack(q_heads, axis=0)

if xai_temperature_len is not None:
    import numpy as np
```
Collaborator comment on the diff above:

Why is numpy imported here?

@Iamleos (Collaborator) commented Sep 26, 2025

Merge blocked; please fix the lint error.
