Add paged attention highperf JIT example #193

Open
MirkoDeVita98 wants to merge 11 commits into huawei-csl:main from MirkoDeVita98:paged_attention
MirkoDeVita98 (Collaborator) commented Jun 26, 2026

paged_attention_highperf_jit b1_h32_kv8_s128_bs128_fp16: 163.859 us/iter, 0.012948 TFLOPS logical, 0.310755 TFLOPS normalized, 0.003300 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s512_bs128_fp16: 591.878 us/iter, 0.014339 TFLOPS logical, 0.344132 TFLOPS normalized, 0.003571 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s4096_bs128_fp16: 84.563 us/iter, 0.802897 TFLOPS logical, 19.269532 TFLOPS normalized, 0.198595 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s8192_bs128_fp16: 83.516 us/iter, 1.625915 TFLOPS logical, 39.021950 TFLOPS normalized, 0.401970 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s16384_bs128_fp16: 84.572 us/iter, 3.211226 TFLOPS logical, 77.069432 TFLOPS normalized, 0.793708 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s32768_bs128_fp16: 124.251 us/iter, 4.371499 TFLOPS logical, 104.915984 TFLOPS normalized, 1.080356 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s65536_bs128_fp16: 294.831 us/iter, 3.684570 TFLOPS logical, 88.429676 TFLOPS normalized, 0.910535 TB/s, block_dim=24
paged_attention_highperf_jit b1_h32_kv8_s131072_bs128_fp16: 565.181 us/iter, 3.844167 TFLOPS logical, 92.260006 TFLOPS normalized, 0.949946 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s128_bs128_fp16: 319.364 us/iter, 0.013287 TFLOPS logical, 0.318884 TFLOPS normalized, 0.003386 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s512_bs128_fp16: 1236.075 us/iter, 0.013732 TFLOPS logical, 0.329566 TFLOPS normalized, 0.003420 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s4096_bs128_fp16: 86.001 us/iter, 1.578937 TFLOPS logical, 37.894486 TFLOPS normalized, 0.390546 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s8192_bs128_fp16: 93.225 us/iter, 2.913185 TFLOPS logical, 69.916430 TFLOPS normalized, 0.720218 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s16384_bs128_fp16: 136.297 us/iter, 3.985131 TFLOPS logical, 95.643153 TFLOPS normalized, 0.984991 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s32768_bs128_fp16: 311.692 us/iter, 3.485246 TFLOPS logical, 83.645895 TFLOPS normalized, 0.861331 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s65536_bs128_fp16: 579.997 us/iter, 3.745965 TFLOPS logical, 89.903163 TFLOPS normalized, 0.925708 TB/s, block_dim=24
paged_attention_highperf_jit b2_h32_kv8_s131072_bs128_fp16: 1113.284 us/iter, 3.903135 TFLOPS logical, 93.675231 TFLOPS normalized, 0.964518 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s128_bs128_fp16: 524.086 us/iter, 0.016193 TFLOPS logical, 0.388638 TFLOPS normalized, 0.004127 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s512_bs128_fp16: 2120.588 us/iter, 0.016008 TFLOPS logical, 0.384204 TFLOPS normalized, 0.003987 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s4096_bs128_fp16: 94.590 us/iter, 2.871150 TFLOPS logical, 68.907603 TFLOPS normalized, 0.710172 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s8192_bs128_fp16: 147.276 us/iter, 3.688056 TFLOPS logical, 88.513340 TFLOPS normalized, 0.911787 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s16384_bs128_fp16: 328.326 us/iter, 3.308680 TFLOPS logical, 79.408326 TFLOPS normalized, 0.817795 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s32768_bs128_fp16: 591.361 us/iter, 3.673980 TFLOPS logical, 88.175520 TFLOPS normalized, 0.907974 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s65536_bs128_fp16: 1130.213 us/iter, 3.844672 TFLOPS logical, 92.272123 TFLOPS normalized, 0.950100 TB/s, block_dim=24
paged_attention_highperf_jit b4_h32_kv8_s131072_bs128_fp16: 2205.480 us/iter, 3.940457 TFLOPS logical, 94.570970 TFLOPS normalized, 0.973741 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s128_bs128_fp16: 1074.693 us/iter, 0.015794 TFLOPS logical, 0.379047 TFLOPS normalized, 0.004025 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s512_bs128_fp16: 4260.087 us/iter, 0.015937 TFLOPS logical, 0.382498 TFLOPS normalized, 0.003969 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s4096_bs128_fp16: 144.032 us/iter, 3.771130 TFLOPS logical, 90.507114 TFLOPS normalized, 0.932780 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s8192_bs128_fp16: 359.382 us/iter, 3.022753 TFLOPS logical, 72.546072 TFLOPS normalized, 0.747306 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s16384_bs128_fp16: 629.722 us/iter, 3.450174 TFLOPS logical, 82.804171 TFLOPS normalized, 0.852767 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s32768_bs128_fp16: 1167.075 us/iter, 3.723239 TFLOPS logical, 89.357735 TFLOPS normalized, 0.920148 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s65536_bs128_fp16: 2249.407 us/iter, 3.863506 TFLOPS logical, 92.724154 TFLOPS normalized, 0.954754 TB/s, block_dim=24
paged_attention_highperf_jit b8_h32_kv8_s131072_bs128_fp16: 4386.033 us/iter, 3.962851 TFLOPS logical, 95.108417 TFLOPS normalized, 0.979275 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s128_bs128_fp16: 5270.119 us/iter, 0.012883 TFLOPS logical, 0.309184 TFLOPS normalized, 0.003283 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s512_bs128_fp16: 16902.542 us/iter, 0.016067 TFLOPS logical, 0.385617 TFLOPS normalized, 0.004001 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s4096_bs128_fp16: 629.548 us/iter, 3.451125 TFLOPS logical, 82.826997 TFLOPS normalized, 0.853628 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s8192_bs128_fp16: 1291.156 us/iter, 3.365430 TFLOPS logical, 80.770324 TFLOPS normalized, 0.832025 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s16384_bs128_fp16: 2342.142 us/iter, 3.710534 TFLOPS logical, 89.052815 TFLOPS normalized, 0.917120 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s32768_bs128_fp16: 4591.011 us/iter, 3.785918 TFLOPS logical, 90.862038 TFLOPS normalized, 0.935638 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s65536_bs128_fp16: 8903.901 us/iter, 3.904175 TFLOPS logical, 93.700201 TFLOPS normalized, 0.964805 TB/s, block_dim=24
paged_attention_highperf_jit b32_h32_kv8_s131072_bs128_fp16: 17487.606 us/iter, 3.975660 TFLOPS logical, 95.415846 TFLOPS normalized, 0.982440 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s128_bs128_fp16: 8448.134 us/iter, 0.016073 TFLOPS logical, 0.385751 TFLOPS normalized, 0.004096 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s512_bs128_fp16: 36634.255 us/iter, 0.014827 TFLOPS logical, 0.355836 TFLOPS normalized, 0.003692 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s4096_bs128_fp16: 1210.145 us/iter, 3.590723 TFLOPS logical, 86.177353 TFLOPS normalized, 0.888157 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s8192_bs128_fp16: 2568.584 us/iter, 3.383418 TFLOPS logical, 81.202037 TFLOPS normalized, 0.836472 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s16384_bs128_fp16: 4562.386 us/iter, 3.809671 TFLOPS logical, 91.432105 TFLOPS normalized, 0.941623 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s32768_bs128_fp16: 9184.345 us/iter, 3.784961 TFLOPS logical, 90.839063 TFLOPS normalized, 0.935401 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s65536_bs128_fp16: 17788.347 us/iter, 3.908445 TFLOPS logical, 93.802683 TFLOPS normalized, 0.965860 TB/s, block_dim=24
paged_attention_highperf_jit b64_h32_kv8_s131072_bs128_fp16: 34976.431 us/iter, 3.975522 TFLOPS logical, 95.412523 TFLOPS normalized, 0.982406 TB/s, block_dim=24
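
For comparing these runs, a small parser can turn each log line into typed fields. The following is a minimal sketch, not part of the PR: the field names (`batch`, `heads`, `kv_heads`, `seq`, `block_size`) are assumed readings of the `b…_h…_kv…_s…_bs…` tag, and `parse_line` is a hypothetical helper.

```python
import re

# Pattern for one benchmark line in the format shown in the log above.
# Field names are assumptions about what each part of the case tag means.
LINE_RE = re.compile(
    r"^(?P<kernel>\S+) "
    r"b(?P<batch>\d+)_h(?P<heads>\d+)_kv(?P<kv_heads>\d+)"
    r"_s(?P<seq>\d+)_bs(?P<block_size>\d+)_(?P<dtype>\w+): "
    r"(?P<us>[\d.]+) us/iter, "
    r"(?P<tflops_logical>[\d.]+) TFLOPS logical, "
    r"(?P<tflops_normalized>[\d.]+) TFLOPS normalized, "
    r"(?P<tbps>[\d.]+) TB/s, "
    r"block_dim=(?P<block_dim>\d+)$"
)

INT_FIELDS = ("batch", "heads", "kv_heads", "seq", "block_size", "block_dim")
FLOAT_FIELDS = ("us", "tflops_logical", "tflops_normalized", "tbps")


def parse_line(line):
    """Return a dict of typed fields for one benchmark line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    for key in INT_FIELDS:
        row[key] = int(row[key])
    for key in FLOAT_FIELDS:
        row[key] = float(row[key])
    return row


if __name__ == "__main__":
    sample = (
        "paged_attention_highperf_jit b1_h32_kv8_s4096_bs128_fp16: "
        "84.563 us/iter, 0.802897 TFLOPS logical, 19.269532 TFLOPS normalized, "
        "0.198595 TB/s, block_dim=24"
    )
    row = parse_line(sample)
    # Ratio of the two TFLOPS columns for this row.
    print(row["seq"], row["us"], row["tflops_normalized"] / row["tflops_logical"])
```

In the rows above, the ratio of normalized to logical TFLOPS comes out at roughly 24, which matches the reported `block_dim`. That is an observation from the numbers, not a definition stated in the PR.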

Comment thread examples/jit_cpp/paged_attention_highperf/pa_kernel_impl.hpp Outdated