Skip to content
1 change: 1 addition & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `models/glm52/ep8-deepep-moe.md` | PR4: GLM-baked DeepEP v2 shim instantiation replaces PR3's local scatter/combine; loader places experts into their packed layout at H2D time (post-load repack cannot fit HBM); rank 0 runs the full 78-layer spine + bs=1 greedy coordinator, ranks 1..7 replay the 75 MoE collectives per step. Gates: EP8 layer-6 oracle 62/64 (same outliers as EP1), full-model e2e generation. |
| `models/glm52/ep1-forward.md` | PR3 built + all gates green on jz-38 H200 (2026-07-03): MoE/dense/bookend bricks (cherry-picked from the PP8 branch, re-gated via the #499 harness) + decoder-layer composition with cross-layer top-k sharing. MoE chain shaped to the DeepEP v2 elastic shim contract, Grouped + GEMV expert paths behind one signature; graph capturability as the bar. Gates: bookend exact, layer-0 dense 64/64, layer-6 MoE 62/64 both paths (measured router near-ties, bounded allowance). |
| `models/glm52/bs1-decode-serial-overhead.md` | PR5a perf pass on the PR4 bring-up path: 101–103 → 46–50 ms/step (~2.2×) at bs=1, output byte-identical, all gates green. Fixes: quant/SiLU/GEMM rows bounded by the coordinator token count (device trap on violation), persistent MoE workspace (was ~11.6k allocs/step), FlashMLA sched metadata hoisted to build. Remaining gap = launch overhead → PR5c graph target. |
| `models/glm52/kernel-perf-decode.md` | Measured single-layer MLA decode kernel ladder (`glm52_kernel_bench`, H100 bs=1, parity-verified): as-is ~288µs → arena 218 (sync `cudaMalloc` serializes the stream) → tile-schedule hoist 190 → CUDA-graph 166µs = **−42%/layer**. Measured non-result: `num_sm_parts` tuning (isolated flashmla 1.68× at 16) does not carry through the graphed forward. Next lever: sparse partial+combine fusion, deprioritized by measurement. |

## models / deepseek-v4

Expand Down
39 changes: 39 additions & 0 deletions docs/lessons/megakernels-for-decode-latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Megakernels for decode latency — what transfers to openinfer

**TL;DR**: Megakernels (one persistent kernel running the whole forward via an on-GPU interpreter + counter sync + smem paging) buy their wins from three separable things — launch elimination, inter-kernel pipeline bubbles, producer→consumer locality. CUDA Graphs only buy the first (~2.1→1.3µs/launch), leaving a measured ~50%→78% HBM-bandwidth gap on bs=1 decode. For openinfer the value ladder is: (0) arenas + graphs where they don't exist yet (GLM5.2 — dwarfs everything else), (1) ThunderMLA-style fusion of attention partial+reduction pairs (20–35% precedent, ~250 LoC), (2) per-layer persistent "glue" kernels on the 7-instruction template, (3) whole-model fusion — explicitly not worth its maintenance cost today.

## The four systems and their mechanisms

- **Llama-1B megakernel** ([no-bubbles, Hazy Research](https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles)): whole forward = one persistent kernel; per-SM instruction sequences scheduled host-side ahead of time and reused across passes; only **7 fused instruction types** (norm+QKV+RoPE, attention, O-proj+residual, norm+up-gate+SiLU, down-proj+residual, LM head). Sync = global counter array, dependents spin; the down-proj splits its input into 4 counter-guarded segments so consumers start early. Shared memory carved into 13×16KiB explicitly requested/released pages — a released page hands straight to the next instruction's weight prefetch. Results: <1ms/token Llama-1B bs=1 on H100, 2.5× vLLM, **78% of HBM bandwidth vs ~50% for kernel-per-op engines**. Post-fusion cost breakdown (600µs on B200): 250µs activation movement+sync, 200µs norm+matvec, 40µs warp sync — after launches die, *sync and activation movement dominate*.
- **ThunderMLA** ([Hazy](https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla)): fuses FlashMLA's two kernels (split-KV partials + reduction) into one persistent kernel driven by an instruction tensor; static host schedulers (heap-based, plus makespan-backwards for ~10% extra). **20–35% over FlashMLA**; gains shrink as per-launch work grows (+36% at B64/512, +7.6% at B132/4K). Public: ThunderKittens repo, `mla` branch, ~250 lines of device code. **Dense MLA only** — no paged KV, no sparse top-k, no fp8 KV.
- **Mirage MPK** ([Zhihao Jia](https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17)): compiler from a *PyTorch graph* to one megakernel; worker SMs + up to 4 scheduler SMs with event counters; 1–2µs task transitions. Static task graphs only (no MoE dynamism), tightly coupled to the Mirage stack — not embeddable from a Rust FFI engine today; useful as a design reference for the worker/scheduler split.
- **TP-Llama-70B** ([Hazy](https://hazyresearch.stanford.edu/blog/2025-09-28-tp-llama-main)): the interpreter model across 8×H100 with comm fused *inside* the kernel (async peer-memory stores, no NCCL); dynamic global work queue (+14.2% at bs 8192). Relevant later for EP/TP paths; the authors call the code unsupported and "sensitive to … being looked at the wrong way".

## What this means per openinfer model line

- **GLM5.2 (bring-up)**: the single-layer MLA decode currently issues ~18 launches and ~20 `cudaMalloc`s per layer per token (78 layers). Step 0 is an allocation arena + CUDA Graph capture — boring, and worth more than any fusion. Step 1 is the ThunderMLA pattern applied to our *sparse* FlashMLA decode: port the interpreter/instruction-tensor structure onto the vendored sparse kernel (drop-in is not possible — ThunderMLA is dense); our fixed top-k=2048 workload sits exactly in the small/medium-work regime where their fusion gains were largest. Step 2, if step 1 pays: one persistent kernel per layer fusing the glue (q/kv projections, norm, RoPE, cache-pack) on the 7-instruction template.
- **Qwen3-4B (already graphed)**: graphs killed launch overhead; the remaining megakernel upside is bubbles + locality, ceiling plausibly 1.3–1.5× if decode is bandwidth-bound (50%→78% measured elsewhere). Real but expensive; revisit only after cheaper wins (fused step-tail, sampling cost #483) are exhausted.
- **Scheduling flavor**: at bs=1, static host-side per-SM schedules (1B-style) are enough and far simpler than dynamic queues; cache the schedule like ThunderMLA does (1–2ms to build, amortized).

## Pitfalls the authors state outright

Instruction-set design and counter-sync debugging carry "tremendous complexity"; the TP kernel is explicitly unmaintained. None of the megakernel posts benchmark against CUDA Graphs except no-bubbles — treat vLLM/SGLang comparisons in MPK/TP posts as launch-overhead-inclusive. Any adoption here should ship as one model line's experiment with an A/B against that line's graphed baseline, never as shared-layer infrastructure first.

## Measured GLM5.2 single-layer baseline (H100, bs=1, `glm52_kernel_bench`)

Synthetic weights, parity-verified (the scratch forward is bitwise-identical to the alloc-heavy forward before timing). Two context lengths, iters=64:

| | as-is gpu / wall | scratch gpu / wall | alloc bill |
|---|---|---|---|
| ctx 512 | 286.5 / 296.8 µs | 218.2 / 228.3 µs | **68.5 µs/layer** |
| ctx 2048 | 291.1 / 301.6 µs | 219.6 / 230.0 µs | **71.6 µs/layer** |

Per-stage (ctx 2048): o_proj 67.7 µs, kv_a 28.6, q_a 28.7, q_b 24.8 (each incl. its 4-malloc quant→relayout→GEMM chain); flashmla sparse decode 48.2 µs; assembly family (assemble+quant+pack, buffers reused) 8.4 µs. Context length barely moves the total (sparse top-k=2048 caps attention work).

**The headline: eliminating per-call `cudaMalloc`s (the zero-alloc scratch forward) recovers ~70 µs/layer — 24% of the per-layer attention path — and it drops the *GPU-measured* time too (286→218 µs), not just wall.** That proves the synchronous mallocs were serializing against the stream, exactly the "step 0 dwarfs fusion" thesis above, now with real numbers. Projected 75-MoE-layer attention share: 22.3 ms/token as-is → ~17 ms/token with the arena alone.

Measured on an R535 host via a 3-symbol `cuLibrary*`-enumeration shim (the box's 12.2 driver lacks the 12.4+ enumeration APIs cudarc calls); the shim only stubs kernel *enumeration* — dispatch uses the real `cuLibraryGetKernel`-by-name, and the parity assertion guards against any silent kernel-load breakage, so the timings are of real kernel execution.

## Next

Step 0 (allocation arena + CUDA Graph on the decode path) is now quantified at ~24%/layer and is the clear first move — bigger than any fusion. Step 1 (sparse-ThunderMLA fusion of the 48 µs flashmla partial+reduction) is the next-largest single lever; worth a design issue once step 0 lands. The scratch forward in this branch is the arena half of step 0.
67 changes: 67 additions & 0 deletions docs/models/glm52/kernel-perf-decode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# GLM5.2 decode kernel performance — measured baseline and optimization ladder

**TL;DR**: After #535 hoisted the FlashMLA tile schedule for every path, one GLM5.2 MLA decode layer costs ~268 µs GPU / 278 µs wall at bs=1 on H100 (measured, parity-verified). Two stacked, implemented optimizations cut it to **168 µs GPU / 178 µs wall (−36%)**: **(0a) an MLA-layer allocation arena** removes ~73 µs — the bring-up forward still does ~20 synchronous `cudaMalloc`s per layer per token (#535's persistent workspace covered the MoE chain, not MLA), and they serialize the decode stream, so eliminating them drops GPU time, not just host time; **(0c) capturing the forward into a CUDA Graph** removes ~27 µs more — graph replay launches the ~18 kernels back-to-back and removes the inter-kernel bubbles where the GPU idled between host launches (this is the "PR5c graph target" the #535 doc names). The scratch's schedule handling reuses #535's `Glm52MlaSchedMetadata` (one plan type, no duplicate). Next design lever assessment unchanged: partial+combine fusion is deprioritized by measurement. Everything below is from `glm52_kernel_bench` on a real H100, parity-verified against the bring-up forward.

Last touched: 2026-07

## Measured baseline (`glm52_kernel_bench`, bs=1, synthetic weights, iters=64)

| stage (ctx 2048, on top of #535) | gpu | wall | cumulative saved |
|---|---|---|---|
| as-is forward (incl. #535's hoisted schedule) | 267.6 µs | 278.0 µs | — |
| 0a MLA arena (`Glm52MlaDecodeScratch`) | 194.7 µs | 205.1 µs | −73 µs |
| **0c + CUDA Graph** | **167.9 µs** | **178.3 µs** | **−100 µs (−36%)** |

ctx 512 tracks it (graph 168.6 / 178.8 µs). Parity-verified bitwise against the bring-up forward. History: pre-#535 this ladder read 288 → 218 (arena) → 190 (schedule hoist) → 166 µs (−42%); #535 landed the schedule hoist for every path (as-is dropped 288 → 268), so this branch's remaining contribution is the arena + the graph.

ctx 512 tracks it (graph 165.6 / 175.8 µs). All three are parity-verified against the alloc-heavy forward. Projected 75-MoE-layer attention share: 22.3 ms/token → **~13.1 ms/token**.

Per stage (ctx 2048, alloc chain included in the projections):

| stage | wall | notes |
|---|---|---|
| o_proj `fp8_linear` | 67.7 µs | [1,16384]·[16384,6144] fp8 — the widest projection |
| kv_a / q_a / q_b | 28.6 / 28.7 / 24.8 µs | quant → TMA-relayout → blockscale GEMM, 4 mallocs each |
| flashmla sparse decode | 48.2 µs | metadata + split-KV partial + combine (3 kernels) |
| assembly family | 8.4 µs | query-assemble + kv quant + cache-pack (buffers reused) |

Context length barely moves the total (286 → 291 µs from 512 → 2048) because sparse top-k = 2048 caps the attended set.

**Measurement provenance**: built from `feat/glm52-kernel-bench` with the CUDA 12.9 toolkit; run on an R535 host (driver 12.2) via a 3-symbol `cuLibrary*`-enumeration `LD_PRELOAD` shim, since cudarc 0.19 calls the 12.4+ enumeration APIs the old driver lacks. The shim only stubs kernel *enumeration* — dispatch is the real `cuLibraryGetKernel`-by-name, and the bench's `verify_scratch_parity` asserts the scratch forward is bitwise-identical before any timing, so a broken load fails loudly rather than faking numbers. A real serving path (and CUDA Graph capture) needs an R550+ driver.

## Step 0 — implemented and measured (−34%/layer)

**0a — allocation arena (68 µs).** The correctness-first bring-up allocates every intermediate fresh (`alloc_zeros`) per projection per token: whole MLA layer ≈ 20 `cudaMalloc`s. Each is synchronous and serializes against the decode stream, so the cost shows up in *GPU* time (287 → 218 µs), not just host time. `Glm52MlaDecodeScratch` + `glm52_mla_decode_forward_into` pre-allocate all 20 buffers once and reuse them.

**0b — hoist the FlashMLA tile schedule (28 µs).** The sparse-decode `metadata` kernel builds `tile_scheduler_metadata` + `num_splits` from `batch_size` and `num_sm_parts` only — both fixed by the contract, independent of the per-token query/KV. The bring-up re-ran it every layer every token; `Glm52MlaDecodeScratch::new` now computes it once and the decode path reuses it (218 → 190 µs). Correctness is guarded by the bench's `verify_scratch_parity` (bitwise vs the alloc-heavy forward that still recomputes it), so the data-independence claim is checked, not assumed. In real serving the schedule must be re-cached whenever `batch_size` changes (num_sm_parts is a device constant); for bs=1 latency decode it is computed exactly once.

**0c — CUDA Graph capture (23 µs).** With the arena (0a) and schedule hoist (0b), the forward is a pure kernel sequence, so `CudaGraphState::run_or_capture` (openinfer-core) captures it once and replays with one `cuGraphLaunch`. This removes host launch overhead *and* GPU time: graph replay issues the ~18 kernels back-to-back, so the GPU stops idling between them waiting for the next host launch (188 → 165 µs GPU). Graph capture/launch are CUDA 11.x APIs, so this runs on the R535 host (unlike the cudarc module-enumeration path that needs the shim). `measure_forward_graph` in the bench captures against the same scratch and reports 165 µs.

## Measured non-result: `num_sm_parts` tuning doesn't help the graphed forward

`current_sm90_num_sm_parts` fills all SMs (132 on H100). For bs=1 top-k=2048 that over-splits — each split handles ~16 KV entries and the combine reduces a 132-way, 17.3 MB `o_accum`. Sweeping the split count (`measure_flashmla_at`) on the **isolated** flashmla stage shows a real 1.68× at `num_sm_parts=16` (48.1 → 27.8 µs), output bitwise-identical to the default (`flashmla_parts_max_diff(16) = 0`, so it's a pure parallelization knob).

**But it does not carry through to the optimized forward.** Measured end-to-end (`--sm-parts 16`): the arena+hoist+graph forward is 169.1 µs at parts=16 vs 166.6 µs at parts=132 — no gain, marginally worse. Two reasons, both only visible end-to-end:
- The isolated 20 µs was dominated by the **metadata kernel**, which step 0b already hoists out of the per-token path — the decode partial+combine alone differs by only ~1.6 µs between 16 and 132 splits (scratch forward 190.7 vs 192.3 µs).
- 16 splits use 16/132 SMs, so the partial underutilizes the GPU in the graphed pipeline, offsetting the smaller combine.

(The 30 µs the ungraphed as-is forward saves at parts=16 — 290 → 260 µs — is mostly the cheaper `cudaMalloc` of the smaller accum buffers, which the arena already eliminates.) This corrects an earlier estimate that projected ~19 µs from this tuning; the real measurement is the opposite. It also lowers the expected payoff of step 1 below.

## Step 1 — fuse the FlashMLA sparse partial+combine (smaller than it first looked)

`glm52_flashmla_sparse_decode_launch` runs two CUDA kernels (`csrc/glm52/glm52_flashmla_sparse.cu`):

1. **split-KV partial** (`run_flash_splitkv_mla_fp8_sparse_kernel`) — splits the top-k=2048 KV across `num_sm_parts` SMs, each writing a partial `o_accum` + `lse_accum` to HBM.
2. **combine** (`CombineParams`) — reads every partial back, does the log-sum-exp reduction into the final `out_latent` + `lse`.

On H100 `num_sm_parts = multiProcessorCount / kSq / (kHeads/64) = 132` (one split per SM). With `stride_o_accum_split = kSq·kHeads·kVDim = 1·64·512`, `o_accum` is `132 · 32768 · f32 = 17.3 MB`. The partial writes it and the combine reads it: **~34.6 MB round-trip ≈ 10.3 µs at 3.35 TB/s**, plus one kernel launch (~2 µs graphed/ungraphed).

**ThunderMLA transfer**: fuse partial+combine into one persistent kernel driven by a host-side instruction/tile schedule, and do the cross-split reduction through SM90 thread-block clusters / distributed shared memory instead of the HBM `o_accum` round-trip. Precedent: ThunderKittens `mla` branch, ~250 LoC device, 20–35% over FlashMLA. **But the `num_sm_parts` measurement above resets the expectation**: in the graphed forward the partial+combine only costs ~1.6 µs more at 132 splits than at the round-trip-minimizing 16, so the HBM round-trip this fusion removes is already small on the critical path once step 0 is applied. The fusion's remaining lever is the reduced-occupancy problem (running the reduction on-chip lets you use fewer splits *without* idling SMs), not the round-trip per se — a subtler and smaller win than the isolated 48 µs suggested. Worth a design issue only if bs=1 attention becomes the dominant remaining cost after step 0; not a priority now. Not a drop-in — ThunderMLA is dense; this is a port onto the vendored sparse-FP8 kernel.

## Not worth it yet

Whole-layer megakernel fusion (glue the projections + norm + RoPE + cache-pack into one persistent kernel) — real but its instruction-set/counter-sync complexity is not justified before step 0 (arena+graph) and step 1 (sparse-ThunderMLA) are exhausted. See [[../../lessons/megakernels-for-decode-latency]].

## Next

Land step 0's arena (done in `feat/glm52-kernel-bench`) + graph capture (needs R550+ driver). Then open a design issue for the sparse-ThunderMLA port (step 1). Re-baseline after each with `glm52_kernel_bench`.
5 changes: 5 additions & 0 deletions openinfer-glm52/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@ autobins = false
name = "glm52_load_weights"
path = "src/bin/glm52_load_weights.rs"

[[bin]]
name = "glm52_kernel_bench"
path = "src/bin/glm52_kernel_bench.rs"
required-features = ["glm52"]

[features]
default = []
glm52 = ["dep:openinfer-kernels", "openinfer-kernels/glm52"]
Expand Down
Loading