feat(dsv4): single-token decode config (B=8, S=1, T=8) by zhangqi-chen · Pull Request #638 · hw-native-sys/pypto-lib

zhangqi-chen · 2026-06-29T03:45:38Z

Summary

Switches the DeepSeek-V4 Flash decode config from MTP (B=4, S=2) to single-token decode (B=8, S=1), keeping T = B*S = 8 so the MoE pipeline width is unchanged.

The kernel-side enablement for small/single-token decode already lives on main — qkv_proj_rope pads T to MATMUL_T_TILE=16, decode_sparse_attn_swa zero-pads the MTP overlay up to ATTN_K_TILE, and #647 drives the MoE pipeline natively at T = MOE_TOKENS (prefill entries override to PREFILL_TOKENS). So the decode switch is now a pure config flip.
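The token-count arithmetic behind the flip can be sketched in a few lines. This is only an illustration, not pypto-lib code: `decode_tokens` and `pad_to_tile` are hypothetical helpers, and the only values taken from the description are the (B, S) pairs and MATMUL_T_TILE=16.

```python
MATMUL_T_TILE = 16  # tile size quoted in the PR description


def decode_tokens(batch: int, seq: int) -> int:
    """Tokens per decode step: T = B * S."""
    return batch * seq


def pad_to_tile(tokens: int, tile: int) -> int:
    """Round tokens up to the next multiple of tile (hypothetical helper)."""
    return -(-tokens // tile) * tile


old_t = decode_tokens(4, 2)  # MTP config: B=4, S=2
new_t = decode_tokens(8, 1)  # single-token config: B=8, S=1
padded_t = pad_to_tile(new_t, MATMUL_T_TILE)

print(old_t, new_t, padded_t)  # 8 8 16
```

Both configs give T = 8, which is why the MoE pipeline width does not change, and a T of 8 rounds up to one 16-row matmul tile.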

Also sizes the prefill MoE recv depth and auto-scopes the MoE so the real-weight ep8 prefill path runs end-to-end.

Changes

config (decode): DECODE_BATCH/SEQ → B=8, S=1 (T=8). Pure config flip; the kernel enablement is already on main.
config (prefill): PREFILL_RECV_MAX = 1024. With real gate weights the routing skews well past the RECV_SAFETY=4 uniform bound, so the formula-derived depth (96) overflows and deadlocks dispatch/combine on real-weight ep8 prefill. A bisection on the real-weight ep8 path puts the minimal passing depth in (768, 1024].
moe: mark moe @pl.jit.inline(auto_scope=False) and wrap expert_routed + combine + hc_post in a plain pl.scope() so the compiler places AUTO runtime scopes across the whole MoE instead of one hand-placed pl.scope(mode=MANUAL). The auto placement recycles the large recv buffers per stage and spreads the MoE across two rings, dropping the worst-ring MoE ring-heap high-water from 126% (unscoped) / ~90% (single MANUAL scope) to ~71% of the 1G ring (measured via scope_stats), keeping the larger RECV_MAX within budget.
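The bisection mentioned for PREFILL_RECV_MAX can be sketched as below. Everything here is an assumption made for illustration: `bracket_min_passing_depth` is a hypothetical helper, `fake_run` stands in for an actual ep8 prefill run, its threshold of 900 is arbitrary, and the starting bracket of 512 to 1024 is a guess chosen so that one step reproduces the (768, 1024] interval quoted above. The sketch also assumes that once a depth passes, every larger depth passes too.

```python
from typing import Callable, Tuple


def bracket_min_passing_depth(
    passes: Callable[[int], bool], lo: int, hi: int, steps: int
) -> Tuple[int, int]:
    """Narrow a bracket (lo, hi] that contains the minimal passing depth.

    Requires passes(lo) to be False and passes(hi) to be True, and assumes
    passing is monotonic in depth.
    """
    if passes(lo) or not passes(hi):
        raise ValueError("need a failing lo and a passing hi")
    for _ in range(steps):
        if hi - lo <= 1:
            break
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid  # mid works, so the minimum is at or below mid
        else:
            lo = mid  # mid fails, so the minimum is above mid
    return lo, hi


def fake_run(depth: int) -> bool:
    """Hypothetical stand-in for a real run; the 900 threshold is arbitrary."""
    return depth >= 900


print(bracket_min_passing_depth(fake_run, 512, 1024, steps=1))  # (768, 1024)
```

Each probe is a full end-to-end run, so stopping after a few steps and taking the passing upper end of the bracket, as the PR does with 1024, is a reasonable trade-off.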

Validation (a2a3)

decode_fwd.py / prefill_fwd.py ep2 (T=8 / T=128): run end-to-end.
decode_fwd.py / prefill_fwd.py ep8 (cards 8–15), real W8A8 weights: run end-to-end. Real-weight prefill needs RECV_MAX=1024 (else dispatch/combine deadlock); the default 1G ring heap suffices with the MoE auto-scope.
prefill_fwd.py ep2 re-run after the auto-scope change: PASS, scope_stats worst-ring heap high-water 90.6% → 71.2%, all rings fatal=False / dropped=0.

Notes

decode_fwd / prefill_fwd have no built-in golden reference, so "PASS" here means only that the run completed, not that the outputs were checked.
prefill_fwd compile emits non-fatal sparse_blk_*__phi_v7 not found in MLIR mapping logs from the prefill sparse-attn codegen (pre-existing, unrelated to this change); compile + runtime both complete.

coderabbitai · 2026-06-29T03:45:46Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


gemini-code-assist

Code Review

This pull request updates the DeepSeek v4 model configuration and kernels to support single-token decoding by setting DECODE_SEQ = 1 and reducing several tile sizes from 128 to 64. It also adapts the sliding window attention (decode_sparse_attn_swa.py) to handle overlay sizes T smaller than ATTN_K_TILE by zero-padding the overlay. Feedback on these changes suggests optimizing the padding logic to conditionally execute only when T < ATTN_K_TILE, avoiding unnecessary memory bandwidth and SPMD launch overhead on the main production path where T == ATTN_K_TILE.
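The reviewer's suggestion, padding only when T < ATTN_K_TILE, might look like the sketch below. It uses plain Python lists rather than the real kernel API; `pad_overlay` is a hypothetical helper and the ATTN_K_TILE value of 16 is assumed for illustration, since the actual constant is not given here.

```python
from typing import List

ATTN_K_TILE = 16  # assumed value; the real constant is not stated in the PR


def pad_overlay(overlay: List[List[float]], k_tile: int) -> List[List[float]]:
    """Zero-pad the overlay rows up to k_tile, skipping the work when full."""
    t = len(overlay)
    if t >= k_tile:
        # Fast path: overlay already fills the tile, return it without copying.
        return overlay
    width = len(overlay[0]) if overlay else 0
    zero_rows = [[0.0] * width for _ in range(k_tile - t)]
    return overlay + zero_rows


small = [[1.0, 2.0] for _ in range(8)]            # T = 8, needs padding
full = [[1.0, 2.0] for _ in range(ATTN_K_TILE)]   # T == ATTN_K_TILE

padded = pad_overlay(small, ATTN_K_TILE)
print(len(padded), pad_overlay(full, ATTN_K_TILE) is full)  # 16 True
```

In the full-tile case the same object comes back, which is the point of the suggestion: the production path pays for neither the copy nor the extra launch.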


Switch the decode config from MTP (B=4, S=2) to single-token decode (B=8, S=1), keeping T = B*S = 8 so the MoE pipeline width is unchanged. The kernel-side enablement for small/single-token decode already lives on main: qkv_proj_rope pads T to MATMUL_T_TILE=16, decode_sparse_attn_swa zero-pads the MTP overlay up to ATTN_K_TILE, and hw-native-sys#647 drives the MoE pipeline natively at T = MOE_TOKENS (prefill entries override to PREFILL_TOKENS). So this is now a pure config flip. Validated a2a3 ep2: decode_fwd (T=8) and prefill_fwd (T=128) run end-to-end.

… MoE ring-heap config: PREFILL_RECV_MAX=1024 -- real gate routing skews past the RECV_SAFETY=4 uniform bound, overflowing the default recv depth and deadlocking dispatch/combine on real-weight ep8 prefill. moe: mark `moe` @pl.jit.inline(auto_scope=False) and use a plain pl.scope() around expert_routed + combine + hc_post so the compiler places AUTO runtime scopes across the whole MoE instead of one hand-placed MANUAL scope. The auto placement recycles the large recv buffers per stage and spreads the MoE over two rings, dropping the worst-ring MoE ring-heap high-water from 126% (unscoped) / ~90% (single MANUAL scope) to ~71% (keeps the larger RECV_MAX within the 1G ring).

## Summary

Splits the **MoE auto-scope** change out of #638 so it can land independently of that PR's decode config flip (`B=8, S=1`) and prefill `RECV_MAX` sizing.

Mark `moe` `@pl.jit.inline(auto_scope=False)` and wrap `expert_routed` + `combine` + `hc_post` in a plain `pl.scope()` so the compiler places AUTO runtime scopes across the whole MoE instead of one hand-placed `pl.scope(mode=MANUAL)`. The auto placement recycles the large recv buffers per stage and spreads the MoE across two rings, dropping the **worst-ring** MoE ring-heap high-water from **126%** (unscoped) / **~90%** (single MANUAL scope) to **~71%** of the 1G ring (measured via `scope_stats`). This keeps a larger `RECV_MAX` within the ring budget.

## Changes

- **moe.py**: `moe` → `@pl.jit.inline(auto_scope=False)`; wrap `expert_routed` + `combine` + `hc_post` in `pl.scope()`.

## Validation (a2a3)

As validated in #638: `prefill_fwd.py` ep2 re-run after the auto-scope change PASSes, `scope_stats` worst-ring heap high-water 90.6% → 71.2%, all rings `fatal=False / dropped=0`. Real-weight ep8 prefill (cards 8-15) runs end-to-end with the default 1G ring heap.

gemini-code-assist Bot reviewed Jun 29, 2026

Comment thread models/deepseek/v4/decode_sparse_attn_swa.py Outdated

zhangqi-chen force-pushed the dsv4-decode-s1 branch 2 times, most recently from 9734429 to bb43b9f, on June 30, 2026 08:39

zhangqi-chen force-pushed the dsv4-decode-s1 branch from bb43b9f to 2a4828d on June 30, 2026 17:55

zhangqi-chen changed the title from ~~feat(dsv4): support single-token decode (B=64, S=1, T=64)~~ to feat(dsv4): single-token decode config (B=8, S=1, T=8) on Jun 30, 2026

zhangqi-chen force-pushed the dsv4-decode-s1 branch from 2a4828d to d184688 on July 1, 2026 02:20

zhangqi-chen mentioned this pull request Jul 2, 2026

feat(dsv4 moe): auto-scope MoE to lower ring-heap high-water #668

Merged

feat(dsv4): single-token decode config (B=8, S=1, T=8) #638
zhangqi-chen wants to merge 2 commits into hw-native-sys:main from zhangqi-chen:dsv4-decode-s1
