
feat(dsv4): single-token decode config (B=8, S=1, T=8) #638

Draft
zhangqi-chen wants to merge 2 commits into hw-native-sys:main from zhangqi-chen:dsv4-decode-s1

@zhangqi-chen zhangqi-chen commented Jun 29, 2026

Collaborator

Summary

Switches the DeepSeek-V4 Flash decode config from MTP (B=4, S=2) to single-token decode (B=8, S=1), keeping T = B*S = 8 so the MoE pipeline width is unchanged.

The kernel-side enablement for small/single-token decode already lives on main — qkv_proj_rope pads T to MATMUL_T_TILE=16, decode_sparse_attn_swa zero-pads the MTP overlay up to ATTN_K_TILE, and #647 drives the MoE pipeline natively at T = MOE_TOKENS (prefill entries override to PREFILL_TOKENS). So the decode switch is now a pure config flip.
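
The shape arithmetic behind the flip can be checked in a few lines. This is an illustrative sketch only: `tokens` and `round_up_to_tile` are hypothetical helpers, not functions from the repository. Only `MATMUL_T_TILE = 16` and the two (B, S) pairs come from the description above.

```python
# Illustrative shape arithmetic only; not code from the repository.
MATMUL_T_TILE = 16  # tile size quoted in the PR description


def tokens(batch: int, seq: int) -> int:
    """Tokens per decode step: T = B * S."""
    return batch * seq


def round_up_to_tile(t: int, tile: int) -> int:
    """Smallest multiple of `tile` that is >= t (hypothetical helper)."""
    return -(-t // tile) * tile


old_t = tokens(batch=4, seq=2)  # previous MTP config
new_t = tokens(batch=8, seq=1)  # single-token decode config
padded_t = round_up_to_tile(new_t, MATMUL_T_TILE)

print(old_t, new_t, padded_t)  # prints: 8 8 16
```

Both configs give T = 8, which is why the MoE pipeline width is unchanged. The padded size of 16 is the value a matmul tiled at `MATMUL_T_TILE` would see.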

Also sizes the prefill MoE recv depth and auto-scopes the MoE so the real-weight ep8 prefill path runs end-to-end.

Changes

  • config (decode): DECODE_BATCH/SEQ set to B=8, S=1 (T = B*S = 8). Pure config flip; the kernel enablement is already on main.
  • config (prefill): PREFILL_RECV_MAX = 1024. With real gate weights the routing skews well past the RECV_SAFETY=4 uniform bound, so the formula-derived depth (96) overflows and deadlocks dispatch/combine on real-weight ep8 prefill. A bisection on the real-weight ep8 path puts the minimal passing depth in (768, 1024].
  • moe: mark moe @pl.jit.inline(auto_scope=False) and wrap expert_routed + combine + hc_post in a plain pl.scope() so the compiler places AUTO runtime scopes across the whole MoE instead of one hand-placed pl.scope(mode=MANUAL). The auto placement recycles the large recv buffers per stage and spreads the MoE across two rings, dropping the worst-ring MoE ring-heap high-water from 126% (unscoped) / ~90% (single MANUAL scope) to ~71% of the 1G ring (measured via scope_stats), keeping the larger RECV_MAX within budget.
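
The bisection used to size PREFILL_RECV_MAX can be sketched roughly as follows. This is a minimal sketch, assuming pass/fail is monotone in the recv depth. `bracket_min_passing` and `fake_run` are hypothetical names, and the threshold inside `fake_run` is invented so that the demo lands on the reported bracket; the PR states only the bracket (768, 1024], not the exact minimum.

```python
from typing import Callable, Tuple


def bracket_min_passing(
    lo_fail: int,
    hi_pass: int,
    passes: Callable[[int], bool],
    resolution: int,
) -> Tuple[int, int]:
    """Narrow the bracket (lo_fail, hi_pass] until it is at most `resolution` wide.

    Assumes `passes` is monotone: once a depth passes, every larger depth passes.
    """
    if passes(lo_fail) or not passes(hi_pass):
        raise ValueError("need a failing lower bound and a passing upper bound")
    while hi_pass - lo_fail > resolution:
        mid = (lo_fail + hi_pass) // 2
        if passes(mid):
            hi_pass = mid
        else:
            lo_fail = mid
    return lo_fail, hi_pass


def fake_run(depth: int) -> bool:
    """Stand-in for "real-weight ep8 prefill ran to completion at this depth".

    The threshold 800 is invented for the demo; it is not a measured value.
    """
    return depth >= 800


bracket = bracket_min_passing(0, 1024, fake_run, resolution=256)
print(bracket)
```

In practice each probe is a full real-weight run, so stopping at a coarse resolution and taking the passing upper bound keeps the number of runs small.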

Validation (a2a3)

  • decode_fwd.py / prefill_fwd.py ep2 (T=8 / T=128): run end-to-end.
  • decode_fwd.py / prefill_fwd.py ep8 (cards 8–15), real W8A8 weights: run end-to-end. Real-weight prefill needs RECV_MAX=1024 (else dispatch/combine deadlock); the default 1G ring heap suffices with the MoE auto-scope.
  • prefill_fwd.py ep2 re-run after the auto-scope change: PASS, scope_stats worst-ring heap high-water 90.6% → 71.2%, all rings fatal=False / dropped=0.
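
Why finer-grained scopes lower the heap high-water can be illustrated with a toy model. This is not the `pl` runtime and does not reproduce `scope_stats`: it only assumes that buffers allocated inside a scope stay live until that scope ends. `high_water` is a hypothetical helper, and the buffer sizes are invented, not measurements.

```python
from typing import List


def high_water(scopes: List[List[int]]) -> int:
    """Peak live size when each inner list is one scope.

    Toy assumption: buffers in a scope stay live until the scope ends,
    then are all released before the next scope starts.
    """
    peak = 0
    for scope in scopes:
        live = 0
        for size in scope:
            live += size
            peak = max(peak, live)
    return peak


# Invented per-stage buffer sizes in arbitrary units.
stages = [[40, 10], [40, 5], [20]]

# One scope around everything: nothing is recycled between stages.
one_scope = high_water([[size for stage in stages for size in stage]])
# One scope per stage: each stage's buffers are released before the next.
per_stage = high_water(stages)

print(one_scope, per_stage)  # prints: 115 50
```

The absolute numbers mean nothing here; the point is only that the peak falls from the sum of all stages to the largest single stage once buffers are recycled per stage.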

Notes

  • decode_fwd / prefill_fwd have no built-in golden, so their "PASS" = ran-to-completion.
  • prefill_fwd compile emits non-fatal sparse_blk_*__phi_v7 not found in MLIR mapping logs from the prefill sparse-attn codegen (pre-existing, unrelated to this change); compile + runtime both complete.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request updates the DeepSeek v4 model configuration and kernels to support single-token decoding by setting DECODE_SEQ = 1 and reducing several tile sizes from 128 to 64. It also adapts the sliding window attention (decode_sparse_attn_swa.py) to handle overlay sizes T smaller than ATTN_K_TILE by zero-padding the overlay. Feedback on these changes suggests optimizing the padding logic to conditionally execute only when T < ATTN_K_TILE, avoiding unnecessary memory bandwidth and SPMD launch overhead on the main production path where T == ATTN_K_TILE.
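
The conditional-padding suggestion can be sketched as below. This is a list-based stand-in for the tensor operation, not the code in decode_sparse_attn_swa.py. `pad_overlay_if_needed` is a hypothetical name, and the tile size used in the demo is arbitrary, not the real `ATTN_K_TILE`.

```python
from typing import List

Row = List[float]


def pad_overlay_if_needed(overlay: List[Row], k_tile: int) -> List[Row]:
    """Zero-pad `overlay` to `k_tile` rows only when it is shorter.

    Returns the input object unchanged when it already fills the tile,
    so the full-tile path does no extra copy.
    """
    t = len(overlay)
    if t > k_tile:
        raise ValueError(f"overlay has {t} rows, more than k_tile={k_tile}")
    if t == k_tile:
        return overlay
    width = len(overlay[0]) if overlay else 0
    return overlay + [[0.0] * width for _ in range(k_tile - t)]


DEMO_K_TILE = 4  # arbitrary demo value, not the real ATTN_K_TILE

short = [[1.0, 2.0], [3.0, 4.0]]
full = [[float(i), float(i)] for i in range(DEMO_K_TILE)]

padded = pad_overlay_if_needed(short, DEMO_K_TILE)
untouched = pad_overlay_if_needed(full, DEMO_K_TILE)

print(len(padded), untouched is full)  # prints: 4 True
```

In a real kernel the branch would be taken at trace or compile time on the static shape, so the full-tile path would pay nothing for the check.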


Comment thread models/deepseek/v4/decode_sparse_attn_swa.py Outdated
@zhangqi-chen zhangqi-chen force-pushed the dsv4-decode-s1 branch 2 times, most recently from 9734429 to bb43b9f, June 30, 2026 08:39
Switch the decode config from MTP (B=4, S=2) to single-token decode
(B=8, S=1), keeping T = B*S = 8 so the MoE pipeline width is unchanged.

The kernel-side enablement for small/single-token decode already lives on
main: qkv_proj_rope pads T to MATMUL_T_TILE=16, decode_sparse_attn_swa
zero-pads the MTP overlay up to ATTN_K_TILE, and hw-native-sys#647 drives the MoE
pipeline natively at T = MOE_TOKENS (prefill entries override to
PREFILL_TOKENS). So this is now a pure config flip.

Validated a2a3 ep2: decode_fwd (T=8) and prefill_fwd (T=128) run
end-to-end.
@zhangqi-chen zhangqi-chen changed the title from "feat(dsv4): support single-token decode (B=64, S=1, T=64)" to "feat(dsv4): single-token decode config (B=8, S=1, T=8)" Jun 30, 2026
… MoE ring-heap

config: PREFILL_RECV_MAX=1024 -- real gate routing skews past the RECV_SAFETY=4 uniform bound, overflowing the default recv depth and deadlocking dispatch/combine on real-weight ep8 prefill.

moe: mark `moe` @pl.jit.inline(auto_scope=False) and use a plain pl.scope() around expert_routed + combine + hc_post so the compiler places AUTO runtime scopes across the whole MoE instead of one hand-placed MANUAL scope. The auto placement recycles the large recv buffers per stage and spreads the MoE over two rings, dropping the worst-ring MoE ring-heap high-water from 126% (unscoped) / ~90% (single MANUAL scope) to ~71% (keeps the larger RECV_MAX within the 1G ring).
zhangqi-chen added a commit that referenced this pull request Jul 2, 2026
## Summary

Splits the **MoE auto-scope** change out of #638 so it can land
independently of that PR's decode config flip (`B=8, S=1`) and prefill
`RECV_MAX` sizing.

Mark `moe` `@pl.jit.inline(auto_scope=False)` and wrap `expert_routed` +
`combine` + `hc_post` in a plain `pl.scope()` so the compiler places
AUTO runtime scopes across the whole MoE instead of one hand-placed
`pl.scope(mode=MANUAL)`.

The auto placement recycles the large recv buffers per stage and spreads
the MoE across two rings, dropping the **worst-ring** MoE ring-heap
high-water from **126%** (unscoped) / **~90%** (single MANUAL scope) to
**~71%** of the 1G ring (measured via `scope_stats`). This keeps a
larger `RECV_MAX` within the ring budget.

## Changes

- **moe.py**: `moe` → `@pl.jit.inline(auto_scope=False)`; wrap
`expert_routed` + `combine` + `hc_post` in `pl.scope()`.

## Validation (a2a3)

As validated in #638: `prefill_fwd.py` ep2 re-run after the auto-scope
change PASSes, `scope_stats` worst-ring heap high-water 90.6% → 71.2%,
all rings `fatal=False / dropped=0`. Real-weight ep8 prefill (cards
8-15) runs end-to-end with the default 1G ring heap.