perf(dsv4): keep resident weights across all multi-card cases #687
YunjiQin wants to merge 1 commit into
Extend the child_memory resident-weight optimization (already used by prefill_fwd) to every remaining multi-card DeepSeek-V4 case, so static weight shards upload to their card once and skip the per-dispatch H2D/D2H transfers.
- decode_fwd: build RESIDENT_WEIGHT_NAMES from the layer-stacked name lists minus CACHE_POOL_NAMES, plus RoPE tables + head/final-norm weights
- decode_layer / prefill_layer: mark stacked attention + compressor + MoE weights (and the static tid2eid table) resident="stacked"
- moe: mark routed/shared expert weights, gate, HC-FFN, norm, tid2eid
- lm_head: mark the TP-sharded lm_head_weight resident="stacked"
Excludes KV/state caches, per-step metadata (slot mappings, block tables, ids, position_ids, sparse indices), input activations, and outputs. All resident names are inputs (is_output=False).
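The tagging pattern described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `TensorSpec` class here is a hypothetical stand-in that models only the `name`, `is_output`, and `resident` fields mentioned in this PR, the contents of the name lists are invented, and `tag_resident` is a helper name assumed for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the repository's TensorSpec; only the fields
# mentioned in this PR are modeled.
@dataclass
class TensorSpec:
    name: str
    is_output: bool = False
    resident: Optional[str] = None


# Illustrative name lists; the real contents live in the model files.
LAYER_STACKED_NAMES = ["wq_a", "wkv", "kv_cache"]
CACHE_POOL_NAMES = ["kv_cache"]
HC_HEAD_NAMES = ["hc_head_fn"]
FINAL_NORM_NAMES = ["norm_w"]

# Static weights = stacked names minus cache pools, plus RoPE tables and
# head/final-norm weights (same shape as the decode_fwd definition).
RESIDENT_WEIGHT_NAMES = frozenset(
    [n for n in LAYER_STACKED_NAMES if n not in CACHE_POOL_NAMES]
    + ["freqs_cos", "freqs_sin"]
    + HC_HEAD_NAMES
    + FINAL_NORM_NAMES
)


def tag_resident(specs):
    """Mark static weight inputs resident; leave caches, metadata, outputs alone."""
    for spec in specs:
        if spec.name in RESIDENT_WEIGHT_NAMES and not spec.is_output:
            spec.resident = "stacked"
    return specs


specs = tag_resident(
    [
        TensorSpec("wq_a"),                    # static weight: tagged
        TensorSpec("kv_cache"),                # cache pool: excluded by name
        TensorSpec("position_ids"),            # per-step metadata: not listed
        TensorSpec("logits", is_output=True),  # output: never resident
    ]
)
resident_names = [s.name for s in specs if s.resident == "stacked"]
```

In this sketch only `wq_a` ends up tagged, which mirrors the exclusion list in the description: caches, per-step metadata, and outputs stay untagged.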
📝 Walkthrough
This PR adds device-residency tagging to tensor spec builders across five DeepSeek v4 model files.
Changes: Device-resident weight tagging
Estimated code review effort: 2 (Simple) | ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request configures static weight parameters to be device-resident and sharded per rank (using resident='stacked') across several DeepSeek v4 model components, including decode_fwd, decode_layer, lm_head, moe, and prefill_layer. This optimization avoids per-dispatch host-to-device and device-to-host transfers. Feedback is provided to simplify the definition of RESIDENT_WEIGHT_NAMES in decode_fwd.py by removing redundant unpacking of CSA_LAYER_STACKED_NAMES and HCA_LAYER_STACKED_NAMES, which are already contained within LAYER_STACKED_NAMES.
    RESIDENT_WEIGHT_NAMES = frozenset(
        [
            n
            for n in (*LAYER_STACKED_NAMES, *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES)
            if n not in CACHE_POOL_NAMES
        ]
        + ["freqs_cos", "freqs_sin"]
        + HC_HEAD_NAMES
        + FINAL_NORM_NAMES
    )
Unpacking CSA_LAYER_STACKED_NAMES and HCA_LAYER_STACKED_NAMES is redundant here because LAYER_STACKED_NAMES already contains all of those names. Simplifying the list comprehension improves readability and avoids unnecessary unpacking operations.
Suggested change:
     RESIDENT_WEIGHT_NAMES = frozenset(
         [
             n
    -        for n in (*LAYER_STACKED_NAMES, *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES)
    +        for n in LAYER_STACKED_NAMES
             if n not in CACHE_POOL_NAMES
         ]
         + ["freqs_cos", "freqs_sin"]
         + HC_HEAD_NAMES
         + FINAL_NORM_NAMES
     )
🧹 Nitpick comments (2)
models/deepseek/v4/moe.py (1)
898-919: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Post-hoc `resident` assignment bypasses `TensorSpec`'s output/resident guard.
`spec.resident = "stacked"` is set after construction, so `TensorSpec.__post_init__`'s check that rejects `resident` combined with `is_output=True` never re-runs. `x_next` (`is_output=True`) is already in `specs` when this loop executes, so a future accidental name collision (e.g. a rename/typo adding `x_next` to `RESIDENT_WEIGHT_NAMES`) would silently mark an output tensor resident instead of raising, per the documented contract: "resident is only valid for inputs (a resident weight stays device-resident across dispatches); it cannot be combined with is_output=True."
A cheap guard restores the intended safety net. Based on learnings, module-level `assert` guards are an accepted convention in this codebase's DeepSeek v4 modules and aren't stripped at runtime.
🛡️ Proposed defensive check
     for spec in specs:
         if spec.name in RESIDENT_WEIGHT_NAMES:
    +        assert not spec.is_output, f"{spec.name!r} is an output tensor; cannot mark resident"
             spec.resident = "stacked"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/moe.py` around lines 898 - 919, The post-construction resident update in the specs loop bypasses TensorSpec's is_output/resident validation, so add a defensive module-level assert before setting spec.resident to stacked in the RESIDENT_WEIGHT_NAMES loop. Use TensorSpec and the RESIDENT_WEIGHT_NAMES assignment block in moe.py to ensure no entry like x_next (or any future typo/rename collision) can ever be marked resident when spec.is_output is true, preserving the constructor's guard.
Source: Learnings
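The bypass described in this comment can be reproduced in isolation. The sketch below assumes `TensorSpec` is a plain (non-frozen) dataclass whose `__post_init__` enforces the resident/output rule; the real class is not shown in this thread, so the class body, the error message, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TensorSpec:  # hypothetical stand-in for the repository's class
    name: str
    is_output: bool = False
    resident: Optional[str] = None

    def __post_init__(self):
        # Runs once, at construction time only.
        if self.resident is not None and self.is_output:
            raise ValueError(f"{self.name!r}: resident cannot be combined with is_output=True")


# "x_next" simulates the accidental name collision the review warns about.
RESIDENT_WEIGHT_NAMES = frozenset({"gate_w", "x_next"})


def tag_unguarded(specs):
    for spec in specs:
        if spec.name in RESIDENT_WEIGHT_NAMES:
            spec.resident = "stacked"  # plain attribute write: no validation
    return specs


def tag_guarded(specs):
    for spec in specs:
        if spec.name in RESIDENT_WEIGHT_NAMES:
            assert not spec.is_output, f"{spec.name!r} is an output tensor; cannot mark resident"
            spec.resident = "stacked"
    return specs


# Without the guard, the output tensor is silently marked resident.
silently = tag_unguarded([TensorSpec("x_next", is_output=True)])

# With the guard, the same collision fails loudly.
try:
    tag_guarded([TensorSpec("x_next", is_output=True)])
    guard_fired = False
except AssertionError:
    guard_fired = True
```

One caveat on the design choice: Python skips `assert` statements when run with `-O`, so the guard relies on the codebase never running optimized, as the comment above asserts. Raising `ValueError` explicitly would be a sturdier variant if that ever changes.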
models/deepseek/v4/prefill_layer.py (1)
836-871: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Same validation-bypass gap as `moe.py`'s `build_tensor_specs`.
`x_next` (`is_output=True`, appended at line 836) is already present in `tensor_specs` when the `RESIDENT_WEIGHT_NAMES` loop runs at lines 868-870, so a future name collision would silently set `resident` on an output tensor without triggering `TensorSpec`'s constructor-time guard against `resident` + `is_output=True`.
🛡️ Proposed defensive check
     for spec in tensor_specs:
         if spec.name in RESIDENT_WEIGHT_NAMES:
    +        assert not spec.is_output, f"{spec.name!r} is an output tensor; cannot mark resident"
             spec.resident = "stacked"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/prefill_layer.py` around lines 836 - 871, The `x_next` output can still be matched by the `RESIDENT_WEIGHT_NAMES` loop and have `resident="stacked"` assigned after construction, bypassing the `TensorSpec` output/resident validation. Update the `tensor_specs` post-processing in `prefill_layer.py` so only non-output specs are eligible for resident marking, or add an explicit guard before setting `spec.resident`; use the `x_next`, `RESIDENT_WEIGHT_NAMES`, and `TensorSpec` symbols to keep the check aligned with the constructor rule.
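An alternative to a separate assert, if `TensorSpec` is a dataclass as the reference to `__post_init__` suggests, is to rebuild the spec with `dataclasses.replace`, which goes back through `__init__` and therefore re-runs the constructor check. This is only a sketch under that assumption: the class body and the `with_resident` helper are hypothetical, and this is not what the PR or the review proposes.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class TensorSpec:  # hypothetical stand-in for the repository's class
    name: str
    is_output: bool = False
    resident: Optional[str] = None

    def __post_init__(self):
        if self.resident is not None and self.is_output:
            raise ValueError(f"{self.name!r}: resident cannot be combined with is_output=True")


def with_resident(spec, mode="stacked"):
    """Return a copy of spec marked resident; re-runs __post_init__ validation."""
    return dataclasses.replace(spec, resident=mode)


RESIDENT_WEIGHT_NAMES = frozenset({"wq"})  # illustrative contents
tensor_specs = [TensorSpec("wq"), TensorSpec("x_next", is_output=True)]

# Rebuild matching specs instead of mutating them in place.
tensor_specs = [
    with_resident(s) if s.name in RESIDENT_WEIGHT_NAMES else s
    for s in tensor_specs
]

# A collision with an output tensor now hits the constructor rule directly.
try:
    with_resident(TensorSpec("x_next", is_output=True))
    rejected = False
except ValueError:
    rejected = True
```

The trade-off is that the list now holds new objects, so any code that kept references to the original specs would not see the change. The in-place assert is the smaller diff.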
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1e986e5-672d-4ce8-b063-818ab6148a26
📒 Files selected for processing (5)
- models/deepseek/v4/decode_fwd.py
- models/deepseek/v4/decode_layer.py
- models/deepseek/v4/lm_head.py
- models/deepseek/v4/moe.py
- models/deepseek/v4/prefill_layer.py
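Summary
- Extend the child_memory resident-weight optimization (already used by `prefill_fwd`) to every remaining multi-card DeepSeek-V4 case, so static weight shards upload to their card once and skip the per-dispatch H2D/D2H.
- `decode_fwd`: build `RESIDENT_WEIGHT_NAMES` from the layer-stacked name lists minus `CACHE_POOL_NAMES`, plus RoPE tables + head/final-norm weights.
- `decode_layer` / `prefill_layer`: mark stacked attention + compressor + MoE weights (and the static `tid2eid` table) `resident="stacked"`.
- `moe`: mark routed/shared expert weights, gate, HC-FFN, norm, and `tid2eid` resident.
- `lm_head`: mark the TP-sharded `lm_head_weight` `resident="stacked"`.
- Excludes KV/state caches, per-step metadata (slot mappings, block tables, ids, position_ids, sparse indices), input activations, and outputs. All resident names are inputs (`is_output=False`).
Related Issues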
N/A