perf(dsv4): keep resident weights across all multi-card cases #687

Open
YunjiQin wants to merge 1 commit into hw-native-sys:main from YunjiQin:feat/v4-resident-weights-multicard

@YunjiQin (Contributor) commented Jul 3, 2026

Summary

  • Extend the child_memory resident-weight optimization (already used by prefill_fwd) to every remaining multi-card DeepSeek-V4 case, so static weight shards upload to their card once and skip the per-dispatch H2D/D2H.
  • decode_fwd: build RESIDENT_WEIGHT_NAMES from the layer-stacked name lists minus CACHE_POOL_NAMES, plus RoPE tables + head/final-norm weights.
  • decode_layer / prefill_layer: mark stacked attention + compressor + MoE weights (and the static tid2eid table) resident="stacked".
  • moe: mark routed/shared expert weights, gate, HC-FFN, norm, and tid2eid resident.
  • lm_head: mark the TP-sharded lm_head_weight resident="stacked".
  • Excludes KV/state caches, per-step metadata (slot mappings, block tables, ids, position_ids, sparse indices), input activations, and outputs. All resident names are inputs (is_output=False).
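
The selection rule in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: the `TensorSpec` class is a hypothetical stand-in with only three fields, and every tensor name is invented for the example. Only the pattern (stacked names minus cache pools, plus RoPE tables, then tag matching specs) comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the project's TensorSpec; the real class has more fields.
@dataclass
class TensorSpec:
    name: str
    is_output: bool = False
    resident: Optional[str] = None


# Illustrative names only; the real lists live in the DeepSeek-V4 model files.
LAYER_STACKED_NAMES = ["wq_a", "wkv", "kv_cache"]
CACHE_POOL_NAMES = ["kv_cache"]

# Static weights = stacked names minus cache pools, plus the RoPE tables.
RESIDENT_WEIGHT_NAMES = frozenset(
    [n for n in LAYER_STACKED_NAMES if n not in CACHE_POOL_NAMES]
    + ["freqs_cos", "freqs_sin"]
)


def build_tensor_specs():
    specs = [
        TensorSpec("wq_a"),
        TensorSpec("wkv"),
        TensorSpec("kv_cache"),        # cache: changes every step, not resident
        TensorSpec("freqs_cos"),
        TensorSpec("freqs_sin"),
        TensorSpec("position_ids"),    # per-step metadata, not resident
        TensorSpec("x_next", is_output=True),
    ]
    # Tag only the static weights so they are uploaded once and reused.
    for spec in specs:
        if spec.name in RESIDENT_WEIGHT_NAMES:
            spec.resident = "stacked"
    return specs
```

In this sketch the caches, per-step metadata, and the output keep `resident=None`, so they would still be transferred on every dispatch.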

Related Issues

N/A

@coderabbitai (Bot) commented Jul 3, 2026

📝 Walkthrough

This PR adds device-residency tagging to tensor spec builders across five DeepSeek v4 model files. Each build_tensor_specs function now defines a RESIDENT_WEIGHT_NAMES set and marks matching static weight specs with spec.resident = "stacked", excluding KV/state caches and per-step metadata.

Changes

Device-resident weight tagging

| Layer / File(s) | Summary |
|---|---|
| Decode forward resident weights (models/deepseek/v4/decode_fwd.py) | Defines RESIDENT_WEIGHT_NAMES covering stacked attention/MoE, RoPE cos/sin, head, and final-norm weights; build_tensor_specs marks matching specs resident="stacked". |
| Decode layer resident weights (models/deepseek/v4/decode_layer.py) | Adds RESIDENT_WEIGHT_NAMES for attention and MoE FFN/gate params plus tid2eid; loop marks matching specs resident. |
| LM head weight residency (models/deepseek/v4/lm_head.py) | Marks the TP-sharded lm_head_weight spec as resident="stacked" with explanatory comments. |
| MoE tensor spec residency (models/deepseek/v4/moe.py) | Assigns spec list to a local specs variable, adds RESIDENT_WEIGHT_NAMES for FFN/gate/expert weights, marks matches resident, then returns specs. |
| Prefill layer resident weights (models/deepseek/v4/prefill_layer.py) | Adds RESIDENT_WEIGHT_NAMES for attention/RoPE, HCA/CSA weights, output projection, and MoE static weights; loop marks matching specs resident. |

Estimated code review effort: 2 (Simple) | ~12 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#640: Both PRs modify the tensor-spec construction path in models/deepseek/v4/decode_fwd.py's build_tensor_specs, one adding resident tagging and the other refactoring the same spec-building logic.

Poem

A rabbit hops through weights so fine,
Tagging each as "stacked" this time,
No more shuttling to and fro,
On-device now, they stay and glow.
🐇✨ Dispatch after dispatch, weights hold still,
Cache and metadata roam at will.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: extending resident-weight handling across DeepSeek-V4 multi-card paths. |
| Description check | ✅ Passed | The description matches the code changes and clearly explains the resident-weight optimization across the affected modules. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request configures static weight parameters to be device-resident and sharded per rank (using resident='stacked') across several DeepSeek v4 model components, including decode_fwd, decode_layer, lm_head, moe, and prefill_layer. This optimization avoids per-dispatch host-to-device and device-to-host transfers. Feedback is provided to simplify the definition of RESIDENT_WEIGHT_NAMES in decode_fwd.py by removing redundant unpacking of CSA_LAYER_STACKED_NAMES and HCA_LAYER_STACKED_NAMES, which are already contained within LAYER_STACKED_NAMES.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +143 to +152
    RESIDENT_WEIGHT_NAMES = frozenset(
        [
            n
            for n in (*LAYER_STACKED_NAMES, *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES)
            if n not in CACHE_POOL_NAMES
        ]
        + ["freqs_cos", "freqs_sin"]
        + HC_HEAD_NAMES
        + FINAL_NORM_NAMES
    )

medium

Unpacking CSA_LAYER_STACKED_NAMES and HCA_LAYER_STACKED_NAMES is redundant here because LAYER_STACKED_NAMES already contains all of those names. Simplifying the list comprehension improves readability and avoids unnecessary unpacking operations.

Suggested change
     RESIDENT_WEIGHT_NAMES = frozenset(
         [
             n
-            for n in (*LAYER_STACKED_NAMES, *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES)
+            for n in LAYER_STACKED_NAMES
             if n not in CACHE_POOL_NAMES
         ]
         + ["freqs_cos", "freqs_sin"]
         + HC_HEAD_NAMES
         + FINAL_NORM_NAMES
     )
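
The suggestion rests on the premise that LAYER_STACKED_NAMES is a superset of the CSA and HCA lists. A small sketch of why the two forms then produce the same frozenset is shown here; the name lists are invented placeholders, and the `resident_names` helper is hypothetical, introduced only to compare the two inputs.

```python
# Hypothetical name lists; only the subset relationship matters here.
CSA_LAYER_STACKED_NAMES = ["csa_wq", "csa_wkv"]
HCA_LAYER_STACKED_NAMES = ["hca_wq", "hca_cache"]
LAYER_STACKED_NAMES = ["attn_norm", *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES]
CACHE_POOL_NAMES = ["hca_cache"]


def resident_names(sources):
    """Filter out cache pools and deduplicate, as the frozenset in the PR does."""
    return frozenset(n for n in sources if n not in CACHE_POOL_NAMES)


# The premise of the review comment: both lists are already contained in the main one.
premise_holds = set(CSA_LAYER_STACKED_NAMES) | set(HCA_LAYER_STACKED_NAMES) <= set(
    LAYER_STACKED_NAMES
)

verbose = resident_names(
    (*LAYER_STACKED_NAMES, *CSA_LAYER_STACKED_NAMES, *HCA_LAYER_STACKED_NAMES)
)
simplified = resident_names(LAYER_STACKED_NAMES)
```

Because a frozenset discards duplicates, `verbose == simplified` whenever `premise_holds` is true. If the real lists ever diverge, the simplified form would silently drop names, so the premise is worth confirming against the actual definitions.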

@coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
models/deepseek/v4/moe.py (1)

898-919: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Post-hoc resident assignment bypasses TensorSpec's output/resident guard.

spec.resident = "stacked" is set after construction, so TensorSpec.__post_init__'s check that rejects resident combined with is_output=True never re-runs. x_next (is_output=True) is already in specs when this loop executes, so a future accidental name collision (e.g. a rename/typo adding x_next to RESIDENT_WEIGHT_NAMES) would silently mark an output tensor resident instead of raising, per the documented contract: "resident is only valid for inputs (a resident weight stays device-resident across dispatches); it cannot be combined with is_output=True."

A cheap guard restores the intended safety net. Based on learnings, module-level assert guards are an accepted convention in this codebase's DeepSeek v4 modules and aren't stripped at runtime.
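
The bypass described above can be reproduced with a few lines. This is a sketch under assumptions: the `TensorSpec` here is a hypothetical three-field dataclass whose `__post_init__` mimics the documented contract, not the project's actual class.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in: only the guard described in the review is modeled.
@dataclass
class TensorSpec:
    name: str
    is_output: bool = False
    resident: Optional[str] = None

    def __post_init__(self):
        # Runs once, at construction time only.
        if self.resident is not None and self.is_output:
            raise ValueError(
                f"{self.name!r}: resident cannot be combined with is_output=True"
            )


def constructor_rejects():
    """Passing both flags to the constructor triggers the guard."""
    try:
        TensorSpec("x_next", is_output=True, resident="stacked")
    except ValueError:
        return True
    return False


def posthoc_bypasses():
    """Plain attribute assignment never re-runs __post_init__."""
    spec = TensorSpec("x_next", is_output=True)
    spec.resident = "stacked"  # no validation happens here
    return spec.is_output and spec.resident == "stacked"
```

Both functions return `True` in this sketch: the constructor path raises, while the assignment path leaves an output spec marked resident without any error.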

🛡️ Proposed defensive check
     for spec in specs:
         if spec.name in RESIDENT_WEIGHT_NAMES:
+            assert not spec.is_output, f"{spec.name!r} is an output tensor; cannot mark resident"
             spec.resident = "stacked"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/moe.py` around lines 898 - 919, The post-construction
resident update in the specs loop bypasses TensorSpec’s is_output/resident
validation, so add a defensive module-level assert before setting spec.resident
to stacked in the RESIDENT_WEIGHT_NAMES loop. Use TensorSpec and the
RESIDENT_WEIGHT_NAMES assignment block in moe.py to ensure no entry like x_next
(or any future typo/rename collision) can ever be marked resident when
spec.is_output is true, preserving the constructor’s guard.

Source: Learnings

models/deepseek/v4/prefill_layer.py (1)

836-871: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Same validation-bypass gap as moe.py's build_tensor_specs.

x_next (is_output=True, appended at line 836) is already present in tensor_specs when the RESIDENT_WEIGHT_NAMES loop runs at lines 868-870, so a future name collision would silently set resident on an output tensor without triggering TensorSpec's constructor-time guard against resident + is_output=True.

🛡️ Proposed defensive check
     for spec in tensor_specs:
         if spec.name in RESIDENT_WEIGHT_NAMES:
+            assert not spec.is_output, f"{spec.name!r} is an output tensor; cannot mark resident"
             spec.resident = "stacked"
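
Since the same loop appears in several files, one option is to centralize the guard in a shared helper. The sketch below is an assumption-laden illustration: `mark_resident` is a hypothetical function that does not exist in the PR, and `TensorSpec` is again a simplified stand-in. It raises explicitly rather than asserting, because Python drops `assert` statements when run with `-O`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical stand-in for the project's TensorSpec.
@dataclass
class TensorSpec:
    name: str
    is_output: bool = False
    resident: Optional[str] = None


def mark_resident(specs: Iterable[TensorSpec], names, mode: str = "stacked") -> None:
    """Mark every spec whose name is in `names` as resident, refusing outputs."""
    for spec in specs:
        if spec.name not in names:
            continue
        if spec.is_output:
            # Mirrors the constructor rule: resident is only valid for inputs.
            raise ValueError(f"{spec.name!r} is an output tensor; cannot mark resident")
        spec.resident = mode
```

A caller would replace the per-file loop with `mark_resident(tensor_specs, RESIDENT_WEIGHT_NAMES)`, so a name collision with an output such as `x_next` fails loudly instead of being tagged silently.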
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_layer.py` around lines 836 - 871, The `x_next`
output can still be matched by the `RESIDENT_WEIGHT_NAMES` loop and have
`resident="stacked"` assigned after construction, bypassing the `TensorSpec`
output/resident validation. Update the `tensor_specs` post-processing in
`prefill_layer.py` so only non-output specs are eligible for resident marking,
or add an explicit guard before setting `spec.resident`; use the `x_next`,
`RESIDENT_WEIGHT_NAMES`, and `TensorSpec` symbols to keep the check aligned with
the constructor rule.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a1e986e5-672d-4ce8-b063-818ab6148a26

📥 Commits

Reviewing files that changed from the base of the PR and between 7d376c8 and dbb8ac6.

📒 Files selected for processing (5)
  • models/deepseek/v4/decode_fwd.py
  • models/deepseek/v4/decode_layer.py
  • models/deepseek/v4/lm_head.py
  • models/deepseek/v4/moe.py
  • models/deepseek/v4/prefill_layer.py
