
Conversation

@nvrohanv
Contributor

@nvrohanv nvrohanv commented Oct 25, 2025

Restructured the example layout, separating aggregated (agg) cases from disaggregated (disagg) ones for a cleaner directory tree. Going forward, each example config should have its own directory with a deploy.yaml and perf.yaml. I added the DeepSeek-R1 wide-EP GB200 recipe in this format. After these changes there are still Slurm examples that need to be migrated.
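
For orientation, the restructured DeepSeek-R1 TRT-LLM recipe tree looks roughly like this (paths taken from the files touched in this PR; contents abridged):

recipes/deepseek-r1/
├── model-cache/            # model-cache.yaml, model-download.yaml
└── trtllm/
    ├── agg/wide_ep/        # wide_ep_agg.yaml, …
    └── disagg/wide_ep/     # wide_ep_prefill.yaml, wide_ep_decode.yaml, eplb.yaml
        └── gb200/          # deploy.yaml, perf.yaml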

Cleaned up the model-download flow; the model-cache directory now contains:

  • model-cache.yaml — defines a PVC named model-cache-pvc

  • model-download.yaml — performs the model download into the PVC

Several changes were made within model-download.yaml:

  • Updated volume mount names to consistently use model-cache.

  • Switched the download command to hf download <model> --local-dir /model-cache, to avoid dealing with model revisions and having to figure out snapshot IDs (a sketch of both manifests follows this list).
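
A minimal sketch of the two manifests described above. The base image, access mode, backoff/restart settings, and model names/destinations are assumptions for illustration (model names and paths follow the later review comments); the actual recipe files may differ:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc            # referenced by the download Job and the workers
spec:
  accessModes: ["ReadWriteMany"]   # assumption: shared filesystem so multiple nodes can read
  resources:
    requests:
      storage: 1500Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: model-download
        image: python:3.12-slim    # placeholder image; the recipe may use a different one
        command: ["/bin/sh", "-c"]
        args:
          - |
            pip install -U "huggingface_hub[cli]"
            hf download nvidia/DeepSeek-R1-FP4 --local-dir /model-cache/deepseek-r1-fp4
            hf download deepseek-ai/DeepSeek-R1 --local-dir /model-cache/deepseek-r1
        volumeMounts:
        - name: model-cache        # mount name kept consistent with model-cache
          mountPath: /model-cache
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc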

# 4096 = 256 * 16
# moe_max_num_tokens: 4096
load_balancer: /mnt/recipes/deepseek-r1/trtllm/wide_ep/eplb.yaml
load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml

Is this assuming some specific PVC mounting path?

Contributor Author

Yes, this actually comes from the Slurm guides. In a separate PR we should remove those and turn all of these into k8s examples (if we want to keep Slurm examples, we need to fully separate them out).

values:
- "true"
mainContainer:
image: rohanv672/dynamo:0.5.1-trtllm-ssh

Will need to update this obviously 🙂

Contributor Author

Yes, testing with the actual Dynamo container today.

Contributor Author

Updated.
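
For reference, the image the automated review later reports in the updated manifest (whether this tag is final is still being confirmed in this thread):

mainContainer:
  image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0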

initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3000

We should probably fix this to a more reasonable number.

Contributor Author

What's a good value, do you think? Depending on PVC speed it can end up taking around an hour; I'm thinking 1.5 hours, with a comment saying it depends on your PVC speed.

Contributor Author

Set to around 2 hours and added a comment noting that a slow PVC may require updating the probe.
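
A sketch of the adjusted probe, using the values the automated review later reports for the GB200 manifests; the exact threshold is a judgment call that should scale with PVC throughput:

startupProbe:
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  # Model loading time depends heavily on PVC speed; 500 * 10s ≈ 83 minutes.
  # Raise this if your PVC is slow enough that loading can approach 2 hours.
  failureThreshold: 500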

requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present

Is this necessary given we also specify it in the limits?

Contributor Author

Good point.

Contributor Author

Removed.
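
With the explicit node affinity gone, GPU placement is driven by the resource request alone, roughly like this (the GPU count below is illustrative, not the recipe's actual value):

resources:
  limits:
    nvidia.com/gpu: "4"   # the scheduler only places the pod on nodes that can satisfy this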

metadata:
name: trtllm-test-compute-domain
spec:
numNodes: 9

We should add a comment here and in prefill/decode that these need to match.


Or ideally that you don't need to specify it ahead of time and the compute domain will grow/shrink accordingly.

Contributor Author

Once Grove adds support for automatic DRA we'll remove this; I'll add the comment.

Contributor Author

Added the comment.
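
A sketch of what that comment might look like in deploy.yaml (the wording is illustrative; the node counts match the 1 prefill + 8 decode split described in the review below):

# NOTE: numNodes must equal prefill.nodeCount + decode.nodeCount (1 + 8 = 9 here).
# This manual sizing can be dropped once Grove supports automatic DRA.
metadata:
  name: trtllm-test-compute-domain
spec:
  numNodes: 9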

name: trtllm-disagg-multinode
spec:
pvcs:
- name: modelcache-pvc

Is there a reason we need the PVC here and in each component? Also where is this PVC defined?

Contributor Author

Yes, I'll be adding the same model_downloader setup as in biswapanda's PR.

Contributor Author

Added the model-downloader for the PVC; also, the PVC is no longer mounted into the frontend.

@nvrohanv nvrohanv changed the title from "restructure dsr1 recipes and add gb200" to "feat: restructure dsr1 recipes and add gb200" on Oct 30, 2025
@github-actions github-actions bot added the feat label Oct 30, 2025
@nvrohanv nvrohanv force-pushed the nvrohanv/add-dsr1-gb200-widep-recipe branch from 9bb62ad to 1692c43 on October 30, 2025 11:14
@nvrohanv nvrohanv marked this pull request as ready for review October 30, 2025 11:26
@nvrohanv nvrohanv requested review from a team as code owners October 30, 2025 11:26
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Walkthrough

The changes reorganize TRTLLM configuration file paths from a flat structure to hierarchical aggregated/disaggregated subdirectories, rename model cache PVC references for consistency, add a Kubernetes model-download Job, and introduce new disaggregated deployment configurations for GB200 systems.

Changes

  • Configuration Path Reorganization
    Files: docs/backends/trtllm/README.md, docs/backends/trtllm/multinode/multinode-examples.md, examples/basics/multinode/trtllm/srun_aggregated.sh, examples/basics/multinode/trtllm/srun_disaggregated.sh
    Updated config file paths from the flat structure (wide_ep/) to a hierarchical organization separating aggregated (agg/wide_ep/) and disaggregated (disagg/wide_ep/) configurations across deployment scripts and documentation.
  • PVC and Mount Path Updates
    Files: recipes/deepseek-r1/model-cache/model-cache.yaml, recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml, recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
    Renamed the PVC from model-cache to model-cache-pvc, increased storage from 1000Gi to 1500Gi, updated mount paths from /root/.cache/huggingface to /model-cache, and adjusted model-path arguments.
  • Model Cache Management
    Files: recipes/deepseek-r1/model_cache/model-download.yaml
    Added a Kubernetes Job to orchestrate model downloads via huggingface_hub, installing dependencies and downloading models to the persistent volume with retry and completion settings.
  • Load Balancer Configuration Updates
    Files: recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml
    Updated load_balancer paths to reflect the new directory structure; added a new eplb.yaml for the disaggregated wide-EP configuration with MOE load balancer settings.
  • Disaggregated Deployment Manifests
    Files: recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml
    Introduced new multi-node disaggregated deployment and benchmark Job configurations for GB200 systems with prefill/decode separation, ConfigMaps for engine configs, DynamoGraphDeployment orchestration, and comprehensive performance benchmarking infrastructure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Path consistency verification: Confirm all references to agg/wide_ep/ and disagg/wide_ep/ are applied uniformly across configuration files and documentation
  • PVC naming propagation: Validate that model-cache-pvc is consistently referenced in all volume mounts and claims across deployment manifests
  • New GB200 manifests: Review correctness of disaggregated deployment topology (prefill single-node, decode multi-node with 8 nodes), ConfigMap structure, and ComputeDomain configuration
  • Benchmark job logic: Assess the perf.yaml Job's shell scripting, aiperf integration, polling mechanisms, and environment variable handling for robustness
  • Model cache Job: Ensure model-download.yaml correctly mounts PVC and executes download commands without issues

Poem

🐰 Paths reorganized, now clean and neat,
Agg and disagg, a structured treat!
PVCs renamed, configs aligned,
GB200 awaits, benchmarks designed!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check — ❓ Inconclusive: The pull request description provides detailed technical content explaining the restructuring of the example layout, the separation of aggregated and disaggregated cases, and the specific changes to the model-cache and model-download files. However, it does not follow the required template structure: it lacks the template headings (Overview, Details, Where should the reviewer start?, Related Issues), does not tell reviewers which files to prioritize, and does not reference any related GitHub issues using the required action keywords (Closes/Fixes/Resolves/Relates to). While the technical content is informative and substantive, the structural deviation from the template and the missing guidance sections make it unclear whether this submission meets the repository's documentation standards. Resolution: restructure the description to follow the provided template, adding an "Overview" section summarizing the PR purpose, a "Details" section with the change descriptions, a "Where should the reviewer start?" section identifying critical files for review (such as recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml and the new model-download.yaml), and a "Related Issues" section with the actual GitHub issue number using appropriate action keywords if applicable.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title "feat: restructure dsr1 recipes and add gb200" accurately reflects the two major themes of the changeset. The raw summary shows extensive restructuring of DeepSeek-R1 recipe paths through new agg/ and disagg/ subdirectories, plus the addition of new GB200-specific deployment files (gb200/deploy.yaml and gb200/perf.yaml). The title is specific enough that a developer scanning history would understand the PR involves recipe reorganization and GB200 support, avoiding vague terms like "misc updates."

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
recipes/deepseek-r1/model_cache/model-download.yaml (1)

1-37: Approve with note on root user security posture.

The model-download Job is well-structured and the hardcoded model names (nvidia/DeepSeek-R1-FP4 and deepseek-ai/DeepSeek-R1) follow the intended recipe-specific pattern per learnings. The PVC naming is consistent, and downloading both FP4 and standard variants provides flexibility for different hardware (GB200 vs. Hopper as noted in documentation). However, Checkov flags CKV_K8S_23 and CKV_K8S_20 for running as root. While this is acceptable for a one-time batch download job, consider running as a non-root user for improved security posture.
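
If the Job is later tightened, a sketch of a non-root container securityContext (the UID/GID values are placeholders; the PVC must be writable by that user, e.g. via a pod-level fsGroup):

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false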

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (2)

76-78: Add clarifying comment for decode nodeCount dependency.

Lines 76–78 appropriately document that ComputeDomain.numNodes must equal prefill.nodeCount + decode.nodeCount. While this is correctly calculated as 1 + 8 = 9 (line 84), the decode section (line 192) lacks a similar comment explaining that its nodeCount: 8 contributes to the ComputeDomain total. For consistency and to aid future maintainers, add a brief comment above line 192 as well.

Add this comment above line 192:

      multinode:
        # NOTE: This nodeCount contributes to ComputeDomain.numNodes (see above)
        nodeCount: 8

Also applies to: 76-192


155-162: Startup probe configuration is appropriate but document the intent.

Both prefill (lines 155–162) and decode (lines 213–220) services use identical startup probe settings with failureThreshold: 500 (~83 minutes at 10-second intervals). A past review suggested adding a comment explaining why these values match and their relationship to PVC speed. Consider adding a clarifying comment to line 155 (or above the prefill startupProbe block):

          # NOTE: Prefill and decode startupProbe thresholds must match.
          #       With PVC latency, model loading can take considerable time.
          #       Adjust failureThreshold based on your PVC performance.
          startupProbe:

Also applies to: 213-220

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bec8b3 and 3ed80d1.

📒 Files selected for processing (14)
  • docs/backends/trtllm/README.md (1 hunks)
  • docs/backends/trtllm/multinode/multinode-examples.md (2 hunks)
  • examples/basics/multinode/trtllm/srun_aggregated.sh (1 hunks)
  • examples/basics/multinode/trtllm/srun_disaggregated.sh (1 hunks)
  • recipes/deepseek-r1/model-cache/model-cache.yaml (1 hunks)
  • recipes/deepseek-r1/model_cache/model-download.yaml (1 hunks)
  • recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml (5 hunks)
  • recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml (5 hunks)
  • recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
Learnt from: biswapanda
PR: ai-dynamo/dynamo#3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
📚 Learning: 2025-07-31T11:26:48.422Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

Applied to files:

  • docs/backends/trtllm/README.md
  • recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml
  • docs/backends/trtllm/multinode/multinode-examples.md
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml
📚 Learning: 2025-07-02T13:20:28.800Z
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.

Applied to files:

  • docs/backends/trtllm/multinode/multinode-examples.md
  • examples/basics/multinode/trtllm/srun_disaggregated.sh
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

  • recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml
  • recipes/deepseek-r1/model_cache/model-download.yaml
  • recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

  • examples/basics/multinode/trtllm/srun_aggregated.sh
🪛 Checkov (3.2.334)
recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml

[medium] 3-153: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 3-153: Container should not be privileged

(CKV_K8S_16)


[medium] 3-153: Minimize the admission of root containers

(CKV_K8S_23)

recipes/deepseek-r1/model_cache/model-download.yaml

[medium] 3-37: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-37: Minimize the admission of root containers

(CKV_K8S_23)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3891/merge) by nvrohanv.
recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

[error] 73-73: Trailing whitespace found and removed by pre-commit. (pre-commit: trailing-whitespace)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (14)
examples/basics/multinode/trtllm/srun_disaggregated.sh (1)

20-20: Path reorganization for disaggregated deployments looks good.

The default engine config paths now clearly reflect the disaggregated subdirectory structure (disagg/wide_ep/), making the distinction from aggregated configs explicit and improving discoverability.

Also applies to: 24-24

examples/basics/multinode/trtllm/srun_aggregated.sh (1)

21-21: Export addition and path reorganization are appropriate.

Explicitly exporting ENGINE_CONFIG ensures the variable is properly passed to worker containers via the --container-env flag (line 66), improving clarity and reliability of environment propagation in multi-node setups.

recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml (1)

20-20: Load balancer path reorganization is consistent with disaggregated layout.

The path update to disagg/wide_ep/eplb.yaml aligns with the new directory structure, and the WIDEEP backend field (line 19) is correctly in all caps per TRTLLM LLM-API requirements.

docs/backends/trtllm/README.md (1)

165-165: Documentation path update maintains consistency with reorganized config structure.

The AGG_ENGINE_ARGS path update correctly reflects the new aggregated subdirectory layout (agg/mtp/), ensuring users follow current file organization when running examples.

recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml (1)

20-20: Load balancer path correctly reflects disaggregated config organization.

Mirrors the prefill config structure with disagg/wide_ep/eplb.yaml path, ensuring consistent load balancer reference across disaggregated deployment components.

recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml (1)

10-10: Deployment manifest consistently updates PVC reference and model paths.

All PVC references now use model-cache-pvc (consistent with the renamed PVC), and model paths correctly reference the mount point /model-cache/deepseek-r1. Changes are consistently applied to both decode and prefill workers.

Also applies to: 37-38, 58-58, 90-91, 111-111

recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml (1)

14-14: Aggregated config load balancer path correctly reorganized.

The path update to agg/wide_ep/eplb.yaml aligns with the aggregated subdirectory structure, mirroring the disaggregated pattern and improving config file organization and discoverability.

recipes/deepseek-r1/model-cache/model-cache.yaml (1)

6-6: PVC rename is incomplete—stale references remain across recipes; clarify scope and verify deepseek-r1 manifests are fully updated.

The verification reveals the PVC rename to model-cache-pvc is incomplete. While only 12 references to model-cache-pvc exist, stale references to the old model-cache name persist across multiple recipes (qwen3-32b-fp8, llama-3-70b, gpt-oss-120b, and deepseek-r1 itself).

Confirm whether this PR intends to:

  1. Update only deepseek-r1 manifests (then verify all deepseek-r1 files reference model-cache-pvc)
  2. Update all recipes consistently (then address stale references in qwen3-32b-fp8, llama-3-70b, gpt-oss-120b, and remaining deepseek-r1 files)

Cross-recipe inconsistency risks deployment failures if some manifests reference the old name.

recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml (1)

1-7: LGTM on structure and licensing.

The new EPLB configuration file is clean, properly licensed, and the MOE load balancer settings (num_slots: 288, layer_updates_per_iter: 2) align with the TensorRT-LLM documentation referenced in the comment. This externalization of the load balancer config from inline definitions is a good refactor for maintainability.
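
For reference, the settings called out above amount to a file along these lines (the exact key layout in the recipe may differ):

# eplb.yaml — MOE expert-parallel load balancer settings (sketch)
num_slots: 288
layer_updates_per_iter: 2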

recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml (1)

10-10: Verify consistency of model cache directory naming.

The PVC name is correctly updated to model-cache-pvc (line 10), and mount paths consistently use /model-cache (lines 35–36, 84–85). However, the --model-path arguments reference /model-cache/deepseek-r1 (lines 56, 105) while the model-download job (recipes/deepseek-r1/model_cache/model-download.yaml) downloads to both /model-cache/deepseek-r1-fp4 and /model-cache/deepseek-r1. Confirm whether this is intentional (using the non-FP4 variant) or should match the FP4 path used in other deployments.

Also applies to: 35-36, 84-85, 56-56, 105-105

docs/backends/trtllm/multinode/multinode-examples.md (1)

139-139: Documentation paths correctly aligned with new structure.

The path updates properly reflect the reorganization into agg/wide_ep/ and disagg/wide_ep/ subdirectories. The example commands (lines 139, 168–169) will guide users to the correct configuration files under the new hierarchy. No functional issues detected.

Also applies to: 168-169

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml (1)

59-59: Align hardcoded tokenizer path with model-download outputs.

Line 59 hardcodes the tokenizer path to /model-cache/deepseek-r1-fp4, but the model-download job (recipes/deepseek-r1/model_cache/model-download.yaml) downloads models to both /model-cache/deepseek-r1-fp4 and /model-cache/deepseek-r1. Additionally, the deployment manifest (recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) references /model-cache/deepseek-r1 for the model path. Confirm whether the tokenizer should be parameterized (via an environment variable or ConfigMap) or hardcoded to match the deployment's model path.
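
One way to avoid the hardcoded path, sketched with an illustrative variable name (not present in the original manifest); the actual perf.yaml may resolve this differently:

env:
- name: TOKENIZER_PATH            # hypothetical variable, for illustration only
  value: /model-cache/deepseek-r1-fp4
# ...then reference $TOKENIZER_PATH in the benchmark command instead of the literal path.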

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (2)

119-119: Container image requires confirmation.

Line 119 references nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0. A past review comment indicated this placeholder "will need to update," and there's a corresponding author comment stating "testing with actual dynamo container today." Confirm that this image tag is finalized and available in the registry before merge, or document if it's intentionally a placeholder for manual deployment configuration.


69-69: Backend field correctly uses all-caps WIDEEP.

Line 69 correctly specifies backend: WIDEEP (all capitals) in the decode config. This aligns with TRTLLM LLM-API requirements for backend field names; based on learnings, this is the correct format.

cache_transceiver_config:
max_tokens_in_buffer: 4608
backend: DEFAULT
stream_interval: 20
Contributor

⚠️ Potential issue | 🟡 Minor

Fix trailing whitespace on line 73.

The pre-commit hook flagged trailing whitespace at line 73. This is in the decode-config ConfigMap YAML. Remove any trailing spaces on this line before merge.

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3891/merge) by nvrohanv.

[error] 73-73: Trailing whitespace found and removed by pre-commit. (pre-commit: trailing-whitespace)

🤖 Prompt for AI Agents
In recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml around line 73,
there is trailing whitespace after "stream_interval: 20"; open the file, remove
any spaces or tabs after the value on line 73 so the line ends immediately after
"20", save the file, and run the pre-commit hook or lint to confirm the trailing
whitespace is gone.

imagePullPolicy: IfNotPresent
name: perf
securityContext:
privileged: true
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Remove privileged mode unless strictly necessary for performance testing.

The container runs with securityContext.privileged: true (line 142), which is flagged as a high-severity security risk (CKV_K8S_16). While the network tuning at line 34 (sysctl -w net.ipv4.ip_local_port_range) may require elevated privileges, consider whether this can be accomplished via allowPrivilegeEscalation and targeted Linux capabilities instead, or defer to the host network configuration.

Is the privileged mode strictly necessary, or can network tuning be handled outside the container or via setLinuxCapabilities?


I'll verify this review comment by examining the actual file and understanding the security context requirements.






🏁 Script executed:

#!/bin/bash
# Find and examine the perf.yaml file
fd -t f "perf.yaml" | head -20

Length of output: 334


🏁 Script executed:

#!/bin/bash
# If found, examine the security context and network tuning sections
fd -t f "perf.yaml" -x cat {}

Length of output: 41288


🏁 Script executed:

#!/bin/bash
# Search for securityContext patterns in Job specs across the codebase
rg -n "securityContext:" -A 5 -B 2 --type=yaml | grep -A 5 -B 2 "privileged"

Length of output: 3917


🏁 Script executed:

#!/bin/bash
# Search for other network tuning or sysctl usage patterns
rg -n "sysctl" -B 5 -A 5 --type=yaml

Length of output: 22978



---

🌐 Web query:

Kubernetes pod securityContext sysctls net.ipv4.ip_local_port_range namespace restriction

💡 Result:

Short answer

  • net.ipv4.ip_local_port_range is in Kubernetes' "safe" (namespaced) sysctls and may be set per-Pod via spec.securityContext.sysctls. [1]
  • It cannot be used when the Pod uses hostNetwork (net namespace must be namespaced). [1]
  • Other sysctls that are not namespaced (node-level) cannot be set in pod securityContext; unsafe sysctls must be explicitly allowed on each node with kubelet's --allowed-unsafe-sysctls. [1]
  • Use admission policies (e.g., Kyverno/PSP replacements) to restrict which sysctls (including allowing only net.ipv4.ip_local_port_range) are permitted in your cluster. [2]

Relevant docs

  • Kubernetes docs: "Using sysctls in a Kubernetes Cluster" (safe vs unsafe, namespaced sysctls, hostNetwork rule, kubelet flag). [1]
  • Example policy enforcement (Kyverno) showing restricting pods to the safe sysctl set including net.ipv4.ip_local_port_range. [2]

Replace container-level privileged: true with pod-level sysctls to eliminate unnecessary privilege escalation.

The container runs with securityContext.privileged: true (line 142) to execute sysctl -w net.ipv4.ip_local_port_range="1024 65000". However, net.ipv4.ip_local_port_range is a namespaced (safe) sysctl in Kubernetes and can be set at the pod level via spec.securityContext.sysctls without requiring privileged mode.

Replace the privileged security context with:

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65000"

Remove the sysctl command from the container startup script. This pattern applies to all perf.yaml files in the repository (recipes/*/perf.yaml).

🤖 Prompt for AI Agents
In recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml around line 142,
the container securityContext sets privileged: true solely to run sysctl -w
net.ipv4.ip_local_port_range="1024 65000"; remove privileged: true and instead
add a pod-level spec.securityContext with runAsUser: 1000, runAsGroup: 1000,
fsGroup: 1000 and a sysctls list containing name: net.ipv4.ip_local_port_range
and value: "1024 65000"; also remove the sysctl command from the container
startup script; apply the same change to all recipes/*/perf.yaml files.

@nvrohanv nvrohanv force-pushed the nvrohanv/add-dsr1-gb200-widep-recipe branch from 8406768 to 6b83018 on October 30, 2025 17:37