
Conversation

@nvrohanv
Contributor

@nvrohanv nvrohanv commented Oct 25, 2025

Restructured the example layout, separating aggregated (agg) cases from disaggregated (disagg) ones for a cleaner directory tree. Going forward, each example config should have its own directory with a deploy.yaml and perf.yaml. I added the DeepSeek-R1 wide-EP GB200 recipe in this format. After these changes there are still Slurm examples that need to be migrated.
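
For orientation, the restructured DeepSeek-R1 TRT-LLM recipe tree looks roughly like this (paths taken from the files touched in this PR; contents abridged):

recipes/deepseek-r1/
├── model-cache/            # model-cache.yaml, model-download.yaml
└── trtllm/
    ├── agg/wide_ep/        # wide_ep_agg.yaml, …
    └── disagg/wide_ep/     # wide_ep_prefill.yaml, wide_ep_decode.yaml, eplb.yaml
        └── gb200/          # deploy.yaml, perf.yaml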

Cleaned up the model-download flow; the model-cache directory now contains:

  • model-cache.yaml — defines a PVC named model-cache-pvc

  • model-download.yaml — performs the model download into the PVC

Several changes were made within model-download.yaml:

  • Updated volume mount names to consistently use model-cache.

  • Switched the download command to hf download <model> --local-dir /model-cache, to avoid dealing with model revisions and having to figure out snapshot IDs (a sketch of both manifests follows this list).
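
A minimal sketch of the two manifests described above. The base image, access mode, backoff/restart settings, and model names/destinations are assumptions for illustration (model names and paths follow the later review comments); the actual recipe files may differ:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc            # referenced by the download Job and the workers
spec:
  accessModes: ["ReadWriteMany"]   # assumption: shared filesystem so multiple nodes can read
  resources:
    requests:
      storage: 1500Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: model-download
        image: python:3.12-slim    # placeholder image; the recipe may use a different one
        command: ["/bin/sh", "-c"]
        args:
          - |
            pip install -U "huggingface_hub[cli]"
            hf download nvidia/DeepSeek-R1-FP4 --local-dir /model-cache/deepseek-r1-fp4
            hf download deepseek-ai/DeepSeek-R1 --local-dir /model-cache/deepseek-r1
        volumeMounts:
        - name: model-cache        # mount name kept consistent with model-cache
          mountPath: /model-cache
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc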

# 4096 = 256 * 16
# moe_max_num_tokens: 4096
load_balancer: /mnt/recipes/deepseek-r1/trtllm/wide_ep/eplb.yaml
load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml

Is this assuming some specific PVC mounting path?

Contributor Author

Yes, this actually comes from the Slurm guides. In a separate PR we should remove those and turn all of these into k8s examples (if we want to keep Slurm examples, we need to fully separate them out).

values:
- "true"
mainContainer:
image: rohanv672/dynamo:0.5.1-trtllm-ssh

Will need to update this obviously 🙂

Contributor Author

Yes, testing with the actual Dynamo container today.

Contributor Author

Updated.
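
For reference, the image the automated review later reports in the updated manifest (whether this tag is final is still being confirmed in this thread):

mainContainer:
  image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0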

initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3000

We should probably fix this to a more reasonable number.

Contributor Author

What's a good value, do you think? Depending on PVC speed it can end up taking around an hour; I'm thinking 1.5 hours, with a comment saying it depends on your PVC speed.

Contributor Author

Set to around 2 hours and added a comment noting that a slow PVC may require updating the probe.
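
A sketch of the adjusted probe, using the values the automated review later reports for the GB200 manifests; the exact threshold is a judgment call that should scale with PVC throughput:

startupProbe:
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  # Model loading time depends heavily on PVC speed; 500 * 10s ≈ 83 minutes.
  # Raise this if your PVC is slow enough that loading can approach 2 hours.
  failureThreshold: 500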

requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present

Is this necessary given we also specify it in the limits?

Contributor Author

Good point.

Contributor Author

Removed.
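
With the explicit node affinity gone, GPU placement is driven by the resource request alone, roughly like this (the GPU count below is illustrative, not the recipe's actual value):

resources:
  limits:
    nvidia.com/gpu: "4"   # the scheduler only places the pod on nodes that can satisfy this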

metadata:
name: trtllm-test-compute-domain
spec:
numNodes: 9

We should add a comment here and in prefill/decode that these need to match.


Or ideally that you don't need to specify it ahead of time and the compute domain will grow/shrink accordingly.

Contributor Author

Once Grove adds support for automatic DRA we'll remove this; I'll add the comment.

Contributor Author

Added the comment.
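
A sketch of what that comment might look like in deploy.yaml (the wording is illustrative; the node counts match the 1 prefill + 8 decode split described in the review below):

# NOTE: numNodes must equal prefill.nodeCount + decode.nodeCount (1 + 8 = 9 here).
# This manual sizing can be dropped once Grove supports automatic DRA.
metadata:
  name: trtllm-test-compute-domain
spec:
  numNodes: 9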

name: trtllm-disagg-multinode
spec:
pvcs:
- name: modelcache-pvc

Is there a reason we need the PVC here and in each component? Also where is this PVC defined?

Contributor Author

Yes, I'll be adding the same model_downloader setup as in biswapanda's PR.

Contributor Author

Added the model-downloader for the PVC; also, the PVC is no longer mounted into the frontend.

@nvrohanv nvrohanv changed the title from "restructure dsr1 recipes and add gb200" to "feat: restructure dsr1 recipes and add gb200" on Oct 30, 2025
@github-actions github-actions bot added the feat label Oct 30, 2025
@nvrohanv nvrohanv force-pushed the nvrohanv/add-dsr1-gb200-widep-recipe branch from 9bb62ad to 1692c43 on October 30, 2025 11:14
@nvrohanv nvrohanv marked this pull request as ready for review October 30, 2025 11:26
@nvrohanv nvrohanv requested review from a team as code owners October 30, 2025 11:26
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Walkthrough

The changes reorganize TRTLLM configuration file paths from a flat structure to hierarchical aggregated/disaggregated subdirectories, rename model cache PVC references for consistency, add a Kubernetes model-download Job, and introduce new disaggregated deployment configurations for GB200 systems.

Changes

  • Configuration Path Reorganization
    Files: docs/backends/trtllm/README.md, docs/backends/trtllm/multinode/multinode-examples.md, examples/basics/multinode/trtllm/srun_aggregated.sh, examples/basics/multinode/trtllm/srun_disaggregated.sh
    Updated config file paths from the flat structure (wide_ep/) to a hierarchical organization separating aggregated (agg/wide_ep/) and disaggregated (disagg/wide_ep/) configurations across deployment scripts and documentation.
  • PVC and Mount Path Updates
    Files: recipes/deepseek-r1/model-cache/model-cache.yaml, recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml, recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
    Renamed the PVC from model-cache to model-cache-pvc, increased storage from 1000Gi to 1500Gi, updated mount paths from /root/.cache/huggingface to /model-cache, and adjusted model-path arguments.
  • Model Cache Management
    Files: recipes/deepseek-r1/model_cache/model-download.yaml
    Added a Kubernetes Job to orchestrate model downloads via huggingface_hub, installing dependencies and downloading models to the persistent volume with retry and completion settings.
  • Load Balancer Configuration Updates
    Files: recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml
    Updated load_balancer paths to reflect the new directory structure; added a new eplb.yaml for the disaggregated wide-EP configuration with MOE load balancer settings.
  • Disaggregated Deployment Manifests
    Files: recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml, recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml
    Introduced new multi-node disaggregated deployment and benchmark Job configurations for GB200 systems with prefill/decode separation, ConfigMaps for engine configs, DynamoGraphDeployment orchestration, and comprehensive performance benchmarking infrastructure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Path consistency verification: Confirm all references to agg/wide_ep/ and disagg/wide_ep/ are applied uniformly across configuration files and documentation
  • PVC naming propagation: Validate that model-cache-pvc is consistently referenced in all volume mounts and claims across deployment manifests
  • New GB200 manifests: Review correctness of disaggregated deployment topology (prefill single-node, decode multi-node with 8 nodes), ConfigMap structure, and ComputeDomain configuration
  • Benchmark job logic: Assess the perf.yaml Job's shell scripting, aiperf integration, polling mechanisms, and environment variable handling for robustness
  • Model cache Job: Ensure model-download.yaml correctly mounts PVC and executes download commands without issues

Poem

🐰 Paths reorganized, now clean and neat,
Agg and disagg, a structured treat!
PVCs renamed, configs aligned,
GB200 awaits, benchmarks designed!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check — ❓ Inconclusive: The pull request description provides detailed technical content explaining the restructuring of the example layout, the separation of aggregated and disaggregated cases, and the specific changes to the model-cache and model-download files. However, it does not follow the required template structure: it lacks the template headings (Overview, Details, Where should the reviewer start?, Related Issues), does not tell reviewers which files to prioritize, and does not reference any related GitHub issues using the required action keywords (Closes/Fixes/Resolves/Relates to). While the technical content is informative and substantive, the structural deviation from the template and the missing guidance sections make it unclear whether this submission meets the repository's documentation standards. Resolution: restructure the description to follow the provided template, adding an "Overview" section summarizing the PR purpose, a "Details" section with the change descriptions, a "Where should the reviewer start?" section identifying critical files for review (such as recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml and the new model-download.yaml), and a "Related Issues" section with the actual GitHub issue number using appropriate action keywords if applicable.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title "feat: restructure dsr1 recipes and add gb200" accurately reflects the two major themes of the changeset. The raw summary shows extensive restructuring of DeepSeek-R1 recipe paths through new agg/ and disagg/ subdirectories, plus the addition of new GB200-specific deployment files (gb200/deploy.yaml and gb200/perf.yaml). The title is specific enough that a developer scanning history would understand the PR involves recipe reorganization and GB200 support, avoiding vague terms like "misc updates."

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
recipes/deepseek-r1/model_cache/model-download.yaml (1)

1-37: Approve with note on root user security posture.

The model-download Job is well-structured and the hardcoded model names (nvidia/DeepSeek-R1-FP4 and deepseek-ai/DeepSeek-R1) follow the intended recipe-specific pattern per learnings. The PVC naming is consistent, and downloading both FP4 and standard variants provides flexibility for different hardware (GB200 vs. Hopper as noted in documentation). However, Checkov flags CKV_K8S_23 and CKV_K8S_20 for running as root. While this is acceptable for a one-time batch download job, consider running as a non-root user for improved security posture.
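
If the Job is later tightened, a sketch of a non-root container securityContext (the UID/GID values are placeholders; the PVC must be writable by that user, e.g. via a pod-level fsGroup):

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false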

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (2)

76-78: Add clarifying comment for decode nodeCount dependency.

Lines 76–78 appropriately document that ComputeDomain.numNodes must equal prefill.nodeCount + decode.nodeCount. While this is correctly calculated as 1 + 8 = 9 (line 84), the decode section (line 192) lacks a similar comment explaining that its nodeCount: 8 contributes to the ComputeDomain total. For consistency and to aid future maintainers, add a brief comment above line 192 as well.

Add this comment above line 192:

      multinode:
        # NOTE: This nodeCount contributes to ComputeDomain.numNodes (see above)
        nodeCount: 8

Also applies to: 76-192


155-162: Startup probe configuration is appropriate but document the intent.

Both prefill (lines 155–162) and decode (lines 213–220) services use identical startup probe settings with failureThreshold: 500 (~83 minutes at 10-second intervals). A past review suggested adding a comment explaining why these values match and their relationship to PVC speed. Consider adding a clarifying comment to line 155 (or above the prefill startupProbe block):

          # NOTE: Prefill and decode startupProbe thresholds must match.
          #       With PVC latency, model loading can take considerable time.
          #       Adjust failureThreshold based on your PVC performance.
          startupProbe:

Also applies to: 213-220

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bec8b3 and 3ed80d1.

📒 Files selected for processing (14)
  • docs/backends/trtllm/README.md (1 hunks)
  • docs/backends/trtllm/multinode/multinode-examples.md (2 hunks)
  • examples/basics/multinode/trtllm/srun_aggregated.sh (1 hunks)
  • examples/basics/multinode/trtllm/srun_disaggregated.sh (1 hunks)
  • recipes/deepseek-r1/model-cache/model-cache.yaml (1 hunks)
  • recipes/deepseek-r1/model_cache/model-download.yaml (1 hunks)
  • recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml (5 hunks)
  • recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml (5 hunks)
  • recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml (1 hunks)
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
Learnt from: biswapanda
PR: ai-dynamo/dynamo#3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
📚 Learning: 2025-07-31T11:26:48.422Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

Applied to files:

  • docs/backends/trtllm/README.md
  • recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml
  • docs/backends/trtllm/multinode/multinode-examples.md
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml
  • recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml
📚 Learning: 2025-07-02T13:20:28.800Z
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.

Applied to files:

  • docs/backends/trtllm/multinode/multinode-examples.md
  • examples/basics/multinode/trtllm/srun_disaggregated.sh
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

  • recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml
  • recipes/deepseek-r1/model_cache/model-download.yaml
  • recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

  • examples/basics/multinode/trtllm/srun_aggregated.sh
🪛 Checkov (3.2.334)
recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml

[medium] 3-153: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 3-153: Container should not be privileged

(CKV_K8S_16)


[medium] 3-153: Minimize the admission of root containers

(CKV_K8S_23)

recipes/deepseek-r1/model_cache/model-download.yaml

[medium] 3-37: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-37: Minimize the admission of root containers

(CKV_K8S_23)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3891/merge) by nvrohanv.
recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

[error] 73-73: Trailing whitespace found and removed by pre-commit. (pre-commit: trailing-whitespace)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (14)
examples/basics/multinode/trtllm/srun_disaggregated.sh (1)

20-20: Path reorganization for disaggregated deployments looks good.

The default engine config paths now clearly reflect the disaggregated subdirectory structure (disagg/wide_ep/), making the distinction from aggregated configs explicit and improving discoverability.

Also applies to: 24-24

examples/basics/multinode/trtllm/srun_aggregated.sh (1)

21-21: Export addition and path reorganization are appropriate.

Explicitly exporting ENGINE_CONFIG ensures the variable is properly passed to worker containers via the --container-env flag (line 66), improving clarity and reliability of environment propagation in multi-node setups.

recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_prefill.yaml (1)

20-20: Load balancer path reorganization is consistent with disaggregated layout.

The path update to disagg/wide_ep/eplb.yaml aligns with the new directory structure, and the WIDEEP backend field (line 19) is correctly in all caps per TRTLLM LLM-API requirements.

docs/backends/trtllm/README.md (1)

165-165: Documentation path update maintains consistency with reorganized config structure.

The AGG_ENGINE_ARGS path update correctly reflects the new aggregated subdirectory layout (agg/mtp/), ensuring users follow current file organization when running examples.

recipes/deepseek-r1/trtllm/disagg/wide_ep/wide_ep_decode.yaml (1)

20-20: Load balancer path correctly reflects disaggregated config organization.

Mirrors the prefill config structure with disagg/wide_ep/eplb.yaml path, ensuring consistent load balancer reference across disaggregated deployment components.

recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml (1)

10-10: Deployment manifest consistently updates PVC reference and model paths.

All PVC references now use model-cache-pvc (consistent with the renamed PVC), and model paths correctly reference the mount point /model-cache/deepseek-r1. Changes are consistently applied to both decode and prefill workers.

Also applies to: 37-38, 58-58, 90-91, 111-111

recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml (1)

14-14: Aggregated config load balancer path correctly reorganized.

The path update to agg/wide_ep/eplb.yaml aligns with the aggregated subdirectory structure, mirroring the disaggregated pattern and improving config file organization and discoverability.

recipes/deepseek-r1/model-cache/model-cache.yaml (1)

6-6: PVC rename is incomplete—stale references remain across recipes; clarify scope and verify deepseek-r1 manifests are fully updated.

The verification reveals the PVC rename to model-cache-pvc is incomplete. While only 12 references to model-cache-pvc exist, stale references to the old model-cache name persist across multiple recipes (qwen3-32b-fp8, llama-3-70b, gpt-oss-120b, and deepseek-r1 itself).

Confirm whether this PR intends to:

  1. Update only deepseek-r1 manifests (then verify all deepseek-r1 files reference model-cache-pvc)
  2. Update all recipes consistently (then address stale references in qwen3-32b-fp8, llama-3-70b, gpt-oss-120b, and remaining deepseek-r1 files)

Cross-recipe inconsistency risks deployment failures if some manifests reference the old name.

recipes/deepseek-r1/trtllm/disagg/wide_ep/eplb.yaml (1)

1-7: LGTM on structure and licensing.

The new EPLB configuration file is clean, properly licensed, and the MOE load balancer settings (num_slots: 288, layer_updates_per_iter: 2) align with the TensorRT-LLM documentation referenced in the comment. This externalization of the load balancer config from inline definitions is a good refactor for maintainability.
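
For reference, the settings called out above amount to a file along these lines (the exact key layout in the recipe may differ):

# eplb.yaml — MOE expert-parallel load balancer settings (sketch)
num_slots: 288
layer_updates_per_iter: 2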

recipes/deepseek-r1/sglang/disagg-8gpu/deploy.yaml (1)

10-10: Verify consistency of model cache directory naming.

The PVC name is correctly updated to model-cache-pvc (line 10), and mount paths consistently use /model-cache (lines 35–36, 84–85). However, the --model-path arguments reference /model-cache/deepseek-r1 (lines 56, 105) while the model-download job (recipes/deepseek-r1/model_cache/model-download.yaml) downloads to both /model-cache/deepseek-r1-fp4 and /model-cache/deepseek-r1. Confirm whether this is intentional (using the non-FP4 variant) or should match the FP4 path used in other deployments.

Also applies to: 35-36, 84-85, 56-56, 105-105

docs/backends/trtllm/multinode/multinode-examples.md (1)

139-139: Documentation paths correctly aligned with new structure.

The path updates properly reflect the reorganization into agg/wide_ep/ and disagg/wide_ep/ subdirectories. The example commands (lines 139, 168–169) will guide users to the correct configuration files under the new hierarchy. No functional issues detected.

Also applies to: 168-169

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml (1)

59-59: Align hardcoded tokenizer path with model-download outputs.

Line 59 hardcodes the tokenizer path to /model-cache/deepseek-r1-fp4, but the model-download job (recipes/deepseek-r1/model_cache/model-download.yaml) downloads models to both /model-cache/deepseek-r1-fp4 and /model-cache/deepseek-r1. Additionally, the deployment manifest (recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) references /model-cache/deepseek-r1 for the model path. Confirm whether the tokenizer should be parameterized (via an environment variable or ConfigMap) or hardcoded to match the deployment's model path.
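
One way to avoid the hardcoded path, sketched with an illustrative variable name (not present in the original manifest); the actual perf.yaml may resolve this differently:

env:
- name: TOKENIZER_PATH            # hypothetical variable, for illustration only
  value: /model-cache/deepseek-r1-fp4
# ...then reference $TOKENIZER_PATH in the benchmark command instead of the literal path.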

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml (2)

119-119: Container image requires confirmation.

Line 119 references nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0. A past review comment indicated this placeholder "will need to update," and there's a corresponding author comment stating "testing with actual dynamo container today." Confirm that this image tag is finalized and available in the registry before merge, or document if it's intentionally a placeholder for manual deployment configuration.


69-69: Backend field correctly uses all-caps WIDEEP.

Line 69 correctly specifies backend: WIDEEP (all capitals) in the decode config. This aligns with TRTLLM LLM-API requirements for backend field names; based on learnings, this is the correct format.

cache_transceiver_config:
max_tokens_in_buffer: 4608
backend: DEFAULT
stream_interval: 20
Contributor

⚠️ Potential issue | 🟡 Minor

Fix trailing whitespace on line 73.

The pre-commit hook flagged trailing whitespace at line 73. This is in the decode-config ConfigMap YAML. Remove any trailing spaces on this line before merge.

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3891/merge) by nvrohanv.

[error] 73-73: Trailing whitespace found and removed by pre-commit. (pre-commit: trailing-whitespace)

🤖 Prompt for AI Agents
In recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml around line 73,
there is trailing whitespace after "stream_interval: 20"; open the file, remove
any spaces or tabs after the value on line 73 so the line ends immediately after
"20", save the file, and run the pre-commit hook or lint to confirm the trailing
whitespace is gone.

imagePullPolicy: IfNotPresent
name: perf
securityContext:
privileged: true
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Remove privileged mode unless strictly necessary for performance testing.

The container runs with securityContext.privileged: true (line 142), which is flagged as a high-severity security risk (CKV_K8S_16). While the network tuning at line 34 (sysctl -w net.ipv4.ip_local_port_range) may require elevated privileges, consider whether this can be accomplished via allowPrivilegeEscalation and targeted Linux capabilities instead, or defer to the host network configuration.

Is the privileged mode strictly necessary, or can network tuning be handled outside the container or via setLinuxCapabilities?


I'll verify this review comment by examining the actual file and understanding the security context requirements.






🏁 Script executed:

#!/bin/bash
# Find and examine the perf.yaml file
fd -t f "perf.yaml" | head -20

Length of output: 334


🏁 Script executed:

#!/bin/bash
# If found, examine the security context and network tuning sections
fd -t f "perf.yaml" -x cat {}

Length of output: 41288


🏁 Script executed:

#!/bin/bash
# Search for securityContext patterns in Job specs across the codebase
rg -n "securityContext:" -A 5 -B 2 --type=yaml | grep -A 5 -B 2 "privileged"

Length of output: 3917


🏁 Script executed:

#!/bin/bash
# Search for other network tuning or sysctl usage patterns
rg -n "sysctl" -B 5 -A 5 --type=yaml

Length of output: 22978



---

🌐 Web query:

Kubernetes pod securityContext sysctls net.ipv4.ip_local_port_range namespace restriction

💡 Result:

Short answer

  • net.ipv4.ip_local_port_range is in Kubernetes' "safe" (namespaced) sysctls and may be set per-Pod via spec.securityContext.sysctls. [1]
  • It cannot be used when the Pod uses hostNetwork (net namespace must be namespaced). [1]
  • Other sysctls that are not namespaced (node-level) cannot be set in pod securityContext; unsafe sysctls must be explicitly allowed on each node with kubelet's --allowed-unsafe-sysctls. [1]
  • Use admission policies (e.g., Kyverno/PSP replacements) to restrict which sysctls (including allowing only net.ipv4.ip_local_port_range) are permitted in your cluster. [2]

Relevant docs

  • Kubernetes docs: "Using sysctls in a Kubernetes Cluster" (safe vs unsafe, namespaced sysctls, hostNetwork rule, kubelet flag). [1]
  • Example policy enforcement (Kyverno) showing restricting pods to the safe sysctl set including net.ipv4.ip_local_port_range. [2]

Replace container-level privileged: true with pod-level sysctls to eliminate unnecessary privilege escalation.

The container runs with securityContext.privileged: true (line 142) to execute sysctl -w net.ipv4.ip_local_port_range="1024 65000". However, net.ipv4.ip_local_port_range is a namespaced (safe) sysctl in Kubernetes and can be set at the pod level via spec.securityContext.sysctls without requiring privileged mode.

Replace the privileged security context with:

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65000"

Remove the sysctl command from the container startup script. This pattern applies to all perf.yaml files in the repository (recipes/*/perf.yaml).

🤖 Prompt for AI Agents
In recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/perf.yaml around line 142,
the container securityContext sets privileged: true solely to run sysctl -w
net.ipv4.ip_local_port_range="1024 65000"; remove privileged: true and instead
add a pod-level spec.securityContext with runAsUser: 1000, runAsGroup: 1000,
fsGroup: 1000 and a sysctls list containing name: net.ipv4.ip_local_port_range
and value: "1024 65000"; also remove the sysctl command from the container
startup script; apply the same change to all recipes/*/perf.yaml files.

@nvrohanv nvrohanv force-pushed the nvrohanv/add-dsr1-gb200-widep-recipe branch from 8406768 to 6b83018 on October 30, 2025 17:37