@Aphoh (Contributor) commented Oct 27, 2025

Overview:

Adds GPU/board part numbers and trtllm environment variables to the config dump.

Details:

  • Uses nvidia-smi's XML query output (nvidia-smi -q -x) to collect part numbers in addition to driver version, memory size, etc.
  • Adds several environment variable prefixes used by trtllm
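
For readers unfamiliar with the XML output, here is a minimal sketch of the parsing step described above. It is not the PR's code: the function name `parse_gpu_info` and the sample document (including all values in it) are hypothetical, and the sample is a heavily trimmed stand-in for what `nvidia-smi -q -x` prints. Missing fields are simply omitted, which mirrors the behavior discussed later in the review.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample shaped like `nvidia-smi -q -x` output.
SAMPLE_XML = """<nvidia_smi_log>
  <driver_version>550.54.15</driver_version>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <board_part_number>000-00000-0000-000</board_part_number>
    <gpu_part_number>0000-000-A1</gpu_part_number>
    <fb_memory_usage>
      <total>81559 MiB</total>
    </fb_memory_usage>
  </gpu>
</nvidia_smi_log>"""

# Output key -> element path relative to each <gpu> element.
GPU_FIELDS = {
    "name": "product_name",
    "memory_total": "fb_memory_usage/total",
    "board_part_number": "board_part_number",
    "gpu_part_number": "gpu_part_number",
}


def parse_gpu_info(xml_text):
    """Return one dict per <gpu> element; fields absent from the XML are omitted."""
    root = ET.fromstring(xml_text)
    # The driver version lives on the root element, not on each GPU.
    driver_version = root.findtext("driver_version", default="unknown")
    gpus = []
    for gpu_elem in root.findall("gpu"):
        info = {"driver_version": driver_version}
        for key, path in GPU_FIELDS.items():
            text = gpu_elem.findtext(path)
            if text is not None:
                info[key] = text
        gpus.append(info)
    return gpus
```

Calling `parse_gpu_info(SAMPLE_XML)` yields a single dict carrying the driver version, name, memory total, and both part numbers from the sample.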

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced environment variable capture for system configuration diagnostics, now tracking additional configuration parameters across multiple system components
    • Improved GPU information detection and system diagnostics, providing more reliable hardware identification

@Aphoh requested review from a team as code owners October 27, 2025 18:34
@github-actions bot added the feat label Oct 27, 2025
@coderabbitai (bot, Contributor) commented Oct 27, 2025

Walkthrough

The configuration dump module has been extended in two ways: environment variable prefixes have been expanded with 14 new prefixes to capture additional system configurations, and GPU system information extraction has been refactored to parse nvidia-smi XML output instead of CSV format.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Configuration dump module enhancements: components/src/dynamo/common/config_dump/environment.py | Extended DEFAULT_ENV_PREFIXES list with 14 new environment variable name prefixes: LLM_, TLLM_, TRT_LLM_, TRTLLM_, NVIDIA_, NSYS_, GENERATE_CU_, OVERRIDE_, TOKENIZERS_, DISABLE_TORCH_, PYTORCH_, ENABLE_PERFECT_ROUTER, FLA_, NEMOTRON_ |
| System information extraction refactor: components/src/dynamo/common/config_dump/system_info.py | Migrated nvidia-smi output parsing from CSV to XML format using xml.etree.ElementTree. Extracts driver_version from root element with "unknown" default, and GPU details (product_name, fb_memory_usage/total, board_part_number, gpu_part_number) from GPU elements. Updated error handling to catch ET.ParseError. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • XML parsing logic: Verify correct element paths and attribute extraction from nvidia-smi XML output
  • Error handling completeness: Ensure ET.ParseError and all edge cases (missing elements, malformed XML) are properly handled
  • Data extraction consistency: Confirm GPU details are correctly mapped from XML structure and maintain compatibility with downstream consumers

Poem

🐰 A bunny's tale of configs bright,
New prefixes caught in morning light,
XML tags replace the CSV way,
GPU data parsed anew today,
Configuration hoppy and refined! 🎉

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The PR title "feat: add trtllm env variables + gpu part numbers to config dump" accurately and concisely summarizes the main changes in the pull request. It clearly captures both primary aspects: the addition of TRTLLM-related environment variable prefixes and the extraction of GPU/board part numbers via XML parsing. The title is specific and meaningful enough that a developer scanning the project history would immediately understand the change's purpose without requiring additional context. |
| Description Check | ✅ Passed | The PR description provides the two most essential sections from the template: a clear Overview explaining what was added (GPU/board part numbers and TRTLLM environment variables to config dump) and Details describing how the changes were implemented (XML parsing of nvidia-smi output and new environment prefixes). However, the description is missing two template sections: "Where should the reviewer start?" which would guide reviewers to specific files, and "Related Issues" which would link to relevant GitHub issues. Despite these omissions, the description contains sufficient information for understanding the PR's scope and implementation, though the missing sections would enhance reviewer guidance. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
components/src/dynamo/common/config_dump/system_info.py (1)

138-164: Consider adding default values for missing GPU fields.

When optional fields like product_name, board_part_number, or gpu_part_number are missing, they're omitted from the gpu_info dict entirely. This creates inconsistent dict structures across GPUs.

Consider providing default values for consistency:

                # Extract product name
                product_name = gpu_elem.find("product_name")
-                if product_name is not None:
-                    gpu_info["name"] = product_name.text
+                gpu_info["name"] = product_name.text if product_name is not None else "unknown"

                # Extract driver version
                gpu_info["driver_version"] = driver_version

                # Extract memory total
                fb_memory = gpu_elem.find("fb_memory_usage/total")
-                if fb_memory is not None:
-                    gpu_info["memory_total"] = fb_memory.text
+                gpu_info["memory_total"] = fb_memory.text if fb_memory is not None else "unknown"

                # Extract board part number
                board_part = gpu_elem.find("board_part_number")
-                if board_part is not None:
-                    gpu_info["board_part_number"] = board_part.text
+                gpu_info["board_part_number"] = board_part.text if board_part is not None else None

                # Extract GPU part number
                gpu_part = gpu_elem.find("gpu_part_number")
-                if gpu_part is not None:
-                    gpu_info["gpu_part_number"] = gpu_part.text
+                gpu_info["gpu_part_number"] = gpu_part.text if gpu_part is not None else None
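
One way to keep the repeated conditional expressions in this suggestion short is a small lookup helper. The sketch below is illustrative only: `text_or_default` and `build_gpu_info` are hypothetical names, not functions from the PR. It also treats an element that is present but empty as missing, which the inline conditionals in the diff would not do.

```python
import xml.etree.ElementTree as ET


def text_or_default(parent, path, default=None):
    """Return the stripped text at `path`, or `default` if it is missing or empty."""
    elem = parent.find(path)
    if elem is None or elem.text is None or not elem.text.strip():
        return default
    return elem.text.strip()


def build_gpu_info(gpu_elem, driver_version):
    """Always emit the same keys, so every GPU dict has an identical shape."""
    return {
        "name": text_or_default(gpu_elem, "product_name", "unknown"),
        "driver_version": driver_version,
        "memory_total": text_or_default(gpu_elem, "fb_memory_usage/total", "unknown"),
        "board_part_number": text_or_default(gpu_elem, "board_part_number"),
        "gpu_part_number": text_or_default(gpu_elem, "gpu_part_number"),
    }
```

With this shape, downstream consumers can index any of the five keys without first checking whether it exists.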
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e213d9 and 6c44fa8.

📒 Files selected for processing (2)
  • components/src/dynamo/common/config_dump/environment.py (1 hunks)
  • components/src/dynamo/common/config_dump/system_info.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
components/src/dynamo/common/config_dump/system_info.py

114-118: Starting a process with a partial executable path

(S607)


124-124: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)


171-171: Do not catch blind exception: Exception

(BLE001)
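
For context on S607: one common way to avoid launching a process from a partial executable path is to resolve the binary with `shutil.which` first. The sketch below assumes that approach; `run_xml_query` is a hypothetical helper, not the PR's code, and the `-q -x` flags are the standard nvidia-smi query and XML-format options.

```python
import shutil
import subprocess


def run_xml_query(executable="nvidia-smi", timeout=10.0):
    """Return the XML text printed by `<executable> -q -x`, or None if unavailable."""
    # Resolve the bare name to an absolute path once, up front.
    resolved = shutil.which(executable)
    if resolved is None:
        return None
    try:
        result = subprocess.run(
            [resolved, "-q", "-x"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

On a machine without the binary on PATH, the function returns None instead of raising, which suits a best-effort diagnostics path.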

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: sglang
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
components/src/dynamo/common/config_dump/environment.py (1)

24-37: Verify broad prefix scope is intentional.

Several newly added prefixes are quite broad and may capture many environment variables:

  • NVIDIA_ - catches all NVIDIA-related variables
  • OVERRIDE_ - catches any override-related variables
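  • PYTORCH_ - thematically overlaps with the existing TORCH_ prefix (line 20), although with startswith-style matching neither prefix captures the other's variables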

Ensure this broad scope aligns with your config dump requirements, as it may capture more variables than necessary.
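
To check how much each prefix actually captures, a quick audit along these lines may help. It is a sketch under the assumption that prefixes are matched with `str.startswith` (the matching code is not shown in this thread); `group_by_prefix` is a hypothetical helper, and the sample environment values are made up.

```python
from collections import defaultdict


def group_by_prefix(env, prefixes):
    """Map each prefix to the sorted variable names it captures (startswith matching)."""
    captured = defaultdict(list)
    for name in sorted(env):
        for prefix in prefixes:
            if name.startswith(prefix):
                captured[prefix].append(name)
    return dict(captured)


# Hypothetical environment used purely for illustration.
sample_env = {
    "NVIDIA_VISIBLE_DEVICES": "all",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    "TORCH_LOGS": "dynamo",
    "OVERRIDE_FOO": "1",
    "HOME": "/root",
}
report = group_by_prefix(sample_env, ["NVIDIA_", "OVERRIDE_", "PYTORCH_", "TORCH_"])
```

Passing `os.environ` instead of `sample_env` gives the same report for a live process. Under this matching assumption, names beginning with PYTORCH_ are not captured by TORCH_, so the two prefixes complement each other rather than duplicate work.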

components/src/dynamo/common/config_dump/system_info.py (3)

111-111: LGTM: XML parsing approach is sound.

The switch from CSV to XML format provides structured access to additional fields like part numbers.

Also applies to: 116-117


124-124: Static analysis warning is a false positive here.

The static analysis tool flags XML parsing as potentially vulnerable (S314), but since you're parsing output from the locally-executed nvidia-smi command (a trusted system utility), this is not a security concern in this context.


167-172: Broad exception handling is acceptable here.

While catching bare Exception (flagged by BLE001) is generally discouraged, it's appropriate for this best-effort function that should never crash the caller. The debug logging ensures issues are still visible.
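
The best-effort pattern referred to here can be sketched as follows. This is a generic illustration, not the code from system_info.py; `best_effort` is a hypothetical wrapper name.

```python
import logging

logger = logging.getLogger(__name__)


def best_effort(func, default=None):
    """Call `func()` and return its result; on any error, log at debug level and return `default`."""
    try:
        return func()
    except Exception as exc:  # noqa: BLE001 - diagnostics must never crash the caller
        logger.debug("best-effort call failed: %s", exc)
        return default
```

For example, `best_effort(lambda: 1 / 0, default={})` returns `{}` and leaves a debug log entry, while `best_effort(lambda: 42)` returns `42`.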

"TOKENIZERS_",
"DISABLE_TORCH_",
"PYTORCH_",
"ENABLE_PERFECT_ROUTER",
⚠️ Potential issue | 🟡 Minor

Fix inconsistent prefix pattern.

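Line 35 lacks a trailing underscore, unlike all other prefixes. If prefixes are matched with str.startswith, the current value still matches the exact name ENABLE_PERFECT_ROUTER and variants like ENABLE_PERFECT_ROUTER_FOO, but it also matches unrelated names such as ENABLE_PERFECT_ROUTERS. Note that adding the underscore would stop the exact name ENABLE_PERFECT_ROUTER from matching, so confirm which names should be captured before applying the change.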

Apply this diff to fix the inconsistency:

-    "ENABLE_PERFECT_ROUTER",
+    "ENABLE_PERFECT_ROUTER_",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    "ENABLE_PERFECT_ROUTER",
+    "ENABLE_PERFECT_ROUTER_",
🤖 Prompt for AI Agents
In components/src/dynamo/common/config_dump/environment.py around line 35, the
prefix "ENABLE_PERFECT_ROUTER" is missing the trailing underscore used by all
other prefixes. If only suffixed variants such as ENABLE_PERFECT_ROUTER_FOO
should be captured, update the prefix to "ENABLE_PERFECT_ROUTER_"; if the exact
variable ENABLE_PERFECT_ROUTER must still be captured, keep the prefix as is,
because a startswith match on "ENABLE_PERFECT_ROUTER_" excludes it.
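
Whether the trailing underscore matters depends on how prefixes are matched. The sketch below assumes plain `str.startswith` matching, which is an assumption since the matching code is not shown in this thread; `captured` and the variable names in `NAMES` are illustrative only.

```python
# Candidate variable names: the exact name, a suffixed variant, and a look-alike.
NAMES = [
    "ENABLE_PERFECT_ROUTER",
    "ENABLE_PERFECT_ROUTER_FOO",
    "ENABLE_PERFECT_ROUTERS",
]


def captured(prefix, names=NAMES):
    """Return the names that a startswith-style prefix match would capture."""
    return [name for name in names if name.startswith(prefix)]


without_underscore = captured("ENABLE_PERFECT_ROUTER")
with_underscore = captured("ENABLE_PERFECT_ROUTER_")
```

Under this assumption, the prefix without the underscore captures all three names, while the prefix with the underscore captures only ENABLE_PERFECT_ROUTER_FOO and drops the exact name.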
