Skip to content

Conversation

@dillon-cullinan
Copy link
Contributor

@dillon-cullinan dillon-cullinan commented Oct 24, 2025

Overview:

Re-enables deploy tests now that we have pull-through caches in azure to avoid docker rate limiting.

Summary by CodeRabbit

  • Chores
    • Activated automated backend validation tests for multiple implementations (vllm, sglang, trtllm) in the CI pipeline.
    • Enhanced deployment testing with comprehensive validation steps including configuration management, dependency installation, and endpoint availability checks.

Signed-off-by: Dillon Cullinan <[email protected]>
@dillon-cullinan dillon-cullinan requested a review from a team as a code owner October 24, 2025 20:35
@dillon-cullinan dillon-cullinan changed the title Enable deploy tests ci: Enable deploy tests Oct 24, 2025
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Oct 24, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 27, 2025

Walkthrough

Three previously commented deployment test jobs for vllm, sglang, and trtllm backends are now activated and fully implemented in the GitHub Actions workflow. Each includes matrix strategies, environment configuration, Kubernetes operations, and comprehensive cleanup logic.

Changes

Cohort / Cohort(s) Summary
Backend validation workflow activation
.github/workflows/container-validation-backends.yml
Uncommented and implemented three deploy-test jobs (deploy-test-vllm, deploy-test-sglang, deploy-test-trtllm) with matrix-driven profile strategies, environment variables (FRAMEWORK, DYNAMO_INGRESS_SUFFIX, DEPLOYMENT_FILE, MODEL_NAME), Kubernetes deployment steps including namespace creation, Istio validation, Helm chart installation, kubeconfig setup, and model endpoint verification via HTTP requests. Added cleanup sequences to uninstall resources and namespaces.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Kubernetes and deployment logic: Verify correctness of kubectl, Helm, and yq commands for namespace creation, deployment validation, and resource cleanup
  • Matrix strategy configuration: Ensure profile matrix values are appropriate and environment variables correctly map to each backend
  • HTTP endpoint validation: Review the logic for model availability checks and port-forwarding setup for each backend
  • Job dependencies and permissions: Confirm needs, permissions, and runs-on fields are correctly configured for CI execution

Poem

🐰 Three backends woken from their sleep,
vllm, sglang, trtllm—no longer in the deep!
Kubernetes spins up, Helm charts deploy,
CI tests now run with operational joy! 🚀
Models respond, endpoints aligned,
Automation refined, validation designed! ✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description is largely incomplete compared to the required template. While the "Overview" section is provided and explains the motivation (re-enabling deploy tests due to pull-through caches in Azure), the description is missing three critical sections: "Details" (explaining what changes were made), "Where should the reviewer start?" (identifying specific files for review), and "Related Issues" (linking to relevant issues). Given that the raw summary indicates significant workflow changes with matrix strategies, environment variables, and comprehensive deployment logic, the lack of a Details section makes it difficult for reviewers to understand the technical scope of the changes. Add the missing "Details" section to describe the specific changes made to the workflow file (e.g., activating the deploy-test-vllm, deploy-test-sglang, and deploy-test-trtllm jobs with their matrix-driven strategies and deployment logic). Include a "Where should the reviewer start?" section pointing to .github/workflows/container-validation-backends.yml as the primary file for review. Consider adding a "Related Issues" section if this PR addresses or relates to any GitHub issues, using keywords like "Closes" or "Relates to" as shown in the template.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "ci: Enable deploy tests" is directly and clearly related to the main objective of the pull request, which involves uncommenting and enabling deploy-test jobs for vllm, sglang, and trtllm in the CI workflow. The title is concise, specific enough for a teammate scanning commit history to understand the primary change, and avoids vague terminology or file listings. It appropriately captures the essence of the change from the developer's perspective.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
.github/workflows/container-validation-backends.yml (3)

488-504: Model availability polling lacks timeout safeguard against infinite loops.

The polling loop (lines 490–499) checks $ATTEMPT -le $MAX_ATTEMPTS, but if the loop exits due to ATTEMPT exceeding MAX_ATTEMPTS, the script checks if [ $ATTEMPT -gt $MAX_ATTEMPTS ] and exits with code 1 (line 503). However, if the model is found and the loop breaks early, the script continues to line 505. This logic is correct, but the lack of set -e means a prior step failure could leave the script in an inconsistent state.

Additionally, the model availability check uses a simple string match in jq, which may be fragile if the response format changes or if multiple models are present with similar names.

Consider a more robust check:

  while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
    MODELS_RESPONSE=$(curl -s --retry 5 --retry-delay 2 --retry-connrefused "${LLM_URL}/v1/models")
-   if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" '.data[]?.id == $MODEL_NAME' >/dev/null 2>&1; then
+   if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" '.data[] | select(.id == $MODEL_NAME)' >/dev/null 2>&1; then
      echo "Model $MODEL_NAME is available in /v1/models"
      break
    fi

505-541: API test may fail due to truncated prompt text in curl request.

The curl request at line 505–519 includes a hardcoded prompt with truncated text (line 513: "content": "In the heart of Eldoria...clue is hidden." appears incomplete). This may cause unexpected API behavior or validation failures. Additionally, the -N --no-buffer flags in curl are redundant for a non-streaming POST request (line 505 has "stream":false).

Also, the test validates response fields (choices[0].message.role, model name, choices[0].message.content length) but does not check for HTTP status codes or error fields in the JSON response, which could mask API errors.

  1. Complete the prompt text or replace with a clearer, shorter prompt.
  2. Remove redundant curl flags:
- RESPONSE=$(curl -s -N --no-buffer --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
+ RESPONSE=$(curl -s --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
  1. Add HTTP status code validation:
  RESPONSE=$(curl -s --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
+   -w "\n%{http_code}" \
    ...
+ HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
+ RESPONSE=$(echo "$RESPONSE" | head -n-1)
+ if [ "$HTTP_CODE" != "200" ]; then
+   echo "Test failed: HTTP status code $HTTP_CODE"
+   exit 1
+ fi

542-564: Cleanup step lacks error handling; namespace deletion may hang if resources are stuck.

The cleanup (lines 542–564) runs with if: always() and timeout-minutes: 5, which is good, but the deletion commands use || true to suppress errors. However, the final kubectl delete namespace (line 562) can hang indefinitely if resources have finalizers or if pods are in a stuck state, even though a 5-minute timeout is set at the job level. Additionally, if Helm uninstall fails (line 559), the namespace may still contain Helm-managed resources that block deletion.

Add explicit timeout and force-delete options:

  echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..."
- kubectl delete namespace $NAMESPACE || true
+ kubectl delete namespace $NAMESPACE --timeout=60s --ignore-not-found=true --grace-period=10 || true
  echo "Namespace $NAMESPACE completed."

Also, add validation to ensure DynamoGraphDeployments are deleted before namespace deletion:

  echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..."
  kubectl delete dynamographdeployments --all -n $NAMESPACE || true
+ # Wait for deletion to complete
+ kubectl wait --for=delete dynamographdeployments --all -n $NAMESPACE --timeout=30s 2>/dev/null || true
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bb1af3 and 8186578.

📒 Files selected for processing (1)
  • .github/workflows/container-validation-backends.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/container-validation-backends.yml

346-346: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


565-565: "steps" section is missing in job "deploy-test-sglang"

(syntax-check)


566-566: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


583-583: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)


585-585: "steps" section is missing in job "deploy-test-trtllm"

(syntax-check)


586-586: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


605-605: "steps" section must be sequence node but got alias node with "" tag

(syntax-check)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
.github/workflows/container-validation-backends.yml (6)

346-346: Custom runner label cpu-amd-m5-2xlarge is flagged by actionlint but likely valid for self-hosted runners.

Static analysis flags cpu-amd-m5-2xlarge as an unknown label (lines 346, 566, 586). This label appears to be a custom self-hosted runner tag, which is valid if configured in your GitHub Actions runner setup. However, actionlint needs configuration to recognize custom labels.

To suppress this warning, ensure your .actionlintrc.yaml or .actionlint.yaml config file includes a self_hosted_runner section:

self_hosted_runner:
  labels:
    - cpu-amd-m5-2xlarge
    - gpu-l40-amd64
    - cpu-arm-r8g-4xlarge

Alternatively, if these runners are not yet configured, verify the label names in your GitHub Actions runner configuration.

Verify that the runner label cpu-amd-m5-2xlarge is registered in your GitHub Actions environment and update actionlint configuration if necessary.

Also applies to: 566-566, 586-586


348-348: Job dependencies are correct, but verify that all three deploy-test jobs require the operator build to succeed.

All three deploy-test jobs depend on [changed-files, operator, vllm/sglang/trtllm] (lines 348, 568, 588), ensuring that the operator image is built and available in Azure ACR before deployment tests run. However, each job needs the framework-specific image (e.g., ${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64), which is pushed by the framework-specific build job (vllm, sglang, trtllm).

This dependency chain is correctly modeled, but verify that the Azure pull-through cache is configured to serve all three framework images during the deployment test step.

Verify:

  1. Azure ACR pull-through cache is configured for all three backend images.
  2. The image registry path in the deployment step (line 452) matches the pushed image path from the build jobs (e.g., lines 131, 199, 266).

Also applies to: 568-568, 588-588


585-605: YAML alias syntax is valid; review comment is incorrect.

The original claim that steps: *deploy-test-steps is "invalid GitHub Actions workflow syntax" is factually incorrect. GitHub Actions workflow files support YAML anchors (&) and aliases (*), and the workflow correctly defines the anchor at line 365 (steps: &deploy-test-steps) and references it at lines 583 and 605.

The matrix profile differences are accurate (deploy-test-sglang has 2 profiles; deploy-test-vllm and deploy-test-trtllm have 4 profiles), but profile support differences between backends are intentional deployment decisions, not code errors.

Likely an incorrect or invalid review comment.


565-583: YAML alias is valid; remove critical issue claim.

The steps: *deploy-test-steps reference is valid GitHub Actions syntax. The workflow correctly defines the anchor with &deploy-test-steps in the deploy-test-vllm job and reuses it via the alias in deploy-test-sglang. GitHub Actions supports YAML anchors and aliases.

The matrix profile difference is real: deploy-test-sglang uses only [agg, agg_router] while deploy-test-vllm and deploy-test-trtllm include [agg, agg_router, disagg, disagg_router]. Verify whether this is intentional based on sglang backend capabilities, but this is not a code defect requiring fixes—the job will execute correctly.

Likely an incorrect or invalid review comment.


345-364: The workflow pattern using YAML aliases for steps is supported and valid in GitHub Actions.

GitHub Actions supports YAML anchors (&) and aliases (*) in workflows, and the pattern used here—defining steps: &deploy-test-steps in one job and referencing it via steps: *deploy-test-steps in other jobs—is valid and documented. Job-level env variables properly scope to steps in that job and override workflow-level env when specified.

When deploy-test-sglang and deploy-test-trtllm execute the aliased steps, the environment variables from each job's env: block (FRAMEWORK: sglang, FRAMEWORK: trtllm, etc.) will correctly apply to those steps. Variables like ${FRAMEWORK} will expand to the correct framework name for each job.

The workflow is correctly structured and should execute without issues. No changes are required.


448-504: The suggested fix is incompatible with the DynamoGraphDeployment custom resource.

The review correctly identifies a potential race condition: the fixed 20-second sleep may be insufficient for the operator to create and label pods. However, the suggested fix is invalid. The workflow applies a DynamoGraphDeployment custom resource (not a standard Kubernetes Deployment), so using kubectl wait --for=condition=Progressing=True deployment will not work—this command targets standard Deployment resources.

The DynamoGraphDeployment operator automatically injects the label nvidia.com/dynamo-graph-deployment-name into pods, so the label concern is addressed. The real issue is timing: ensuring the operator processes the resource and creates labeled pods before the wait timeout.

The fixed 20-second sleep warrants review, but a corrected approach would need to either: (1) increase the sleep duration empirically, or (2) implement a better wait strategy that checks for the custom resource's readiness conditions if available.

Likely an incorrect or invalid review comment.

@dillon-cullinan dillon-cullinan merged commit 2c2eb8f into main Oct 28, 2025
32 of 33 checks passed
@dillon-cullinan dillon-cullinan deleted the reenable-deploy-tests branch October 28, 2025 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Issues/PRs that reference CI build/test size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants