ci: Enable deploy tests #3885
Conversation
Signed-off-by: Dillon Cullinan <[email protected]>
Walkthrough: Three previously commented-out deployment test jobs for the vllm, sglang, and trtllm backends are now activated and fully implemented in the GitHub Actions workflow. Each includes matrix strategies, environment configuration, Kubernetes operations, and comprehensive cleanup logic.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
.github/workflows/container-validation-backends.yml (3)
488-504: Model availability polling lacks a timeout safeguard against infinite loops.

The polling loop (lines 490–499) checks `$ATTEMPT -le $MAX_ATTEMPTS`; if the loop exits because `ATTEMPT` exceeded `MAX_ATTEMPTS`, the script checks `if [ $ATTEMPT -gt $MAX_ATTEMPTS ]` and exits with code 1 (line 503). If the model is found and the loop breaks early, the script continues to line 505. This logic is correct, but the lack of `set -e` means a prior step failure could leave the script in an inconsistent state.

Additionally, the model availability check uses a simple string match in `jq`, which may be fragile if the response format changes or if multiple models with similar names are present.

Consider a more robust check:

```diff
 while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
   MODELS_RESPONSE=$(curl -s --retry 5 --retry-delay 2 --retry-connrefused "${LLM_URL}/v1/models")
-  if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" '.data[]?.id == $MODEL_NAME' >/dev/null 2>&1; then
+  if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" '.data[] | select(.id == $MODEL_NAME)' >/dev/null 2>&1; then
     echo "Model $MODEL_NAME is available in /v1/models"
     break
   fi
```
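The exit-condition logic the comment walks through can be isolated in a few lines; a runnable sketch with the curl/jq check stubbed out (the attempt counts and the stub's success-on-third-try behavior are illustrative, not from the workflow):

```shell
#!/bin/sh
MAX_ATTEMPTS=5
ATTEMPT=1
FOUND=""

# Stub for the real curl+jq availability check; succeeds on the third attempt here.
check_model() { [ "$ATTEMPT" -ge 3 ]; }

while [ "$ATTEMPT" -le "$MAX_ATTEMPTS" ]; do
  if check_model; then
    FOUND=yes
    break
  fi
  ATTEMPT=$((ATTEMPT + 1))
done

# The loop can only end via break or via ATTEMPT passing MAX_ATTEMPTS, so this
# final check is the safeguard that keeps a timeout from being treated as success.
if [ "$ATTEMPT" -gt "$MAX_ATTEMPTS" ]; then
  echo "model never became available"
  exit 1
fi
echo "model found after $ATTEMPT attempts"
```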
505-541: API test may fail due to truncated prompt text in the curl request.

The curl request at lines 505–519 includes a hardcoded prompt with truncated text (line 513: `"content": "In the heart of Eldoria...clue is hidden."` appears incomplete). This may cause unexpected API behavior or validation failures. Additionally, the `-N --no-buffer` flags in curl are redundant for a non-streaming POST request (line 505 has `"stream": false`).

Also, the test validates response fields (`choices[0].message.role`, `model` name, `choices[0].message.content` length) but does not check the HTTP status code or error fields in the JSON response, which could mask API errors.

- Complete the prompt text or replace it with a clearer, shorter prompt.
- Remove the redundant curl flags:

```diff
-RESPONSE=$(curl -s -N --no-buffer --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
+RESPONSE=$(curl -s --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
```

- Add HTTP status code validation:

```diff
 RESPONSE=$(curl -s --retry 10 --retry-delay 5 --retry-connrefused -X POST "${LLM_URL}/v1/chat/completions" \
+  -w "\n%{http_code}" \
 ...
+HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
+RESPONSE=$(echo "$RESPONSE" | head -n-1)
+if [ "$HTTP_CODE" != "200" ]; then
+  echo "Test failed: HTTP status code $HTTP_CODE"
+  exit 1
+fi
```
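The `-w "\n%{http_code}"` pattern appends the status code as a final line; splitting it back out needs only line-based tools. A sketch with the curl call replaced by a canned response (the JSON body is made up):

```shell
#!/bin/sh
# Stand-in for the real call:
#   RESPONSE=$(curl -s -w "\n%{http_code}" ... "${LLM_URL}/v1/chat/completions")
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"hi"}}]}
200'

HTTP_CODE=$(echo "$RESPONSE" | tail -n1)   # last line holds the status code
BODY=$(echo "$RESPONSE" | sed '$d')        # everything above it is the JSON body

if [ "$HTTP_CODE" != "200" ]; then
  echo "Test failed: HTTP status code $HTTP_CODE"
  exit 1
fi
echo "status ok"
```

Note that `head -n-1`, used in the suggested diff above, is GNU-specific; `sed '$d'` trims the last line portably.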
542-564: Cleanup step lacks error handling; namespace deletion may hang if resources are stuck.

The cleanup (lines 542–564) runs with `if: always()` and `timeout-minutes: 5`, which is good, but the deletion commands use `|| true` to suppress errors. The final `kubectl delete namespace` (line 562) can hang indefinitely if resources have finalizers or if pods are stuck, even though a 5-minute timeout is set at the job level. Additionally, if the Helm uninstall fails (line 559), the namespace may still contain Helm-managed resources that block deletion.

Add explicit timeout and force-delete options:

```diff
 echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..."
-kubectl delete namespace $NAMESPACE || true
+kubectl delete namespace $NAMESPACE --timeout=60s --ignore-not-found=true --grace-period=10 || true
 echo "Namespace $NAMESPACE completed."
```

Also, add validation to ensure DynamoGraphDeployments are deleted before namespace deletion:

```diff
 echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..."
 kubectl delete dynamographdeployments --all -n $NAMESPACE || true
+# Wait for deletion to complete
+kubectl wait --for=delete dynamographdeployments --all -n $NAMESPACE --timeout=30s 2>/dev/null || true
```
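Put together, the two suggestions might land in the workflow as a step shaped like this (a sketch only: the step name, Helm release name, and timeout values are placeholders, not taken from the actual workflow):

```yaml
- name: Cleanup
  if: always()
  timeout-minutes: 5
  run: |
    # Delete the custom resources first so the operator tears down its children
    kubectl delete dynamographdeployments --all -n $NAMESPACE || true
    kubectl wait --for=delete dynamographdeployments --all -n $NAMESPACE --timeout=30s 2>/dev/null || true
    # Placeholder release name; the workflow's real helm uninstall is at line 559
    helm uninstall dynamo-platform -n $NAMESPACE || true
    # Bounded namespace deletion so a stuck finalizer cannot hang the job
    kubectl delete namespace $NAMESPACE --timeout=60s --ignore-not-found=true || true
```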
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`.github/workflows/container-validation-backends.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/container-validation-backends.yml
346-346: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
565-565: "steps" section is missing in job "deploy-test-sglang"
(syntax-check)
566-566: label "cpu-amd-m5-2xlarge" is unknown (same finding and available-labels list as 346-346 above). if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
583-583: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
585-585: "steps" section is missing in job "deploy-test-trtllm"
(syntax-check)
586-586: label "cpu-amd-m5-2xlarge" is unknown (same finding and available-labels list as 346-346 above). if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
605-605: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
.github/workflows/container-validation-backends.yml (6)
346-346: Custom runner label `cpu-amd-m5-2xlarge` is flagged by actionlint but likely valid for self-hosted runners.

Static analysis flags `cpu-amd-m5-2xlarge` as an unknown label (lines 346, 566, 586). This label appears to be a custom self-hosted runner tag, which is valid if configured in your GitHub Actions runner setup. However, actionlint needs configuration to recognize custom labels.

To suppress this warning, ensure your actionlint config file (typically `.github/actionlint.yaml`) includes a `self_hosted_runner` section:

```yaml
self_hosted_runner:
  labels:
    - cpu-amd-m5-2xlarge
    - gpu-l40-amd64
    - cpu-arm-r8g-4xlarge
```

Alternatively, if these runners are not yet configured, verify the label names in your GitHub Actions runner configuration.

Verify that the runner label `cpu-amd-m5-2xlarge` is registered in your GitHub Actions environment and update the actionlint configuration if necessary.

Also applies to: 566-566, 586-586
348-348: Job dependencies are correct, but verify that all three deploy-test jobs require the operator build to succeed.

All three deploy-test jobs depend on `[changed-files, operator, vllm/sglang/trtllm]` (lines 348, 568, 588), ensuring that the operator image is built and available in Azure ACR before the deployment tests run. However, each job needs the framework-specific image (e.g., `${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64`), which is pushed by the framework-specific build job (vllm, sglang, trtllm).

This dependency chain is correctly modeled, but verify that the Azure pull-through cache is configured to serve all three framework images during the deployment test step.

Verify:
- Azure ACR pull-through cache is configured for all three backend images.
- The image registry path in the deployment step (line 452) matches the pushed image path from the build jobs (e.g., lines 131, 199, 266).

Also applies to: 568-568, 588-588
585-605: YAML alias syntax is valid; the review comment is incorrect.

The original claim that `steps: *deploy-test-steps` is "invalid GitHub Actions workflow syntax" is factually incorrect. GitHub Actions workflow files support YAML anchors (`&`) and aliases (`*`), and the workflow correctly defines the anchor at line 365 (`steps: &deploy-test-steps`) and references it at lines 583 and 605.

The matrix profile differences are accurate (deploy-test-sglang has 2 profiles; deploy-test-vllm and deploy-test-trtllm have 4), but profile support differences between backends are intentional deployment decisions, not code errors.

Likely an incorrect or invalid review comment.
565-583: YAML alias is valid; remove the critical-issue claim.

The `steps: *deploy-test-steps` reference is valid GitHub Actions syntax. The workflow correctly defines the anchor with `&deploy-test-steps` in the deploy-test-vllm job and reuses it via the alias in deploy-test-sglang. GitHub Actions supports YAML anchors and aliases.

The matrix profile difference is real: deploy-test-sglang uses only `[agg, agg_router]` while deploy-test-vllm and deploy-test-trtllm include `[agg, agg_router, disagg, disagg_router]`. Verify whether this is intentional based on sglang backend capabilities, but it is not a code defect requiring fixes; the job will execute correctly.

Likely an incorrect or invalid review comment.
345-364: The workflow pattern using YAML aliases for `steps` is supported and valid in GitHub Actions.

GitHub Actions supports YAML anchors (`&`) and aliases (`*`) in workflows, and the pattern used here, defining `steps: &deploy-test-steps` in one job and referencing it via `steps: *deploy-test-steps` in other jobs, is valid and documented. Job-level env variables properly scope to steps in that job and override workflow-level env when specified.

When deploy-test-sglang and deploy-test-trtllm execute the aliased steps, the environment variables from each job's `env:` block (`FRAMEWORK: sglang`, `FRAMEWORK: trtllm`, etc.) will correctly apply to those steps. Variables like `${FRAMEWORK}` will expand to the correct framework name for each job.

The workflow is correctly structured and should execute without issues. No changes are required.
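Reduced to a toy example, the anchor/alias pattern described above looks like this (the job layout mirrors the workflow, but the echo step is invented for illustration):

```yaml
jobs:
  deploy-test-vllm:
    env:
      FRAMEWORK: vllm
    steps: &deploy-test-steps          # anchor: defines the shared step list
      - name: Report framework
        run: echo "testing ${FRAMEWORK}"
  deploy-test-sglang:
    env:
      FRAMEWORK: sglang
    steps: *deploy-test-steps          # alias: expands to the same list at parse time
```

Because the alias is resolved by the YAML parser before the runner sees the workflow, each job's own `env:` block still applies to the expanded steps at runtime.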
448-504: The suggested fix is incompatible with the DynamoGraphDeployment custom resource.

The review correctly identifies a potential race condition: the fixed 20-second sleep may be insufficient for the operator to create and label pods. However, the suggested fix is invalid. The workflow applies a DynamoGraphDeployment custom resource (not a standard Kubernetes Deployment), so using `kubectl wait --for=condition=Progressing=True deployment` will not work; that command targets standard Deployment resources.

The DynamoGraphDeployment operator automatically injects the label `nvidia.com/dynamo-graph-deployment-name` into pods, so the label concern is addressed. The real issue is timing: ensuring the operator processes the resource and creates labeled pods before the wait timeout.

The fixed 20-second sleep warrants review, but a corrected approach would need to either (1) increase the sleep duration empirically, or (2) implement a better wait strategy that checks the custom resource's readiness conditions if available.

Likely an incorrect or invalid review comment.
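One shape for that "better wait strategy" is a polling loop with an explicit deadline against the custom resource's status. A sketch with the `kubectl` call stubbed out, since the CRD's actual status field is not shown in this PR (the `successful` state name, deadline, and stub timing are all assumptions):

```shell
#!/bin/sh
DEADLINE=30   # seconds; illustrative
INTERVAL=1
ELAPSED=0

# Stub for something like:
#   kubectl get dynamographdeployment "$NAME" -n "$NAMESPACE" -o jsonpath='{.status.state}'
# Here it reports "successful" once 3 simulated seconds have passed.
get_state() {
  if [ "$ELAPSED" -ge 3 ]; then echo "successful"; else echo "pending"; fi
}

while [ "$ELAPSED" -lt "$DEADLINE" ]; do
  STATE=$(get_state)
  if [ "$STATE" = "successful" ]; then
    break
  fi
  ELAPSED=$((ELAPSED + INTERVAL))   # the real loop would `sleep $INTERVAL` here
done

if [ "$STATE" != "successful" ]; then
  echo "deployment did not become ready within ${DEADLINE}s"
  exit 1
fi
echo "ready after ${ELAPSED}s"
```

Unlike a fixed `sleep 20`, the deadline bounds the wait while still succeeding as soon as the operator reports readiness.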
Signed-off-by: Dillon Cullinan <[email protected]>
Overview:
Re-enables deploy tests now that we have pull-through caches in Azure to avoid Docker rate limiting.