fix(scripts): add signal handlers to nemoclaw-start.sh for graceful container shutdown#1024
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe entrypoint script now stores the auto-pair watcher PID in Changes
Sequence Diagram(s)sequenceDiagram
participant Entrypoint as Entrypoint script (PID 1)
participant Gateway as Gateway process
participant AutoPair as Auto-pair watcher
participant External as External signal source
Entrypoint->>Gateway: start gateway (capture GATEWAY_PID)
Entrypoint->>AutoPair: start auto-pair watcher in background (capture AUTO_PAIR_PID)
note right of Entrypoint: register trap for SIGTERM / SIGINT
rect rgba(0,128,0,0.5)
External->>Entrypoint: SIGTERM / SIGINT
end
Entrypoint->>Gateway: forward SIGTERM / SIGINT
Entrypoint->>AutoPair: forward SIGTERM / SIGINT (if set)
Entrypoint->>Entrypoint: wait for Gateway and AutoPair to exit
Gateway-->>Entrypoint: exit
AutoPair-->>Entrypoint: exit
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
250-255: Consider extracting the duplicatedcleanup()function.The
cleanup()function is defined identically in both the non-root (lines 250-254) and root (lines 311-315) code paths. Consider defining it once near the top of the script (afterstart_auto_pairbut before the branching logic), or as a helper function that can be reused.Also applies to: 311-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 250 - 255, The cleanup() function is duplicated in both branches; extract it into a single reusable function placed after start_auto_pair and before the root vs non-root branching so both paths call the same cleanup. Remove the duplicate definitions at the later branch locations, ensure the single cleanup still references GATEWAY_PID and AUTO_PAIR_PID (and uses kill -TERM and wait "$GATEWAY_PID"), and keep the trap registration (trap cleanup SIGTERM SIGINT) once after the unified definition so signal handling remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 311-320: The cleanup() function currently only waits for
GATEWAY_PID and returns (causing the later wait "$GATEWAY_PID" to fail/reap
twice and leaving AUTO_PAIR_PID un-waited); update cleanup() to kill both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}" (as already done), then wait for both
PIDs (conditionally if AUTO_PAIR_PID is set), capture the exit code of the
gateway wait into a variable, and call exit with that status from inside
cleanup() so the script doesn't perform a second wait afterwards; reference
GATEWAY_PID, AUTO_PAIR_PID, and cleanup to locate and change the logic
accordingly.
- Around line 250-258: The trap handler cleanup() currently reaps GATEWAY_PID
and doesn't wait for AUTO_PAIR_PID, and the later wait "$GATEWAY_PID" can fail
and overwrite the real exit code; modify cleanup() to forward the signal to both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}", then wait for each child (use
conditional waits if AUTO_PAIR_PID may be empty), capture their exit statuses
into variables (e.g., gateway_status and autopair_status), and exit the script
from cleanup() with the gateway_status so the gateway's real exit code is
preserved; remove or guard the standalone wait "$GATEWAY_PID" + exit $? at the
end to avoid double-waiting.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 250-255: The cleanup() function is duplicated in both branches;
extract it into a single reusable function placed after start_auto_pair and
before the root vs non-root branching so both paths call the same cleanup.
Remove the duplicate definitions at the later branch locations, ensure the
single cleanup still references GATEWAY_PID and AUTO_PAIR_PID (and uses kill
-TERM and wait "$GATEWAY_PID"), and keep the trap registration (trap cleanup
SIGTERM SIGINT) once after the unified definition so signal handling remains
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70f002ed-6a4b-4377-a208-153e8039af56
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
b6173b9 to
85b8f4e
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)
247-255:⚠️ Potential issue | 🟠 MajorExit from
cleanup()after reaping both children.Because
set -euo pipefailis active, returning fromcleanup()to the surroundingwaitat Line 257 / Line 320 is still risky: the shell can propagate the interrupted wait status instead of the gateway's actual exit code, andAUTO_PAIR_PIDis never waited on at all. Please capture the gateway status insidecleanup(), wait the watcher too, andexitfrom the trap so PID 1 owns the full shutdown path.🐛 Proposed fix pattern for both
cleanup()blockscleanup() { echo "[gateway] received signal, forwarding to children..." - kill -TERM "$GATEWAY_PID" "${AUTO_PAIR_PID:-}" 2>/dev/null - wait "$GATEWAY_PID" 2>/dev/null + local gateway_status=0 + kill -TERM "$GATEWAY_PID" 2>/dev/null || true + if [ -n "${AUTO_PAIR_PID:-}" ]; then + kill -TERM "$AUTO_PAIR_PID" 2>/dev/null || true + fi + wait "$GATEWAY_PID" 2>/dev/null || gateway_status=$? + if [ -n "${AUTO_PAIR_PID:-}" ]; then + wait "$AUTO_PAIR_PID" 2>/dev/null || true + fi + exit "$gateway_status" }Run this read-only check to confirm the current trap/wait shape before updating both branches. Expected result: both
cleanup()blocks showwait "$GATEWAY_PID", there is nowait "$AUTO_PAIR_PID", and the later foregroundwait "$GATEWAY_PID"is still present.#!/bin/bash printf '== cleanup / wait sites ==\n' rg -n -C2 'cleanup\(\)|trap cleanup|AUTO_PAIR_PID|wait "\$GATEWAY_PID"' scripts/nemoclaw-start.sh printf '\n== bash wait builtin ==\n' bash -lc 'help wait' | sed -n '1,25p'Also applies to: 308-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 247 - 255, The cleanup() trap functions currently forward SIGTERM/SIGINT but do not capture the gateway's exit status, fail to wait on AUTO_PAIR_PID, and return to the outer wait (causing set -euo pipefail to propagate an interrupted wait status); update both cleanup() definitions to: store the exit code of the gateway process (GATEWAY_PID) after killing it, explicitly wait for AUTO_PAIR_PID if set, then exit with the saved gateway status (use exit <saved_status>) so PID 1 performs the full shutdown path; ensure trap cleanup SIGTERM SIGINT remains and remove/replace any later standalone wait "$GATEWAY_PID" that would be bypassed by the trap if necessary.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
205-206: Add a regression test for the shutdown path.
test/nemoclaw-start.test.js:10-18currently only checks the non-root nohup/log regexes, so neithercleanup()nor the new PID tracking is exercised. A focused test that starts the entrypoint, sendsSIGTERM, and asserts both child processes are signaled/reaped would make this fix much harder to regress.Also applies to: 247-255, 308-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 205 - 206, Add a regression test that exercises the shutdown path by launching the entrypoint script, sending it SIGTERM, and asserting that cleanup() runs and all child watcher PIDs (e.g., AUTO_PAIR_PID and any other watcher PID variables created around the auto-pair/launcher blocks) are signaled and reaped; specifically, the test should spawn the script, wait for the "launched (pid ...)" outputs to capture child PIDs, send SIGTERM to the parent, then verify both that the child processes exit (reaped) and that any PID-tracking logic (the variables like AUTO_PAIR_PID and the cleanup() handler) were invoked/cleared. Ensure the test replaces the current non-root nohup/log regex checks in the existing test file with this focused lifecycle test so the new PID tracking and cleanup path cannot regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 247-255: The cleanup() trap functions currently forward
SIGTERM/SIGINT but do not capture the gateway's exit status, fail to wait on
AUTO_PAIR_PID, and return to the outer wait (causing set -euo pipefail to
propagate an interrupted wait status); update both cleanup() definitions to:
store the exit code of the gateway process (GATEWAY_PID) after killing it,
explicitly wait for AUTO_PAIR_PID if set, then exit with the saved gateway
status (use exit <saved_status>) so PID 1 performs the full shutdown path;
ensure trap cleanup SIGTERM SIGINT remains and remove/replace any later
standalone wait "$GATEWAY_PID" that would be bypassed by the trap if necessary.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 205-206: Add a regression test that exercises the shutdown path by
launching the entrypoint script, sending it SIGTERM, and asserting that
cleanup() runs and all child watcher PIDs (e.g., AUTO_PAIR_PID and any other
watcher PID variables created around the auto-pair/launcher blocks) are signaled
and reaped; specifically, the test should spawn the script, wait for the
"launched (pid ...)" outputs to capture child PIDs, send SIGTERM to the parent,
then verify both that the child processes exit (reaped) and that any
PID-tracking logic (the variables like AUTO_PAIR_PID and the cleanup() handler)
were invoked/cleared. Ensure the test replaces the current non-root nohup/log
regex checks in the existing test file with this focused lifecycle test so the
new PID tracking and cleanup path cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 78dc414f-b793-41c6-9acb-90be152abfb2
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
latenighthackathon
left a comment
There was a problem hiding this comment.
reviewed
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 266-267: The trap for cleanup is currently registered after
start_auto_pair and print_dashboard_urls run, leaving a startup window where
SIGTERM/SIGINT won't trigger graceful shutdown; move the trap registration (trap
cleanup SIGTERM SIGINT) to before invoking start_auto_pair and
print_dashboard_urls in both startup branches so cleanup is in place
immediately—specifically, place the trap call above the non-root startup
sequence that calls start_auto_pair and print_dashboard_urls and likewise above
the root-startup sequence so both code paths register the trap before launching
gateway/auto-pair processes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b1f6e745-1995-4b71-bddc-c81de52d42ec
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
There was a problem hiding this comment.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
211-225: Please add a regression test for this shutdown path.
test/nemoclaw-start.test.js:1-18only regex-checks the non-rootnohuplaunch, so it won’t catch another wrong-PID, late-trap, or wrong-exit-status regression here. A small script-level test that sendsSIGTERMand asserts both child PIDs are reaped would lock this down.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 211 - 225, Add a regression test that exercises the cleanup() shutdown path: spawn the script under test with mocked child processes whose PIDs become GATEWAY_PID and AUTO_PAIR_PID, send SIGTERM to the parent, and assert both child PIDs are terminated/reaped and the parent exits with the gateway_status returned by wait on GATEWAY_PID; ensure the test replaces/monitors GATEWAY_PID and AUTO_PAIR_PID, triggers the trap handler, and checks both wait() behaviors (including non-zero gateway_status) so regressions like wrong-PID, late-trap, or incorrect exit status are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 211-225: Add a regression test that exercises the cleanup()
shutdown path: spawn the script under test with mocked child processes whose
PIDs become GATEWAY_PID and AUTO_PAIR_PID, send SIGTERM to the parent, and
assert both child PIDs are terminated/reaped and the parent exits with the
gateway_status returned by wait on GATEWAY_PID; ensure the test
replaces/monitors GATEWAY_PID and AUTO_PAIR_PID, triggers the trap handler, and
checks both wait() behaviors (including non-zero gateway_status) so regressions
like wrong-PID, late-trap, or incorrect exit status are caught.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a7f349cc-09a5-4085-9d4c-5e151a224c29
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
…ontainer shutdown The entrypoint runs as PID 1 but had no trap for SIGTERM/SIGINT. On `docker stop`, wait was interrupted and child processes (gateway, auto-pair watcher) were orphaned until Docker sent SIGKILL after the grace period. - Capture auto-pair watcher PID in AUTO_PAIR_PID - Register trap that forwards SIGTERM to gateway and auto-pair - Apply to both root and non-root code paths Fixes NVIDIA#1015
…e-wait Address code review feedback: - Wait for AUTO_PAIR_PID in cleanup() to prevent zombie - Exit from within cleanup() to avoid double-waiting on GATEWAY_PID which would fail under set -e and lose the real exit status
…cstrings Address CodeRabbit review feedback: - Extract cleanup() into a single top-level function to eliminate duplicate definitions in non-root and root paths - Capture gateway exit status and exit from inside the trap handler to prevent set -euo pipefail from propagating interrupted wait status - Wait for AUTO_PAIR_PID conditionally to avoid orphaning the watcher - Add docstrings to write_auth_profile() and print_dashboard_urls() to meet the 80% docstring coverage threshold Closes NVIDIA#1015
…indow Move trap registration to immediately after GATEWAY_PID is captured and before start_auto_pair/print_dashboard_urls run. This eliminates a window during startup where SIGTERM/SIGINT would not trigger graceful shutdown. Addresses CodeRabbit review feedback on the startup race condition.
Verify the signal handling invariants that prevent future regressions: - cleanup() is defined exactly once (not duplicated) - Forwards SIGTERM to both GATEWAY_PID and AUTO_PAIR_PID - Waits for both child processes - Exits with gateway exit status - Trap is registered before start_auto_pair in both code paths - AUTO_PAIR_PID is captured from the background process Addresses CodeRabbit review suggestion.
de17cf8 to
0f56e82
Compare
Summary
The entrypoint (
scripts/nemoclaw-start.sh) runs as PID 1 inside the sandbox container but had no signal handlers. Ondocker stopornemoclaw <name> destroy, SIGTERM interruptedwaitand child processes (gateway, auto-pair watcher) were orphaned until Docker sent SIGKILL after the grace period.This adds a trap that forwards SIGTERM/SIGINT to both child processes for graceful shutdown.
Related Issue
Fixes #1015
Changes
AUTO_PAIR_PID(previously only echo'd, never stored)cleanup()trap that forwards SIGTERM to gateway and auto-pair processes${AUTO_PAIR_PID:-}for safe expansion if auto-pair wasn't startedType of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Manual verification:
docker stop <container>should now show[gateway] received signal, forwarding to children...in logsChecklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Summary by CodeRabbit
New Features
Bug Fixes