fix(onboard): clean up build context temp dir on sandbox creation failure#375
fix(onboard): clean up build context temp dir on sandbox creation failure#375futhgar wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughReplace unconditional post-staging deletion with a dedicated Changes
Sequence Diagram(s)sequenceDiagram
participant Onboard as "onboard.js"
participant Node as "Node Process"
participant FS as "Filesystem (buildCtx)"
participant Sandbox as "Sandbox"
rect rgba(200,220,255,0.5)
Onboard->>FS: create temporary buildCtx
Onboard->>Node: register exit handler (cleanupBuildCtx)
end
rect rgba(200,255,200,0.5)
Onboard->>Sandbox: stage files & create sandbox
Sandbox-->>Onboard: ready / error
end
alt error -> process.exit called
Node->>Node: emit exit
Node->>FS: cleanupBuildCtx() removes buildCtx
else success
Onboard->>FS: call cleanupBuildCtx()
Onboard->>Node: remove exit handler
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
423-458: Usetry/finallyin addition to the exit hook for non-exit exceptions.At Line 423 onward, a thrown exception (e.g., synchronous FS errors) can skip immediate cleanup and leave sensitive temp contents until process exit. Keep the exit hook, but also scope the staging/create block with
try/finallyfor immediate teardown.♻️ Suggested structure
const buildCtx = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-build-")); const cleanupBuildCtx = () => { try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {} }; process.on("exit", cleanupBuildCtx); - - fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile")); - run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`); - run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`); - run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`); - run(`rm -rf "${buildCtx}/nemoclaw/node_modules" "${buildCtx}/nemoclaw/src"`, { ignoreError: true }); + try { + fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile")); + run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`); + run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`); + run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`); + run(`rm -rf "${buildCtx}/nemoclaw/node_modules" "${buildCtx}/nemoclaw/src"`, { ignoreError: true }); + + // ... sandbox create + forwarding steps ... + } finally { + cleanupBuildCtx(); + process.removeListener("exit", cleanupBuildCtx); + } - - // Clean up build context and deregister the exit handler - cleanupBuildCtx(); - process.removeListener("exit", cleanupBuildCtx);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 423 - 458, The staging and sandbox creation sequence (the block using run(...), the openshell sandbox create/forward calls and related temp work that references sandboxName and uses run) must be wrapped in a try/finally so cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx) run immediately on any thrown exception (not only at process exit); keep the existing process.on("exit", cleanupBuildCtx) hook but surround the code that copies files, builds createArgs/envArgs, calls run(`openshell sandbox create ...`) and the forward start/stop calls with try { /* existing code */ } finally { cleanupBuildCtx(); process.removeListener("exit", cleanupBuildCtx); } so temporary files are removed deterministically even for synchronous errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/onboard-build-cleanup.test.js`:
- Around line 64-66: Replace the fragile listener count check with a direct
containment check: call process.listeners("exit") and assert it does not include
the cleanup handler (the cleanup function referenced in the test), i.e., verify
!process.listeners("exit").includes(cleanup) instead of comparing
process.listenerCount("exit") > 0; update the test assertion logic around the
existing cleanup reference to ensure it specifically asserts the cleanup
function was deregistered.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 423-458: The staging and sandbox creation sequence (the block
using run(...), the openshell sandbox create/forward calls and related temp work
that references sandboxName and uses run) must be wrapped in a try/finally so
cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx) run
immediately on any thrown exception (not only at process exit); keep the
existing process.on("exit", cleanupBuildCtx) hook but surround the code that
copies files, builds createArgs/envArgs, calls run(`openshell sandbox create
...`) and the forward start/stop calls with try { /* existing code */ } finally
{ cleanupBuildCtx(); process.removeListener("exit", cleanupBuildCtx); } so
temporary files are removed deterministically even for synchronous errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2113ebd5-afa0-40d1-bb6b-f5997b7e6982
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard-build-cleanup.test.js
cv
left a comment
There was a problem hiding this comment.
The problem is real — run() calling process.exit() bypasses try/finally, leaving build context with source code and potentially credentials in /tmp. The process.on('exit') approach is the correct fix for this.
Stale against main
The createSandbox function has changed since this PR was written. Main now uses streamSandboxCreate() instead of run() with the awk pipe, and the env args section has been restructured (shellQuote added, Discord/Slack tokens added). The PR will conflict on merge.
The fix itself is sound
process.on('exit', cleanupBuildCtx) fires even on process.exit() calls — this is the right pattern. The finally block provides cleanup on normal flow, and the exit handler catches the process.exit() path. Deregistering the handler after successful cleanup prevents double-cleanup.
Test uses node:test instead of vitest
The test file (test/onboard-build-cleanup.test.js) uses require("node:test") and require("node:assert/strict"). The repo has migrated to vitest — this will fail in CI with "No test suite found" (same issue we hit on other PRs). Needs conversion to vitest with import { describe, it, expect } from "vitest".
The behavioral tests are good
Spawning a child process that registers the exit handler and then process.exit(1) is the right way to test this — it validates actual process.exit() behavior rather than matching source patterns.
Rebase onto current main and convert the test to vitest, and this is ready to merge.
10b5a2a to
0aff7d6
Compare
|
Thanks @cv — rebased onto current main and addressed both points:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
548-636: Clean the build context immediately after a successful create.After
streamSandboxCreate()returns0, the readiness poll and port-forward setup no longer usebuildCtx. Keeping the temp tree around until the outerfinallyextends the on-disk exposure window by up to ~60s on the success path.♻️ Suggested reshape
- try { + try { fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile")); run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`); run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`); run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`); run(`rm -rf "${buildCtx}/nemoclaw/node_modules"`, { ignoreError: true }); @@ if (createResult.status !== 0) { console.error(""); console.error(` Sandbox creation failed (exit ${createResult.status}).`); if (createResult.output) { console.error(""); console.error(createResult.output); } console.error(" Try: openshell sandbox list # check gateway state"); console.error(" Try: nemoclaw onboard # retry from scratch"); process.exit(createResult.status || 1); } + } finally { + cleanupBuildCtx(); + process.removeListener("exit", cleanupBuildCtx); + } - // Wait for sandbox to reach Ready state in k3s before registering. + // Wait for sandbox to reach Ready state in k3s before registering. console.log(" Waiting for sandbox to become ready..."); let ready = false; for (let i = 0; i < 30; i++) { const list = runCapture("openshell sandbox list 2>&1", { ignoreError: true }); if (isSandboxReady(list, sandboxName)) { ready = true; break; } require("child_process").spawnSync("sleep", ["2"]); } @@ - run(`openshell forward start --background 18789 "${sandboxName}"`, { ignoreError: true }); - } finally { - cleanupBuildCtx(); - process.removeListener("exit", cleanupBuildCtx); - } + run(`openshell forward start --background 18789 "${sandboxName}"`, { ignoreError: true });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 548 - 636, The build context (buildCtx) is only cleaned in the outer finally, leaving temporary files on disk until readiness polling finishes; after streamSandboxCreate() returns success you should immediately call cleanupBuildCtx() and remove the exit listener so the temp tree is removed early — modify the block after checking createResult.status === 0 (i.e., after the const createResult = await streamSandboxCreate(...) and the failure branch) to invoke cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx) before proceeding to isSandboxReady polling and port-forward setup so buildCtx is removed on the successful path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 548-636: The build context (buildCtx) is only cleaned in the outer
finally, leaving temporary files on disk until readiness polling finishes; after
streamSandboxCreate() returns success you should immediately call
cleanupBuildCtx() and remove the exit listener so the temp tree is removed early
— modify the block after checking createResult.status === 0 (i.e., after the
const createResult = await streamSandboxCreate(...) and the failure branch) to
invoke cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx)
before proceeding to isSandboxReady polling and port-forward setup so buildCtx
is removed on the successful path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 55f158a3-99c5-442d-9844-9f2298e22dab
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard-build-cleanup.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-build-cleanup.test.js
…lure Register a process 'exit' handler to guarantee the build context temp directory is removed even when run() calls process.exit() on command failure (which bypasses try/finally). Wrap the sandbox creation block in try/finally for immediate cleanup on sync exceptions and on the success path. Signed-off-by: Josue Gomez <josue@guatulab.com>
0aff7d6 to
de0c943
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1769-1772: The inline cleanup function cleanupBuildCtx currently
swallows fs.rmSync errors and the exit handler is unconditionally removed later;
change cleanupBuildCtx to return a boolean indicating success (true when
fs.rmSync deletes the buildCtx, false when it fails) and log the caught error
instead of silencing it, then only call process.removeListener/removeHandler for
the "exit" event when cleanupBuildCtx() returns true so the fallback exit
handler remains registered after an inline failure; update all places where the
listener is deregistered (the code that calls process.off/process.removeListener
with cleanupBuildCtx) to first invoke cleanupBuildCtx and conditionally remove
the listener on success.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 55da4595-1ec3-42f8-a850-c4486e0921c3
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard-build-cleanup.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-build-cleanup.test.js
| const cleanupBuildCtx = () => { | ||
| try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {} | ||
| }; | ||
| process.on("exit", cleanupBuildCtx); |
There was a problem hiding this comment.
Don't drop the fallback cleanup after a failed rmSync.
Line 1770 swallows every fs.rmSync() failure, but Line 1883 removes the "exit" handler unconditionally. If inline cleanup fails on the success path, /tmp/nemoclaw-build-* is left behind and there is no retry at process shutdown. Have cleanupBuildCtx() report success/failure and only deregister the listener after a successful removal.
Suggested change
- const cleanupBuildCtx = () => {
- try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {}
- };
+ const cleanupBuildCtx = () => {
+ try {
+ fs.rmSync(buildCtx, { recursive: true, force: true });
+ return true;
+ } catch (error) {
+ console.error(` Failed to remove temporary build context '${buildCtx}': ${error.message}`);
+ return false;
+ }
+ };
...
- } finally {
- cleanupBuildCtx();
- process.removeListener("exit", cleanupBuildCtx);
- }
+ } finally {
+ if (cleanupBuildCtx()) {
+ process.removeListener("exit", cleanupBuildCtx);
+ }
+ }Also applies to: 1881-1883
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 1769 - 1772, The inline cleanup function
cleanupBuildCtx currently swallows fs.rmSync errors and the exit handler is
unconditionally removed later; change cleanupBuildCtx to return a boolean
indicating success (true when fs.rmSync deletes the buildCtx, false when it
fails) and log the caught error instead of silencing it, then only call
process.removeListener/removeHandler for the "exit" event when cleanupBuildCtx()
returns true so the fallback exit handler remains registered after an inline
failure; update all places where the listener is deregistered (the code that
calls process.off/process.removeListener with cleanupBuildCtx) to first invoke
cleanupBuildCtx and conditionally remove the listener on success.
chore: add vouch system for first-time contributors
cleanupBuildCtx now returns a boolean so the process exit handler is only deregistered when rmSync actually succeeds. If inline cleanup fails (e.g. EPERM), the exit handler stays registered as a safety net to retry removal on process exit.
Summary
The build context temp directory (
/tmp/nemoclaw-build-*) contains the Dockerfile, NemoClaw source code, blueprint policies, and scripts. Ifopenshell sandbox createfails during onboarding,run()callsprocess.exit()which bypassestry/finallyblocks, leaving the temp directory on disk permanently.On multi-user systems (e.g., DGX Spark), this leaks project files into the world-readable
/tmp.Fix
Register a
process.on('exit')handler immediately after creating the temp directory. This handler fires even whenprocess.exit()is called, guaranteeing cleanup regardless of how the function exits. On success, the handler is explicitly deregistered after cleanup.Key insight:
process.exit()skipstry/finallyblocks but does executeprocess.on('exit')handlers synchronously before termination.Changes
bin/lib/onboard.jsrun("rm -rf ...")withfs.rmSync()+ handler deregistrationtest/onboard-build-cleanup.test.jsprocess.exit(1)(failure path) and on success with handler deregistrationTest plan
npm test— 20/20 core tests pass (no regressions)process.exit(1)mid-build → temp dir removed by exit handlernode -esimulation confirms/tmp/nemoclaw-build-*is cleaned up even onprocess.exit(1)Summary by CodeRabbit
Bug Fixes
Tests