fix(onboard): clean up build context temp dir on sandbox creation failure#375

Open
futhgar wants to merge 2 commits into NVIDIA:main from futhgar:fix/onboard-build-context-cleanup

Conversation

@futhgar futhgar commented Mar 19, 2026

Summary

The build context temp directory (/tmp/nemoclaw-build-*) contains the Dockerfile, NemoClaw source code, blueprint policies, and scripts. If openshell sandbox create fails during onboarding, run() calls process.exit(), which bypasses try/finally blocks and leaves the temp directory on disk permanently.

On multi-user systems (e.g., DGX Spark), this leaks project files into the world-readable /tmp.

Fix

Register a process.on('exit') handler immediately after creating the temp directory. This handler fires even when process.exit() is called, guaranteeing cleanup regardless of how the function exits. On success, the handler is explicitly deregistered after cleanup.

Key insight: process.exit() skips try/finally blocks but does execute process.on('exit') handlers synchronously before termination.
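
A minimal sketch of the behavior described above, assuming Node.js is available via process.execPath. The child script and its marker strings are hypothetical and exist only for illustration; they are not taken from onboard.js.

```javascript
const { spawnSync } = require("node:child_process");

// Child script: calls process.exit() inside a try block.
// The finally block is skipped, but the "exit" listener still runs.
const childScript = `
  process.on("exit", (code) => console.log("exit handler ran, code " + code));
  try {
    process.exit(1);
  } finally {
    console.log("finally ran");
  }
`;

// Run the script in a separate Node process so this process keeps running.
const child = spawnSync(process.execPath, ["-e", childScript], { encoding: "utf8" });

const exitHandlerRan = child.stdout.includes("exit handler ran, code 1");
const finallyRan = child.stdout.includes("finally ran");

console.log({ status: child.status, exitHandlerRan, finallyRan });
```

Running the check in a child process keeps the demonstration honest: it observes what a real process.exit() does rather than simulating it.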

Changes

| File | Change |
| --- | --- |
| bin/lib/onboard.js | Register exit handler for build context cleanup; replace run("rm -rf ...") with fs.rmSync() + handler deregistration |
| test/onboard-build-cleanup.test.js | 2 tests: verify cleanup on process.exit(1) (failure path) and on success with handler deregistration |

Test plan

  • npm test — 20/20 core tests pass (no regressions)
  • New test: process.exit(1) mid-build → temp dir removed by exit handler
  • New test: success path → temp dir removed, exit handler deregistered, 0 leaked listeners
  • Manual verification: node -e simulation confirms /tmp/nemoclaw-build-* is cleaned up even on process.exit(1)
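
The failure-path check in the test plan could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's test file: the demo-build- prefix and the child script are hypothetical, and the sketch prints its result instead of using a test framework.

```javascript
const { spawnSync } = require("node:child_process");
const fs = require("node:fs");

// Child script: create a temp dir, register cleanup on "exit", then exit
// abruptly. The directory path is printed so the parent can inspect it.
const childScript = `
  const fs = require("node:fs");
  const os = require("node:os");
  const path = require("node:path");
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "demo-build-"));
  fs.writeFileSync(path.join(dir, "Dockerfile"), "FROM scratch");
  process.on("exit", () => fs.rmSync(dir, { recursive: true, force: true }));
  console.log(dir);
  process.exit(1);
`;

const child = spawnSync(process.execPath, ["-e", childScript], { encoding: "utf8" });

// The child reported where its temp dir lived; it should be gone by now.
const tempDir = child.stdout.trim();
const leaked = fs.existsSync(tempDir);

console.log({ status: child.status, tempDir, leaked });
```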

Summary by CodeRabbit

  • Bug Fixes

    • Ensures temporary build directories are reliably removed by adding a process-exit cleanup handler and removing it after successful completion, so cleanup runs on both normal and abrupt termination.
  • Tests

    • Added integration tests that simulate abrupt exits and normal completion to validate the cleanup handler runs and is properly deregistered when appropriate.

Contributor

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 67155f84-dffd-49c7-bbd4-2d0550295dc4

📥 Commits

Reviewing files that changed from the base of the PR and between de0c943 and 6d575aa.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard-build-cleanup.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard-build-cleanup.test.js

📝 Walkthrough

Replace unconditional post-staging deletion with a dedicated cleanupBuildCtx() registered on process.on("exit"); wrap staging and sandbox creation in try/finally to run cleanup and remove the listener on success; adjust failure paths to rely on the exit handler.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build context cleanup: bin/lib/onboard.js | Add cleanupBuildCtx() using fs.rmSync(buildCtx, { recursive: true, force: true }); register it via process.on("exit", cleanupBuildCtx) before staging; wrap staging + sandbox creation/forwarding in try { ... } finally { cleanupBuildCtx(); process.removeListener("exit", cleanupBuildCtx) }; on sandbox creation failures retain process.exit(...) and rely on the exit handler for cleanup. |
| Integration tests (subprocess): test/onboard-build-cleanup.test.js | Add Vitest subprocess tests (3 scenarios) that spawn node -e scripts to verify exit-handler cleanup: (1) exit with process.exit(1) and ensure temp dir removed, (2) explicit inline cleanup + removeListener then exit and assert removed and exit code, (3) failing inline cleanup that leaves the listener registered and ensures the exit handler removes the dir on normal completion. |

Sequence Diagram(s)

sequenceDiagram
  participant Onboard as "onboard.js"
  participant Node as "Node Process"
  participant FS as "Filesystem (buildCtx)"
  participant Sandbox as "Sandbox"

  rect rgba(200,220,255,0.5)
  Onboard->>FS: create temporary buildCtx
  Onboard->>Node: register exit handler (cleanupBuildCtx)
  end

  rect rgba(200,255,200,0.5)
  Onboard->>Sandbox: stage files & create sandbox
  Sandbox-->>Onboard: ready / error
  end

  alt error -> process.exit called
    Node->>Node: emit exit
    Node->>FS: cleanupBuildCtx() removes buildCtx
  else success
    Onboard->>FS: call cleanupBuildCtx()
    Onboard->>Node: remove exit handler
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibbled at temp crumbs in moonlit ticks,

I tied an exit ribbon to sweep up the mix,
If chaos struck, I'd tidy every trail,
If all went well, I'd loosen the tail,
Happy burrow — no stray bits to fix.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing cleanup of the build context temp directory when sandbox creation fails, which is the primary security and stability fix in this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

423-458: Use try/finally in addition to the exit hook for non-exit exceptions.

At Line 423 onward, a thrown exception (e.g., synchronous FS errors) can skip immediate cleanup and leave sensitive temp contents until process exit. Keep the exit hook, but also scope the staging/create block with try/finally for immediate teardown.

♻️ Suggested structure
   const buildCtx = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-build-"));
   const cleanupBuildCtx = () => {
     try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {}
   };
   process.on("exit", cleanupBuildCtx);
-
-  fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile"));
-  run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`);
-  run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`);
-  run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`);
-  run(`rm -rf "${buildCtx}/nemoclaw/node_modules" "${buildCtx}/nemoclaw/src"`, { ignoreError: true });
+  try {
+    fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile"));
+    run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`);
+    run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`);
+    run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`);
+    run(`rm -rf "${buildCtx}/nemoclaw/node_modules" "${buildCtx}/nemoclaw/src"`, { ignoreError: true });
+
+    // ... sandbox create + forwarding steps ...
+  } finally {
+    cleanupBuildCtx();
+    process.removeListener("exit", cleanupBuildCtx);
+  }
-
-  // Clean up build context and deregister the exit handler
-  cleanupBuildCtx();
-  process.removeListener("exit", cleanupBuildCtx);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 423 - 458, The staging and sandbox creation
sequence (the block using run(...), the openshell sandbox create/forward calls
and related temp work that references sandboxName and uses run) must be wrapped
in a try/finally so cleanupBuildCtx() and process.removeListener("exit",
cleanupBuildCtx) run immediately on any thrown exception (not only at process
exit); keep the existing process.on("exit", cleanupBuildCtx) hook but surround
the code that copies files, builds createArgs/envArgs, calls run(`openshell
sandbox create ...`) and the forward start/stop calls with try { /* existing
code */ } finally { cleanupBuildCtx(); process.removeListener("exit",
cleanupBuildCtx); } so temporary files are removed deterministically even for
synchronous errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/onboard-build-cleanup.test.js`:
- Around line 64-66: Replace the fragile listener count check with a direct
containment check: call process.listeners("exit") and assert it does not include
the cleanup handler (the cleanup function referenced in the test), i.e., verify
!process.listeners("exit").includes(cleanup) instead of comparing
process.listenerCount("exit") > 0; update the test assertion logic around the
existing cleanup reference to ensure it specifically asserts the cleanup
function was deregistered.
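
To illustrate why the containment check is more specific than a listener count, here is a small sketch. The cleanup and unrelated functions are hypothetical stand-ins; the real test's handler is assumed to be a plain function registered with process.on.

```javascript
// Hypothetical stand-ins: `cleanup` plays the role of the cleanup handler,
// `unrelated` is some other "exit" listener owned by different code.
const cleanup = () => {};
const unrelated = () => {};

process.on("exit", unrelated);
process.on("exit", cleanup);
process.removeListener("exit", cleanup);

// Fragile: the count is nonzero only because of the unrelated listener,
// so a count-based assertion would report a leak that is not ours.
const countIsZero = process.listenerCount("exit") === 0;

// Direct: asks specifically whether our handler is still registered.
const cleanupStillRegistered = process.listeners("exit").includes(cleanup);

// Leave the process as we found it.
process.removeListener("exit", unrelated);

console.log({ countIsZero, cleanupStillRegistered });
```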

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 423-458: The staging and sandbox creation sequence (the block
using run(...), the openshell sandbox create/forward calls and related temp work
that references sandboxName and uses run) must be wrapped in a try/finally so
cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx) run
immediately on any thrown exception (not only at process exit); keep the
existing process.on("exit", cleanupBuildCtx) hook but surround the code that
copies files, builds createArgs/envArgs, calls run(`openshell sandbox create
...`) and the forward start/stop calls with try { /* existing code */ } finally
{ cleanupBuildCtx(); process.removeListener("exit", cleanupBuildCtx); } so
temporary files are removed deterministically even for synchronous errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2113ebd5-afa0-40d1-bb6b-f5997b7e6982

📥 Commits

Reviewing files that changed from the base of the PR and between 9513eca and bbc2d91.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard-build-cleanup.test.js

@wscurran wscurran added the security, NemoClaw CLI, and priority: high labels Mar 19, 2026
@wscurran wscurran requested a review from drobison00 March 23, 2026 16:39
Contributor

@cv cv left a comment

The problem is real — run() calling process.exit() bypasses try/finally, leaving build context with source code and potentially credentials in /tmp. The process.on('exit') approach is the correct fix for this.

Stale against main

The createSandbox function has changed since this PR was written. Main now uses streamSandboxCreate() instead of run() with the awk pipe, and the env args section has been restructured (shellQuote added, Discord/Slack tokens added). The PR will conflict on merge.

The fix itself is sound

process.on('exit', cleanupBuildCtx) fires even on process.exit() calls — this is the right pattern. The finally block provides cleanup on normal flow, and the exit handler catches the process.exit() path. Deregistering the handler after successful cleanup prevents double-cleanup.
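
The combined pattern can be sketched in isolation as below. withTempDir is a hypothetical helper written for this illustration, not a function in onboard.js, and it is synchronous for brevity, whereas the PR's createSandbox awaits streamSandboxCreate().

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: runs `work` with a temp dir that is removed on
// return, on throw, and on a process.exit() call made inside `work`.
function withTempDir(prefix, work) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
  const cleanup = () => fs.rmSync(dir, { recursive: true, force: true });
  process.on("exit", cleanup); // covers process.exit() inside work()
  try {
    return work(dir);
  } finally {
    cleanup(); // covers normal return and thrown errors
    process.removeListener("exit", cleanup); // avoid a redundant second run
  }
}

// Normal path: the callback's result survives, the directory does not.
let seenDir;
const result = withTempDir("demo-build-", (dir) => {
  seenDir = dir;
  fs.writeFileSync(path.join(dir, "Dockerfile"), "FROM scratch\n");
  return fs.readdirSync(dir);
});

// Throwing path: the error propagates, and the directory is still removed.
let thrownDir;
let caught = null;
try {
  withTempDir("demo-build-", (dir) => {
    thrownDir = dir;
    throw new Error("staging failed");
  });
} catch (err) {
  caught = err.message;
}

console.log({ result, caught, leaked: fs.existsSync(seenDir) || fs.existsSync(thrownDir) });
```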

Test uses node:test instead of vitest

The test file (test/onboard-build-cleanup.test.js) uses require("node:test") and require("node:assert/strict"). The repo has migrated to vitest — this will fail in CI with "No test suite found" (same issue we hit on other PRs). Needs conversion to vitest with import { describe, it, expect } from "vitest".

The behavioral tests are good

Spawning a child process that registers the exit handler and then calls process.exit(1) is the right way to test this: it validates actual process.exit() behavior rather than matching source patterns.

Rebase onto current main and convert the test to vitest, and this is ready to merge.

@futhgar futhgar force-pushed the fix/onboard-build-context-cleanup branch from 10b5a2a to 0aff7d6 Compare March 24, 2026 11:22
Author

futhgar commented Mar 24, 2026

Thanks @cv — rebased onto current main and addressed both points:

  1. Rebased against main: Adapted to streamSandboxCreate(), shellQuote(), Discord/Slack tokens, and the ready-wait logic. The try/finally + process.on('exit') pattern wraps the full staging+creation block.

  2. Converted test to vitest: test/onboard-build-cleanup.test.js now uses import { describe, it, expect } from "vitest" — both tests pass.

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

548-636: Clean the build context immediately after a successful create.

After streamSandboxCreate() returns 0, the readiness poll and port-forward setup no longer use buildCtx. Keeping the temp tree around until the outer finally extends the on-disk exposure window by up to ~60s on the success path.

♻️ Suggested reshape
-  try {
+  try {
     fs.copyFileSync(path.join(ROOT, "Dockerfile"), path.join(buildCtx, "Dockerfile"));
     run(`cp -r "${path.join(ROOT, "nemoclaw")}" "${buildCtx}/nemoclaw"`);
     run(`cp -r "${path.join(ROOT, "nemoclaw-blueprint")}" "${buildCtx}/nemoclaw-blueprint"`);
     run(`cp -r "${path.join(ROOT, "scripts")}" "${buildCtx}/scripts"`);
     run(`rm -rf "${buildCtx}/nemoclaw/node_modules"`, { ignoreError: true });
@@
     if (createResult.status !== 0) {
       console.error("");
       console.error(`  Sandbox creation failed (exit ${createResult.status}).`);
       if (createResult.output) {
         console.error("");
         console.error(createResult.output);
       }
       console.error("  Try:  openshell sandbox list        # check gateway state");
       console.error("  Try:  nemoclaw onboard              # retry from scratch");
       process.exit(createResult.status || 1);
     }
+  } finally {
+    cleanupBuildCtx();
+    process.removeListener("exit", cleanupBuildCtx);
+  }
 
-    // Wait for sandbox to reach Ready state in k3s before registering.
+  // Wait for sandbox to reach Ready state in k3s before registering.
     console.log("  Waiting for sandbox to become ready...");
     let ready = false;
     for (let i = 0; i < 30; i++) {
       const list = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
       if (isSandboxReady(list, sandboxName)) {
         ready = true;
         break;
       }
       require("child_process").spawnSync("sleep", ["2"]);
     }
@@
-    run(`openshell forward start --background 18789 "${sandboxName}"`, { ignoreError: true });
-  } finally {
-    cleanupBuildCtx();
-    process.removeListener("exit", cleanupBuildCtx);
-  }
+  run(`openshell forward start --background 18789 "${sandboxName}"`, { ignoreError: true });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 548 - 636, The build context (buildCtx) is
only cleaned in the outer finally, leaving temporary files on disk until
readiness polling finishes; after streamSandboxCreate() returns success you
should immediately call cleanupBuildCtx() and remove the exit listener so the
temp tree is removed early — modify the block after checking createResult.status
=== 0 (i.e., after the const createResult = await streamSandboxCreate(...) and
the failure branch) to invoke cleanupBuildCtx() and
process.removeListener("exit", cleanupBuildCtx) before proceeding to
isSandboxReady polling and port-forward setup so buildCtx is removed on the
successful path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 548-636: The build context (buildCtx) is only cleaned in the outer
finally, leaving temporary files on disk until readiness polling finishes; after
streamSandboxCreate() returns success you should immediately call
cleanupBuildCtx() and remove the exit listener so the temp tree is removed early
— modify the block after checking createResult.status === 0 (i.e., after the
const createResult = await streamSandboxCreate(...) and the failure branch) to
invoke cleanupBuildCtx() and process.removeListener("exit", cleanupBuildCtx)
before proceeding to isSandboxReady polling and port-forward setup so buildCtx
is removed on the successful path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55f158a3-99c5-442d-9844-9f2298e22dab

📥 Commits

Reviewing files that changed from the base of the PR and between 10b5a2a and 0aff7d6.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard-build-cleanup.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard-build-cleanup.test.js

@drobison00 drobison00 self-assigned this Mar 24, 2026
…lure

Register a process 'exit' handler to guarantee the build context temp
directory is removed even when run() calls process.exit() on command
failure (which bypasses try/finally). Wrap the sandbox creation block
in try/finally for immediate cleanup on sync exceptions and on the
success path.

Signed-off-by: Josue Gomez <josue@guatulab.com>
@futhgar futhgar force-pushed the fix/onboard-build-context-cleanup branch from 0aff7d6 to de0c943 Compare March 28, 2026 21:08
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1769-1772: The inline cleanup function cleanupBuildCtx currently
swallows fs.rmSync errors and the exit handler is unconditionally removed later;
change cleanupBuildCtx to return a boolean indicating success (true when
fs.rmSync deletes the buildCtx, false when it fails) and log the caught error
instead of silencing it, then only call process.removeListener/removeHandler for
the "exit" event when cleanupBuildCtx() returns true so the fallback exit
handler remains registered after an inline failure; update all places where the
listener is deregistered (the code that calls process.off/process.removeListener
with cleanupBuildCtx) to first invoke cleanupBuildCtx and conditionally remove
the listener on success.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55da4595-1ec3-42f8-a850-c4486e0921c3

📥 Commits

Reviewing files that changed from the base of the PR and between 0aff7d6 and de0c943.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard-build-cleanup.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard-build-cleanup.test.js

Comment on lines +1769 to +1772
  const cleanupBuildCtx = () => {
    try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {}
  };
  process.on("exit", cleanupBuildCtx);
Contributor

⚠️ Potential issue | 🟠 Major

Don't drop the fallback cleanup after a failed rmSync.

Line 1770 swallows every fs.rmSync() failure, but Line 1883 removes the "exit" handler unconditionally. If inline cleanup fails on the success path, /tmp/nemoclaw-build-* is left behind and there is no retry at process shutdown. Have cleanupBuildCtx() report success/failure and only deregister the listener after a successful removal.

Suggested change
-  const cleanupBuildCtx = () => {
-    try { fs.rmSync(buildCtx, { recursive: true, force: true }); } catch {}
-  };
+  const cleanupBuildCtx = () => {
+    try {
+      fs.rmSync(buildCtx, { recursive: true, force: true });
+      return true;
+    } catch (error) {
+      console.error(`  Failed to remove temporary build context '${buildCtx}': ${error.message}`);
+      return false;
+    }
+  };
...
-  } finally {
-    cleanupBuildCtx();
-    process.removeListener("exit", cleanupBuildCtx);
-  }
+  } finally {
+    if (cleanupBuildCtx()) {
+      process.removeListener("exit", cleanupBuildCtx);
+    }
+  }

Also applies to: 1881-1883
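
The conditional deregistration can be exercised without touching the filesystem. In this sketch, makeCleanup and cleanupAndMaybeDeregister are hypothetical names, and the removal step is injected so that a failure (here a made-up EPERM error on the first attempt) can be simulated.

```javascript
// Hypothetical factory: `remove` is injected so a failure can be simulated.
function makeCleanup(remove) {
  return function cleanup() {
    try {
      remove();
      return true;
    } catch (error) {
      console.error(`cleanup failed: ${error.message}`);
      return false;
    }
  };
}

// Deregister the "exit" listener only when cleanup actually succeeded.
// Returns whether the listener is still registered afterwards.
function cleanupAndMaybeDeregister(cleanup) {
  if (cleanup()) {
    process.removeListener("exit", cleanup);
  }
  return process.listeners("exit").includes(cleanup);
}

// Simulated removal that fails once, then succeeds on retry.
let attempts = 0;
const flaky = makeCleanup(() => {
  attempts += 1;
  if (attempts === 1) {
    throw new Error("EPERM: operation not permitted");
  }
});

process.on("exit", flaky);
const keptAfterFailure = cleanupAndMaybeDeregister(flaky); // fallback stays
const keptAfterRetry = cleanupAndMaybeDeregister(flaky); // retry succeeded

console.log({ attempts, keptAfterFailure, keptAfterRetry });
```

Keeping the listener after a failed attempt means the removal gets one more chance at process exit, which is the fallback the review comment asks for.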

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1769 - 1772, The inline cleanup function
cleanupBuildCtx currently swallows fs.rmSync errors and the exit handler is
unconditionally removed later; change cleanupBuildCtx to return a boolean
indicating success (true when fs.rmSync deletes the buildCtx, false when it
fails) and log the caught error instead of silencing it, then only call
process.removeListener/removeHandler for the "exit" event when cleanupBuildCtx()
returns true so the fallback exit handler remains registered after an inline
failure; update all places where the listener is deregistered (the code that
calls process.off/process.removeListener with cleanupBuildCtx) to first invoke
cleanupBuildCtx and conditionally remove the listener on success.

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
chore: add vouch system for first-time contributors
cleanupBuildCtx now returns a boolean so the process exit handler is
only deregistered when rmSync actually succeeds. If inline cleanup
fails (e.g. EPERM), the exit handler stays registered as a safety net
to retry removal on process exit.

Labels

NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). priority: high Important issue that should be resolved in the next release security Something isn't secure

4 participants