Skip to content

fix: harden installer and onboard resiliency#961

Merged
cv merged 12 commits intomainfrom
fix/446-installer-resiliency-signed
Mar 27, 2026
Merged

fix: harden installer and onboard resiliency#961
cv merged 12 commits intomainfrom
fix/446-installer-resiliency-signed

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Mar 26, 2026

Summary

Hardens NemoClaw install and onboarding resiliency across repeat runs, partial failures, and restart/runtime drift.

Supersedes #770 with a clean signed-commit branch.

Addresses:

What Changed

  • Added persisted onboarding session state and nemoclaw onboard --resume
  • Added sanitized onboarding state to nemoclaw debug
  • Made repeat onboard non-destructive by default when the sandbox is healthy
  • Improved recovery and messaging for gateway/runtime drift after restart
  • Improved sandbox image transfer failure classification and recovery guidance
  • Hardened installer/uninstall behavior and related shell/session handling
  • Tightened session-state redaction and secret handling
  • Aligned onboard OpenClaw UI output with current main

How To Test

Start from the PR branch:

git fetch origin
git switch fix/446-installer-resiliency-signed
git pull --ff-only origin fix/446-installer-resiliency-signed

Please focus on resiliency and recovery behavior:

  1. Fresh onboard

    • ./install.sh
    • nemoclaw onboard
    • Verify nemoclaw <sandbox> status and nemoclaw <sandbox> connect
  2. Interrupted onboard + resume

    • Interrupt or fail onboard mid-run
    • Fix the underlying issue
    • Run nemoclaw onboard --resume
    • Verify completed steps are skipped when safe and onboarding completes without full cleanup
  3. Repeat onboard

    • Run onboard again with the same sandbox name
    • Verify a healthy sandbox is reused, not silently recreated
    • Run onboard with a different sandbox name
    • Verify the old sandbox remains intact
  4. Runtime/gateway drift

    • After onboarding, restart the relevant runtime/gateway context if feasible
    • Run nemoclaw <sandbox> status and nemoclaw <sandbox> connect
    • Verify NemoClaw attempts recovery first and falls back to clear onboard --resume guidance when needed
  5. Debug visibility

    • nemoclaw debug --quick
    • Verify onboarding session state is present and secrets are not exposed

Please include in any bug report:

  • platform/environment
  • exact command run
  • full terminal output
  • nemoclaw debug --quick
  • openshell status
  • openshell sandbox list
  • openshell gateway info

Validation

npx vitest run test/runtime-recovery.test.js test/onboard.test.js test/onboard-session.test.js test/cli.test.js test/install-preflight.test.js test/uninstall.test.js
npx vitest run --coverage
npx tsc -p jsconfig.json --noEmit

Summary by CodeRabbit

  • New Features

    • Resumable onboarding: use "nemoclaw onboard --resume" to continue or repair interrupted setup; resume skips already-completed steps when safe.
    • Improved recovery guidance that detects gateway/sandbox states and suggests restart or recreation actions.
  • Bug Fixes

    • More reliable gateway/sandbox health detection and clearer sandbox-create recovery hints.
    • Stronger resume validation to prevent invalid/conflicting resumes.
  • Chores

    • Installer switched to a compatibility wrapper with clearer PATH/profile messaging; uninstall now cleans onboard session state; debug includes an onboard session summary.
  • Tests

    • Expanded unit and E2E tests covering resume, recovery, and session persistence.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new atomic, resumable onboard-session module with cross-process locking and redaction; integrates session-driven resume and recovery into onboarding, augments gateway/sandbox classification and repair logic, updates installer/wrapper and debug hooks, and adds many unit, integration, and E2E tests.

Changes

Cohort / File(s) Summary
Session Persistence & Locking
bin/lib/onboard-session.js, test/onboard-session.test.js
New module persisting atomic JSON at ~/.nemoclaw/onboard-session.json with exclusive lock ~/.nemoclaw/onboard.lock; provides create/load/save, step lifecycle (start/complete/fail), stale-lock detection, redaction of secrets, atomic temp-file writes, and comprehensive tests.
Onboard Flow & State Logic
bin/lib/onboard.js, test/onboard.test.js
Onboard now uses session persistence and lock-driven resume; richer gateway/sandbox reuse and health classification, sandbox repair/recreate, build-context filtering/copy helpers, credential hydration for inference, resume conflict detection, adjusted step ordering, and expanded unit tests.
Runtime Recovery Helpers
bin/lib/runtime-recovery.js, test/runtime-recovery.test.js, test/nemoclaw-cli-recovery.test.js
New parser/classifier utilities for OpenShell outputs (sandbox names, sandbox/gateway state), decision predicate for whether to attempt recovery, and a helper to select recovery command; unit and CLI recovery integration tests added.
CLI Flags, Entrypoints & Installer
bin/nemoclaw.js, install.sh, scripts/install.sh, scripts/debug.sh
Adds --resume handling to CLI, installer replaced by repo-wrapper that execs root install.sh, installer detects interrupted sessions to auto-run onboard --resume, debug script collects onboard-session summary, and messaging for PATH/profile recovery adjusted.
Sandbox Creation & Recovery
bin/lib/onboard.js (helpers)
Adds sandbox create failure classification, sandbox state inference, repairRecordedSandbox, sandbox create enhancements (optional name override, filtered build-context copy), and targeted recovery hint printing.
Credential & Policy Handling
bin/lib/onboard.js
Introduces hydrateCredentialEnv, uses hydrated secrets for inference setup, changes policy preset flow to be selection/resume-aware, and adds readiness checks for inference/OpenClaw.
E2E Resume/Repair Tests
test/e2e/test-onboard-resume.sh, test/e2e/test-onboard-repair.sh
New Bash E2E suites validating interrupted-run creation, resume repair and invalidation, sandbox recreation, credential persistence/hydration, and full end-to-end log/state assertions.
Tests & Debugging
test/cli.test.js, test/install-preflight.test.js, test/onboard-session.test.js, test/onboard.test.js, test/runtime-recovery.test.js, test/nemoclaw-cli-recovery.test.js, test/uninstall.test.js
Wide-ranging test additions and updates: CLI parsing, installer wrapper behavior, session lifecycle and redaction, runtime recovery classifiers, resume conflict cases, credential hydration checks, and uninstall/session-cleanup tests.
Uninstall Behavior
uninstall.sh, test/uninstall.test.js
Uninstall docs/behavior updated to remove onboard session state; shim removal now deletes only symlinks (leaves regular files) and emits a warning; tests adjusted accordingly.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as "nemoclaw CLI"
    participant Lock as "Lock Manager\n(~/.nemoclaw/onboard.lock)"
    participant Session as "Session Storage\n(~/.nemoclaw/onboard-session.json)"
    participant Onboard as "Onboard Flow"
    participant Gateway as "Gateway/OpenShell"
    participant Sandbox as "Sandbox/OpenShell"

    User->>CLI: run "nemoclaw onboard" / "nemoclaw onboard --resume"
    CLI->>Lock: acquireOnboardLock(command)
    Lock->>Lock: create/check lock file (pid,startTime,cmd)
    Lock-->>CLI: acquired | holder metadata | stale-handled
    CLI->>Session: loadSession() or createSession()
    CLI->>Onboard: execute steps
    Onboard->>Gateway: start/check health (getGatewayStartEnv, getGatewayReuseState)
    Onboard->>Sandbox: create/check/repair (createSandbox, repairRecordedSandbox)
    Onboard->>Session: markStepStarted/Complete/Failed
    Onboard->>Session: saveSession() (atomic write + redact)
    CLI->>Lock: releaseOnboardLock()
Loading
sequenceDiagram
    actor User
    participant CLI as "nemoclaw CLI"
    participant Session as "Session Storage"
    participant Recovery as "Runtime Recovery"
    participant Gateway as "Gateway/OpenShell"
    participant Sandbox as "Sandbox/OpenShell"

    User->>CLI: "nemoclaw onboard --resume"
    CLI->>Session: loadSession()
    Session-->>CLI: session (resumable? recorded steps)
    CLI->>Recovery: classifyGatewayStatus(), classifySandboxLookup()
    alt Recoverable (no conflicts)
        CLI->>Gateway: skip/start/select based on state
        CLI->>Sandbox: skip/recreate based on state
        CLI->>Onboard: continue remaining steps and save progress
    else ConflictsDetected
        CLI-->>User: abort resume with conflict errors
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through lines and left a careful trace,

Locked the burrow tight, saved progress in its place,
Secrets gently masked, timestamps snug and warm,
Resume and repair — I nudge the flow back to form,
A tiny rabbit cheering every saved restart.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.12% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: harden installer and onboard resiliency' clearly and specifically summarizes the main changes across the PR, which involve improving installer robustness, onboarding session persistence, resume capabilities, and runtime recovery.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/446-installer-resiliency-signed

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/uninstall.test.js (1)

75-116: ⚠️ Potential issue | 🟠 Major

Stub npm in these shim tests.

Both tests source uninstall.sh and call remove_nemoclaw_cli(), which still runs npm unlink -g nemoclaw / npm uninstall -g nemoclaw before it inspects the shim. With the current env, that hits whatever npm is on the host PATH, so the suite can mutate the developer/CI global install. Put a fake npm earlier in PATH here, the same way the --yes test already does.

🧪 Minimal hardening
     const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-uninstall-preserve-"));
+    const fakeBin = path.join(tmp, "bin");
     const shimDir = path.join(tmp, ".local", "bin");
     const shimPath = path.join(shimDir, "nemoclaw");
+    fs.mkdirSync(fakeBin, { recursive: true });
     fs.mkdirSync(shimDir, { recursive: true });
     fs.writeFileSync(shimPath, "#!/usr/bin/env bash\n", { mode: 0o755 });
+    fs.writeFileSync(path.join(fakeBin, "npm"), "#!/usr/bin/env bash\nexit 0\n", { mode: 0o755 });

     const result = spawnSync(
       "bash",
       ["-lc", `HOME="${tmp}" source "${UNINSTALL_SCRIPT}"; remove_nemoclaw_cli`],
       {
         cwd: path.join(import.meta.dirname, ".."),
         encoding: "utf-8",
+        env: { ...process.env, PATH: `${fakeBin}:/usr/bin:/bin` },
       },
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/uninstall.test.js` around lines 75 - 116, The tests invoking
remove_nemoclaw_cli (via sourcing UNINSTALL_SCRIPT in the spawnSync bash
command) must stub npm on PATH so the script's npm unlink/uninstall calls don't
invoke the host npm; create a temp "bin" directory, write a small fake "npm"
executable (exit 0 or mimic expected output) and make it executable, prepend
that bin dir to PATH in the env passed to spawnSync (as done in the --yes test),
and use that env for both spawnSync calls that source UNINSTALL_SCRIPT so the
installer-managed shim logic is exercised without touching the real global npm.
🧹 Nitpick comments (4)
scripts/install.sh (1)

15-23: Write the deprecation banner to stderr.

This wrapper currently prepends warning text to stdout before execing the real installer. That changes the stdout contract of delegated commands like --help / --version; sending the banner to stderr keeps the warning without polluting the underlying installer output.

🪄 Small cleanup
 warn_legacy_path() {
-  cat <<EOF
+  cat <<EOF >&2
 [install] deprecated compatibility wrapper: scripts/install.sh
 [install] supported installer: ${ROOT_INSTALLER_URL}
 EOF
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/install.sh` around lines 15 - 23, The deprecation banner currently
writes to stdout via warn_legacy_path (the heredoc using cat <<EOF) and then
calls warn_legacy_path; change the function so its heredoc writes to stderr
(e.g., redirect the heredoc/cat output to >&2 or use printf to >&2) and keep the
existing warn_legacy_path invocation so the banner is emitted to stderr before
execing the real installer; ensure the message text and variable
${ROOT_INSTALLER_URL} remain unchanged.
install.sh (1)

273-281: Minor: detect_shell_profile could miss fish/other shells.

The function defaults to .bashrc for non-zsh shells. Users running fish, tcsh, or other shells won't get accurate profile suggestions. This is a reasonable simplification for the common case, but consider adding a note in the output or documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@install.sh` around lines 273 - 281, The detect_shell_profile function
currently defaults non-zsh shells to ~/.bashrc which can misidentify users of
fish, tcsh, etc.; update detect_shell_profile to check $SHELL basename for
common shells like fish (use ~/.config/fish/config.fish), tcsh/csh (use
~/.tcshrc or ~/.cshrc), and fallback to ~/.profile or print a visible note when
the shell is unrecognized; reference the detect_shell_profile function and
adjust its branching logic to add these checks and/or emit a comment suggesting
manual profile selection when an unknown shell is detected.
test/onboard-session.test.js (1)

10-14: Consider potential test isolation concern with global HOME modification.

Setting process.env.HOME at module load time (Line 11) affects the entire Node process for the duration of the test run. While this works correctly when this test file runs in isolation, it could cause issues if:

  1. Other test files in the same Vitest worker expect the original HOME
  2. The module under test is cached with the modified HOME path

This is likely fine given Vitest's default worker isolation, but worth noting if flaky behavior appears.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard-session.test.js` around lines 10 - 14, The test mutates the
global process.env.HOME at module load (tmpDir and process.env.HOME) and then
requires the module via createRequire(import.meta.url) +
require("../bin/lib/onboard-session"), which can leak into other tests or cache
the module with the wrong HOME; fix by saving the original HOME before changing
it, set process.env.HOME to tmpDir only immediately before requiring the module
(or inside a beforeEach), require the module (session) while HOME is modified,
then restore the original HOME immediately after (or in an afterEach) so other
tests see the original environment and the module cache doesn’t persist an
unexpected HOME.
test/onboard.test.js (1)

372-411: Consider adding a clarifying comment for the provider mapping chain.

The test validates that "cloud" provider → "nvidia-prod" effective provider, but the two-stage mapping (via getRequestedProviderHintgetEffectiveProviderName) isn't immediately clear. A brief comment explaining this alias chain would improve maintainability.

📝 Suggested clarifying comment
   it("detects resume conflicts for explicit provider and model changes", () => {
     const previousProvider = process.env.NEMOCLAW_PROVIDER;
     const previousModel = process.env.NEMOCLAW_MODEL;
+    // "cloud" is an alias that maps through getRequestedProviderHint ("build")
+    // to getEffectiveProviderName ("nvidia-prod")
     process.env.NEMOCLAW_PROVIDER = "cloud";
     process.env.NEMOCLAW_MODEL = "nvidia/other-model";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 372 - 411, Add a brief clarifying comment
in the test near the setup of process.env.NEMOCLAW_PROVIDER explaining the
two-stage provider alias resolution used by the code under test: that
getRequestedProviderHint maps the external alias ("cloud") to an internal
requested hint and then getEffectiveProviderName resolves that hint to the
effective provider name ("nvidia-prod"), so the test expects provider conflict
between requested "nvidia-prod" and recorded "nvidia-nim"; place this comment
adjacent to the getResumeConfigConflicts invocation (or the environment setup)
to make the alias chain explicit for future readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/debug.sh`:
- Around line 110-112: The current derivation of SCRIPT_DIR, REPO_ROOT and
ONBOARD_SESSION_HELPER can resolve relative to the caller when the script is run
via "bash -s", so update the logic that sets SCRIPT_DIR and REPO_ROOT to first
verify that BASH_SOURCE[0] is non-empty and points to an existing file; if not,
leave ONBOARD_SESSION_HELPER empty (or unset) so it cannot point at a caller's
../bin/lib/onboard-session.js. Then gate the helper usage (the conditional that
uses ONBOARD_SESSION_HELPER / require('./../bin/lib/onboard-session.js')) on
ONBOARD_SESSION_HELPER being non-empty/real. Refer to the variables SCRIPT_DIR,
REPO_ROOT, and ONBOARD_SESSION_HELPER and the helper-require conditional to
implement these checks.

---

Outside diff comments:
In `@test/uninstall.test.js`:
- Around line 75-116: The tests invoking remove_nemoclaw_cli (via sourcing
UNINSTALL_SCRIPT in the spawnSync bash command) must stub npm on PATH so the
script's npm unlink/uninstall calls don't invoke the host npm; create a temp
"bin" directory, write a small fake "npm" executable (exit 0 or mimic expected
output) and make it executable, prepend that bin dir to PATH in the env passed
to spawnSync (as done in the --yes test), and use that env for both spawnSync
calls that source UNINSTALL_SCRIPT so the installer-managed shim logic is
exercised without touching the real global npm.

---

Nitpick comments:
In `@install.sh`:
- Around line 273-281: The detect_shell_profile function currently defaults
non-zsh shells to ~/.bashrc which can misidentify users of fish, tcsh, etc.;
update detect_shell_profile to check $SHELL basename for common shells like fish
(use ~/.config/fish/config.fish), tcsh/csh (use ~/.tcshrc or ~/.cshrc), and
fallback to ~/.profile or print a visible note when the shell is unrecognized;
reference the detect_shell_profile function and adjust its branching logic to
add these checks and/or emit a comment suggesting manual profile selection when
an unknown shell is detected.

In `@scripts/install.sh`:
- Around line 15-23: The deprecation banner currently writes to stdout via
warn_legacy_path (the heredoc using cat <<EOF) and then calls warn_legacy_path;
change the function so its heredoc writes to stderr (e.g., redirect the
heredoc/cat output to >&2 or use printf to >&2) and keep the existing
warn_legacy_path invocation so the banner is emitted to stderr before execing
the real installer; ensure the message text and variable ${ROOT_INSTALLER_URL}
remain unchanged.

In `@test/onboard-session.test.js`:
- Around line 10-14: The test mutates the global process.env.HOME at module load
(tmpDir and process.env.HOME) and then requires the module via
createRequire(import.meta.url) + require("../bin/lib/onboard-session"), which
can leak into other tests or cache the module with the wrong HOME; fix by saving
the original HOME before changing it, set process.env.HOME to tmpDir only
immediately before requiring the module (or inside a beforeEach), require the
module (session) while HOME is modified, then restore the original HOME
immediately after (or in an afterEach) so other tests see the original
environment and the module cache doesn’t persist an unexpected HOME.

In `@test/onboard.test.js`:
- Around line 372-411: Add a brief clarifying comment in the test near the setup
of process.env.NEMOCLAW_PROVIDER explaining the two-stage provider alias
resolution used by the code under test: that getRequestedProviderHint maps the
external alias ("cloud") to an internal requested hint and then
getEffectiveProviderName resolves that hint to the effective provider name
("nvidia-prod"), so the test expects provider conflict between requested
"nvidia-prod" and recorded "nvidia-nim"; place this comment adjacent to the
getResumeConfigConflicts invocation (or the environment setup) to make the alias
chain explicit for future readers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: db04f9b7-e31f-4bdb-822a-e593c19f4f0e

📥 Commits

Reviewing files that changed from the base of the PR and between f0f53e4 and 7b70543.

📒 Files selected for processing (17)
  • bin/lib/onboard-session.js
  • bin/lib/onboard.js
  • bin/lib/runtime-recovery.js
  • bin/nemoclaw.js
  • install.sh
  • scripts/debug.sh
  • scripts/install.sh
  • test/cli.test.js
  • test/e2e/test-onboard-repair.sh
  • test/e2e/test-onboard-resume.sh
  • test/install-preflight.test.js
  • test/nemoclaw-cli-recovery.test.js
  • test/onboard-session.test.js
  • test/onboard.test.js
  • test/runtime-recovery.test.js
  • test/uninstall.test.js
  • uninstall.sh

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/onboard.test.js (1)

927-932: Consider using findLast() for cleaner code.

The .slice().reverse().find() pattern works correctly, but can be simplified with Array.prototype.findLast() (available in Node.js 18+). The same pattern appears at lines 1013-1018.

✨ Suggested simplification
     const payloadLine = result.stdout
       .trim()
       .split("\n")
-      .slice()
-      .reverse()
-      .find((line) => line.startsWith("{") && line.endsWith("}"));
+      .findLast((line) => line.startsWith("{") && line.endsWith("}"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 927 - 932, Replace the
".slice().reverse().find()" arrays used to locate the JSON payload with the
simpler Array.prototype.findLast() to improve readability: change the assignment
to payloadLine (the expression starting from
result.stdout.trim().split("\n").slice().reverse().find(...)) to use
.findLast(line => line.startsWith("{") && line.endsWith("}")) instead, and make
the same replacement for the other identical pattern later in the file (the
second result.stdout processing around the other payload extraction). Ensure
Node.js 18+ runtime compatibility is acceptable before applying.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/onboard.test.js`:
- Around line 927-932: Replace the ".slice().reverse().find()" arrays used to
locate the JSON payload with the simpler Array.prototype.findLast() to improve
readability: change the assignment to payloadLine (the expression starting from
result.stdout.trim().split("\n").slice().reverse().find(...)) to use
.findLast(line => line.startsWith("{") && line.endsWith("}")) instead, and make
the same replacement for the other identical pattern later in the file (the
second result.stdout processing around the other payload extraction). Ensure
Node.js 18+ runtime compatibility is acceptable before applying.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2bb959e9-9326-4a12-ad90-0daaaca7e29a

📥 Commits

Reviewing files that changed from the base of the PR and between 7b70543 and 345f8c6.

📒 Files selected for processing (9)
  • bin/lib/onboard-session.js
  • bin/lib/onboard.js
  • install.sh
  • scripts/debug.sh
  • scripts/install.sh
  • test/install-preflight.test.js
  • test/onboard-session.test.js
  • test/onboard.test.js
  • test/uninstall.test.js
✅ Files skipped from review due to trivial changes (3)
  • scripts/debug.sh
  • test/uninstall.test.js
  • bin/lib/onboard.js
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/onboard-session.test.js
  • test/install-preflight.test.js
  • bin/lib/onboard-session.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard-session.js`:
- Line 320: The code currently assigns updates.endpointUrl directly to
safe.endpointUrl (and later returns it from summarizeForDebug()), which can
expose credentials; update the assignment path that sets safe.endpointUrl (and
any similar assignment around the other occurrences) to first run the URL
through a redaction helper (create a small helper used by summarizeForDebug()
like redactUrl(url)): parse the URL, remove or mask userinfo
(username:password@), and also strip or mask common sensitive query parameters
(e.g., signature, sig, token, auth, access_token) and then set safe.endpointUrl
to the redacted string so summarizeForDebug() never returns credential-bearing
URLs.
- Around line 305-310: completeSession currently applies filterSafeUpdates and
sets status/failure but never flips the session's resumable flag, so sessions
created with createSession (which defaults resumable to true) remain resumable
after completion; modify completeSession (the function that calls updateSession
and uses filterSafeUpdates) to explicitly set session.resumable = false after
applying safe updates so completed sessions are not considered resumable by
downstream recovery code.
- Around line 208-249: When a concurrent writer leaves the lock file unreadable
or briefly missing, the current loop treats parseLockFile(null) or readFileSync
ENOENT as stale and unlinks a still-live lock; change the logic in the
lock-acquisition loop (the block that does fs.openSync(LOCK_FILE, "wx", ...),
parseLockFile(...), and isProcessAlive(...)) so that if readFileSync throws
ENOENT or parseLockFile(...) returns null you DO NOT unlink the file but instead
treat it as a transient race and retry (i.e., continue to the next attempt after
a short backoff), and only unlink the LOCK_FILE when parseLockFile returned a
valid payload and isProcessAlive(...) is false; also update releaseOnboardLock
so it ignores ENOENT from readFileSync and does not unlink when parseLockFile
returns null (only unlink when parsed payload has matching pid).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a038dc4-fa76-4a43-866e-10b03c3aa9c4

📥 Commits

Reviewing files that changed from the base of the PR and between d9b0634 and d8174d7.

📒 Files selected for processing (1)
  • bin/lib/onboard-session.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
bin/lib/onboard-session.js (2)

97-113: ⚠️ Potential issue | 🟠 Major

Redact fragments and provider-signed query params too.

Line 105 only masks exact key matches, so URLs like ...?X-Amz-Signature=... / ...?X-Goog-Signature=... and fragment tokens such as #access_token=... still survive into the saved session and nemoclaw debug. Strip url.hash and broaden the sensitive-key match here.

🧼 Suggested hardening
+function isSensitiveUrlParam(key) {
+  return (
+    /^(x-amz-|x-goog-)/i.test(key) ||
+    /(^|[-_])(sig|signature|token|auth|credential|key)([-_]|$)/i.test(key)
+  );
+}
+
 function redactUrl(value) {
   if (typeof value !== "string" || value.length === 0) return null;
   try {
     const url = new URL(value);
@@
-    for (const key of [...url.searchParams.keys()]) {
-      if (/^(signature|sig|token|auth|access_token)$/i.test(key)) {
+    url.hash = "";
+    for (const key of [...url.searchParams.keys()]) {
+      if (isSensitiveUrlParam(key)) {
         url.searchParams.set(key, "<REDACTED>");
       }
     }
     return url.toString();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard-session.js` around lines 97 - 113, The redactUrl function
currently only masks exact query key matches and leaves URL fragments and
provider-prefixed signature keys intact; update redactUrl to also clear url.hash
(to remove fragment tokens) and broaden the search-key matching to include
common prefixed variations like
/(^|[-_])(?:signature|sig|token|auth|access_token)$/i or similar so keys such as
X-Amz-Signature and X-Goog-Signature are caught, iterating url.searchParams to
set matched keys to "<REDACTED>" (as it already does) and falling back to
redactSensitiveText(value) on parse errors; reference redactUrl, url.hash,
url.searchParams, and redactSensitiveText when making the change.

227-271: ⚠️ Potential issue | 🟠 Major

Keep the lock path fail-closed under partial writes.

Line 230 creates the lock path before its JSON payload is complete. If that write throws, Line 231 never runs, so the fd stays open and the half-written lock is left behind; and if a competing reader hits that window twice, this path still falls through to { stale: true } even though the owner may still be alive. Please close and clean up the creator path in a finally, and only report stale: true after successfully parsing a dead-owner record.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard-session.js`:
- Around line 298-309: markStepStarted and markStepFailed currently leave an old
completedAt timestamp on the step which makes non-complete states appear
completed; update both functions (markStepStarted and markStepFailed) inside the
updateSession callback where you mutate session.steps[stepName] (the step
object) to explicitly clear step.completedAt (e.g., set to null or undefined)
whenever you transition the step.status away from "complete", keeping the
existing changes to status, startedAt, error, failure, and session.status as is.
- Around line 365-369: The helper currently sets safe.metadata = {} whenever
isObject(updates.metadata) is true, which causes existing metadata.gatewayName
to be dropped if updates.metadata lacks gatewayName; change the logic in the
metadata handling (the isObject(updates.metadata) block) to only set
safe.metadata when a whitelisted field is present—check for typeof
updates.metadata.gatewayName === "string" and only then initialize safe.metadata
and assign safe.metadata.gatewayName, avoiding creating an empty safe.metadata
that would be merged away by Object.assign.

---

Duplicate comments:
In `@bin/lib/onboard-session.js`:
- Around line 97-113: The redactUrl function currently only masks exact query
key matches and leaves URL fragments and provider-prefixed signature keys
intact; update redactUrl to also clear url.hash (to remove fragment tokens) and
broaden the search-key matching to include common prefixed variations like
/(^|[-_])(?:signature|sig|token|auth|access_token)$/i or similar so keys such as
X-Amz-Signature and X-Goog-Signature are caught, iterating url.searchParams to
set matched keys to "<REDACTED>" (as it already does) and falling back to
redactSensitiveText(value) on parse errors; reference redactUrl, url.hash,
url.searchParams, and redactSensitiveText when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 786f3657-8be4-4a4c-a389-7f372633331e

📥 Commits

Reviewing files that changed from the base of the PR and between d8174d7 and 49a02f2.

📒 Files selected for processing (2)
  • bin/lib/onboard-session.js
  • test/onboard-session.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard-session.test.js

cv

This comment was marked as duplicate.

@cv
Copy link
Copy Markdown
Contributor

cv commented Mar 27, 2026

Nit: _setupPolicies (lines 2373–2511 in bin/lib/onboard.js) is dead code — nothing calls it now that setupPoliciesWithSelection has fully replaced it. Worth removing to avoid confusion, but fine as a follow-up.

@cv cv merged commit a03eda0 into main Mar 27, 2026
8 checks passed
@cv cv deleted the fix/446-installer-resiliency-signed branch March 27, 2026 17:31
TSavo pushed a commit to wopr-network/nemoclaw that referenced this pull request Mar 28, 2026
* fix: improve gateway lifecycle recovery (NVIDIA#953)

* fix: improve gateway lifecycle recovery

* docs: fix readme markdown list spacing

* fix: tighten gateway lifecycle review follow-ups

* fix: simplify tokenized control ui output

* fix: restore chat route in control ui urls

* refactor: simplify ansi stripping in onboard

* fix: shorten control ui url output

* fix: move control ui below cli next steps

* fix: swap hard/soft ulimit settings in start script (NVIDIA#951)

Fixes NVIDIA#949

Co-authored-by: KJ <kejones@nvidia.com>

* chore: add cyclomatic complexity lint rule (NVIDIA#875)

* chore: add cyclomatic complexity rule (ratchet from 95)

Add ESLint complexity rule to bin/ and scripts/ to prevent new
functions from accumulating excessive branching. Starting threshold
is 95 (current worst offender: setupNim in onboard.js). Ratchet
plan: 95 → 40 → 25 → 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: ratchet complexity to 20, suppress existing violations

Suppress 6 functions that exceed the threshold with eslint-disable
comments so we can start enforcing at 20 instead of 95:

- setupNim (95), setupPolicies (41), setupInference (22) in onboard.js
- deploy (22), main IIFE (27) in nemoclaw.js
- applyPreset (24) in policies.js

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: suppress complexity for 3 missed functions

preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add host-side config and state file locations to README (NVIDIA#903)

Signed-off-by: peteryuqin <peter.yuqin@gmail.com>

* chore: add tsconfig.cli.json, root execa, TS coverage ratchet (NVIDIA#913)

* chore: add tsconfig.cli.json, root execa, TS coverage ratchet

Foundation for the CLI TypeScript migration (PR 0 of the shell
consolidation plan). No runtime changes — config, tooling, and
dependency only.

- tsconfig.cli.json: strict TS type-checking for bin/ and scripts/
  (noEmit, module: preserve — tsx handles the runtime)
- scripts/check-coverage-ratchet.ts: pure TS replacement for the
  bash+python coverage ratchet script (same logic, same tolerance)
- execa ^9.6.1 added to root devDependencies (used by PR 1+)
- pr.yaml: coverage ratchet step now runs the TS version via tsx
- .pre-commit-config.yaml: SPDX headers cover scripts/*.ts,
  new tsc-check-cli pre-push hook
- CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook
- Delete scripts/check-coverage-ratchet.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Apply suggestion from @brandonpelfrey

* chore: address PR feedback — use types_or, add tsx devDep

- Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli
  hook per @brandonpelfrey's suggestion.
- Add `tsx` to devDependencies so CI doesn't re-fetch it on every run
  per CodeRabbit's suggestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): ignore GitHub "Apply suggestion" commits in commitlint

* fix(ci): lint only PR title since repo is squash-merge only

Reverts the commitlint ignores rule from the previous commit and
instead removes the per-commit lint step entirely.

Individual commit messages are discarded at merge time — only the
squash-merged PR title lands in main and drives changelog generation.
Drop the per-commit lint, keep the PR title check, and remove the
now-unnecessary fetch-depth: 0.

* Revert "fix(ci): lint only PR title since repo is squash-merge only"

This reverts commit 1257a47.

* Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint"

This reverts commit c395657.

* docs: fix markdownlint MD032 in README (blank line before list)

* refactor: make coverage ratchet script idiomatic TypeScript

- Wrap in main() with process.exitCode instead of scattered process.exit()
- Replace mutable flags with .map()/.some() over typed MetricResult[]
- Separate pure logic (checkMetrics) from formatting (formatReport)
- Throw with { cause } chaining instead of exit-in-helpers
- Derive CoverageThresholds from METRICS tuple (single source of truth)
- Exhaustive switch on CheckStatus discriminated union

* refactor: remove duplication in coverage ratchet script

- Drop STATUS_LABELS map; inline labels in exhaustive switch
- Extract common 'metric coverage is N%' preamble in formatResult
- Simplify ratchetedThresholds: use results directly (already in
  METRICS order) instead of re-scanning with .find() per metric
- Compute 'failed' once in main, pass into formatReport to avoid
  duplicate .some() scan

* refactor: simplify coverage ratchet with FP patterns

- Extract classify() as a named pure function (replaces nested ternary)
- loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and
  SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern)
- Drop CoverageMetric/CoverageSummary interfaces (only pct is read);
  use structural type at the call site instead
- Inline ratchetedThresholds (one-liner, used once)
- formatReport derives fail/improved from results instead of taking
  a pre-computed boolean (let functions derive from data, don't
  thread derived state)
- sections.join("\n\n") replaces manual empty-string pushing
- Shorter type names (Thresholds, Status, Result) — no ambiguity
  in a single-purpose script

* refactor: strip coverage ratchet to failure-only output

prek hides output from commands that exit 0, so ok/improved
reporting was dead code. Remove Status, Result, classify,
formatResult, formatReport, and the ratcheted-thresholds
suggestion block. The script now just filters for regressions
and prints actionable errors on failure.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>

* fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets (NVIDIA#438)

* fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets

The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket
connections when endpoints are configured with protocol:rest + tls:terminate.
Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses
HTTP-level timeouts entirely.

Discord:
- gateway.discord.gg → access:full (WebSocket gateway)
- Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions)
- Add media.discordapp.net for attachment access

Slack:
- Add wss-primary.slack.com and wss-backup.slack.com → access:full
  (Socket Mode WebSocket endpoints)

Partially addresses NVIDIA#409 — the policy-level fix enables WebSocket
connections to survive. The hardcoded 2-min timeout in openshell-sandbox
still affects any protocol:rest endpoints with long-lived connections.

Related: NVIDIA#361 (WhatsApp Web, same root cause)

* fix: correct comment wording for media endpoint and YAML formatting

* fix: standardize Node.js minimum version to 22.16 (NVIDIA#840)

* fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh

Shellcheck flagged it as unused after the min/recommended merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enforce full semver >=22.16.0 in installer scripts

The runtime checks only compared the major Node.js version, allowing
22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the
version_gte() helper for full semver comparison in both installers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden version_gte and align fallback message

Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1")
that would crash bash arithmetic. Also update the manual-install
fallback message to reference MIN_NODE_VERSION instead of hardcoded "22".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test

- Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0
- Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0
- Add new test asserting Node.js 20 is rejected by ensure_supported_runtime

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden installer and onboard resiliency (NVIDIA#961)

* fix: harden installer and onboard resiliency

* fix: address installer and debug review follow-ups

* fix: harden onboard resume across later setup steps

* test: simplify payload extraction in onboard tests

* test: keep onboard payload extraction target-compatible

* chore: align onboard session lint with complexity rule

* fix: harden onboard session safety and lock handling

* fix: tighten onboard session redaction and metadata handling

* fix(security): strip credentials from migration snapshots and enforce blueprint digest (NVIDIA#769)

Reconciles NVIDIA#156 and NVIDIA#743 into a single comprehensive solution:

- Filter auth-profiles.json at copy time via cpSync filter (from NVIDIA#743)
- Recursive stripCredentials() with pattern-based field detection for
  deep config sanitization (from NVIDIA#156: CREDENTIAL_FIELDS set +
  CREDENTIAL_FIELD_PATTERN regex)
- Remove gateway config section (contains auth tokens) from sandbox
  openclaw.json
- Blueprint digest verification (SHA-256): recorded at snapshot time,
  validated on restore, empty/missing digest is a hard failure
- computeFileDigest() throws when blueprint file is missing instead of
  silently returning null
- Sanitize both snapshot-level and sandbox-bundle openclaw.json copies
- Backward compatible: old snapshots without blueprintDigest skip
  validation
- Bump SNAPSHOT_VERSION 2 → 3

Supersedes NVIDIA#156 and NVIDIA#743.

* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects (NVIDIA#1025)

* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects

OpenShell injects NO_PROXY=127.0.0.1,localhost,::1 into the sandbox, missing
inference.local and the gateway IP (10.200.0.1). This causes LLM inference
requests to route through the egress proxy instead of going direct, and the
proxy gateway IP itself gets proxied.

Add proxy configuration block to nemoclaw-start.sh that:
- Exports HTTP_PROXY, HTTPS_PROXY, and NO_PROXY with inference.local and
  the gateway IP included
- Persists via /etc/profile.d/nemoclaw-proxy.sh (root) or ~/.profile
  (non-root fallback) so values survive OpenShell reconnect injection
- Supports NEMOCLAW_PROXY_HOST / NEMOCLAW_PROXY_PORT overrides

The non-root fallback ensures the fix works in environments like Brev where
containers run without root privileges.

Tested on DGX Spark (ARM64) and Brev VM (x86_64). Verified NO_PROXY contains
inference.local and 10.200.0.1 inside the live sandbox after connect.

Ref: NVIDIA#626, NVIDIA#704
Ref: NVIDIA#704 (comment)

* fix(sandbox): write proxy config to ~/.bashrc for interactive reconnect sessions

OpenShell's `sandbox connect` spawns `/bin/bash -i` (interactive, non-login),
which sources ~/.bashrc — not ~/.profile or /etc/profile.d/*.  The previous
approach wrote to ~/.profile and /etc/profile.d/, neither of which is sourced
by `bash -i`, so the narrow OpenShell-injected NO_PROXY persisted in live
interactive sessions.

Changes:
- Write proxy snippet to ~/.bashrc (primary) and ~/.profile (login fallback)
- Export both uppercase and lowercase proxy variants (NO_PROXY + no_proxy,
  HTTP_PROXY + http_proxy, etc.) — Node.js undici prefers lowercase no_proxy
  over uppercase NO_PROXY when both are set
- Add idempotency guard to prevent duplicate blocks on container restart
- Update tests: verify .bashrc writing, idempotency, bash -i override
  behavior, and lowercase variant correctness

Tested on DGX Spark (ARM64) and Brev VM (x86_64) with full destroy +
re-onboard + live `env | grep proxy` verification inside the sandbox shell
via `openshell sandbox connect`.

Ref: NVIDIA#626

* fix(sandbox): replace stale proxy values on restart with begin/end markers

Use begin/end markers in .bashrc/.profile proxy snippet so
_write_proxy_snippet replaces the block when PROXY_HOST/PORT change
instead of silently keeping stale values. Adds test coverage for the
replacement path.

Addresses CodeRabbit review feedback on idempotency gap.

* fix(sandbox): resolve sandbox user home dynamically when running as root

When the entrypoint runs as root, $HOME is /root — the proxy snippet
was written to /root/.bashrc instead of the sandbox user's home.
Use getent passwd to look up the sandbox user's home when running as
UID 0; fall back to /sandbox if the user entry is missing.

Addresses CodeRabbit review feedback on _SANDBOX_HOME resolution.

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

* fix(policies): preset application for versionless policies (Fixes NVIDIA#35) (NVIDIA#101)

* fix(policies): allow preset application for versionless policies (Fixes NVIDIA#35)

Fixes NVIDIA#35

Signed-off-by: Deepak Jain <deepujain@gmail.com>

* fix: remove stale complexity suppression in policies

---------

Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: Kevin Jones <kejones@nvidia.com>

* fix: restore routed inference and connect UX (NVIDIA#1037)

* fix: restore routed inference and connect UX

* fix: simplify detected local inference hint

* fix: remove stale local inference hint

* test: relax connect forward assertion

---------

Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: KJ <kejones@nvidia.com>
Co-authored-by: Emily Wilkins <80470879+epwilkins@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Peter <peter.yuqin@gmail.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Co-authored-by: Lucas Wang <lucas_wang@lucas-futures.com>
Co-authored-by: senthilr-nv <senthilr@nvidia.com>
Co-authored-by: Deepak Jain <deepujain@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants