fix: harden installer and onboard resiliency#961
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a new atomic, resumable onboard-session module with cross-process locking and redaction; integrates session-driven resume and recovery into onboarding; augments gateway/sandbox classification and repair logic; updates the installer, wrapper, and debug hooks; and adds many unit, integration, and E2E tests.
Sequence Diagram(s)
sequenceDiagram
actor User
participant CLI as "nemoclaw CLI"
participant Lock as "Lock Manager\n(~/.nemoclaw/onboard.lock)"
participant Session as "Session Storage\n(~/.nemoclaw/onboard-session.json)"
participant Onboard as "Onboard Flow"
participant Gateway as "Gateway/OpenShell"
participant Sandbox as "Sandbox/OpenShell"
User->>CLI: run "nemoclaw onboard" / "nemoclaw onboard --resume"
CLI->>Lock: acquireOnboardLock(command)
Lock->>Lock: create/check lock file (pid,startTime,cmd)
Lock-->>CLI: acquired | holder metadata | stale-handled
CLI->>Session: loadSession() or createSession()
CLI->>Onboard: execute steps
Onboard->>Gateway: start/check health (getGatewayStartEnv, getGatewayReuseState)
Onboard->>Sandbox: create/check/repair (createSandbox, repairRecordedSandbox)
Onboard->>Session: markStepStarted/Complete/Failed
Onboard->>Session: saveSession() (atomic write + redact)
CLI->>Lock: releaseOnboardLock()
sequenceDiagram
actor User
participant CLI as "nemoclaw CLI"
participant Session as "Session Storage"
participant Recovery as "Runtime Recovery"
participant Gateway as "Gateway/OpenShell"
participant Sandbox as "Sandbox/OpenShell"
User->>CLI: "nemoclaw onboard --resume"
CLI->>Session: loadSession()
Session-->>CLI: session (resumable? recorded steps)
CLI->>Recovery: classifyGatewayStatus(), classifySandboxLookup()
alt Recoverable (no conflicts)
CLI->>Gateway: skip/start/select based on state
CLI->>Sandbox: skip/recreate based on state
CLI->>Onboard: continue remaining steps and save progress
else ConflictsDetected
CLI-->>User: abort resume with conflict errors
end
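The `saveSession() (atomic write + redact)` step in the first diagram can be sketched as a write to a temporary file followed by a rename. This is a minimal illustration under stated assumptions, not the PR's implementation: `saveJsonAtomic`, the temp-file naming, and the session shape are all hypothetical, and redaction is left out.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: write JSON to a temp file in the same directory,
// then rename it over the target so readers never observe a partial file.
function saveJsonAtomic(file, data) {
  const dir = path.dirname(file);
  fs.mkdirSync(dir, { recursive: true });
  const tmp = path.join(dir, `.${path.basename(file)}.${process.pid}.tmp`);
  try {
    fs.writeFileSync(tmp, JSON.stringify(data, null, 2), { mode: 0o600 });
    // rename replaces the target in one step when both paths share a filesystem
    fs.renameSync(tmp, file);
  } catch (err) {
    fs.rmSync(tmp, { force: true }); // do not leave the temp file behind
    throw err;
  }
}

// Usage against a throwaway directory (not the real ~/.nemoclaw).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "session-demo-"));
const sessionFile = path.join(demoDir, "onboard-session.json");
saveJsonAtomic(sessionFile, { status: "in_progress", steps: {} });
saveJsonAtomic(sessionFile, { status: "complete", steps: {} });
const saved = JSON.parse(fs.readFileSync(sessionFile, "utf-8"));
```

Keeping the temp file in the same directory as the target is the design choice that matters here, since a rename across filesystems is not a single-step replacement.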
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/uninstall.test.js (1)
75-116: ⚠️ Potential issue | 🟠 Major
Stub `npm` in these shim tests.
Both tests source `uninstall.sh` and call `remove_nemoclaw_cli()`, which still runs `npm unlink -g nemoclaw` / `npm uninstall -g nemoclaw` before it inspects the shim. With the current env, that hits whatever `npm` is on the host PATH, so the suite can mutate the developer/CI global install. Put a fake `npm` earlier in `PATH` here, the same way the `--yes` test already does.
🧪 Minimal hardening
 const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-uninstall-preserve-"));
+const fakeBin = path.join(tmp, "bin");
 const shimDir = path.join(tmp, ".local", "bin");
 const shimPath = path.join(shimDir, "nemoclaw");
+fs.mkdirSync(fakeBin, { recursive: true });
 fs.mkdirSync(shimDir, { recursive: true });
 fs.writeFileSync(shimPath, "#!/usr/bin/env bash\n", { mode: 0o755 });
+fs.writeFileSync(path.join(fakeBin, "npm"), "#!/usr/bin/env bash\nexit 0\n", { mode: 0o755 });
 const result = spawnSync(
   "bash",
   ["-lc", `HOME="${tmp}" source "${UNINSTALL_SCRIPT}"; remove_nemoclaw_cli`],
   {
     cwd: path.join(import.meta.dirname, ".."),
     encoding: "utf-8",
+    env: { ...process.env, PATH: `${fakeBin}:/usr/bin:/bin` },
   },
 );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/uninstall.test.js` around lines 75 - 116, The tests invoking remove_nemoclaw_cli (via sourcing UNINSTALL_SCRIPT in the spawnSync bash command) must stub npm on PATH so the script's npm unlink/uninstall calls don't invoke the host npm; create a temp "bin" directory, write a small fake "npm" executable (exit 0 or mimic expected output) and make it executable, prepend that bin dir to PATH in the env passed to spawnSync (as done in the --yes test), and use that env for both spawnSync calls that source UNINSTALL_SCRIPT so the installer-managed shim logic is exercised without touching the real global npm.
🧹 Nitpick comments (4)
scripts/install.sh (1)
15-23: Write the deprecation banner to stderr.
This wrapper currently prepends warning text to stdout before `exec`ing the real installer. That changes the stdout contract of delegated commands like `--help`/`--version`; sending the banner to stderr keeps the warning without polluting the underlying installer output.
🪄 Small cleanup
 warn_legacy_path() {
-  cat <<EOF
+  cat <<EOF >&2
 [install] deprecated compatibility wrapper: scripts/install.sh
 [install] supported installer: ${ROOT_INSTALLER_URL}
 EOF
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/install.sh` around lines 15 - 23, The deprecation banner currently writes to stdout via warn_legacy_path (the heredoc using cat <<EOF) and then calls warn_legacy_path; change the function so its heredoc writes to stderr (e.g., redirect the heredoc/cat output to >&2 or use printf to >&2) and keep the existing warn_legacy_path invocation so the banner is emitted to stderr before execing the real installer; ensure the message text and variable ${ROOT_INSTALLER_URL} remain unchanged.
install.sh (1)
273-281: Minor: `detect_shell_profile` could miss fish/other shells.
The function defaults to `.bashrc` for non-zsh shells. Users running fish, tcsh, or other shells won't get accurate profile suggestions. This is a reasonable simplification for the common case, but consider adding a note in the output or documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@install.sh` around lines 273 - 281, The detect_shell_profile function currently defaults non-zsh shells to ~/.bashrc which can misidentify users of fish, tcsh, etc.; update detect_shell_profile to check $SHELL basename for common shells like fish (use ~/.config/fish/config.fish), tcsh/csh (use ~/.tcshrc or ~/.cshrc), and fallback to ~/.profile or print a visible note when the shell is unrecognized; reference the detect_shell_profile function and adjust its branching logic to add these checks and/or emit a comment suggesting manual profile selection when an unknown shell is detected.
test/onboard-session.test.js (1)
10-14: Consider potential test isolation concern with global HOME modification.
Setting `process.env.HOME` at module load time (Line 11) affects the entire Node process for the duration of the test run. While this works correctly when this test file runs in isolation, it could cause issues if:
- Other test files in the same Vitest worker expect the original HOME
- The module under test is cached with the modified HOME path
This is likely fine given Vitest's default worker isolation, but worth noting if flaky behavior appears.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard-session.test.js` around lines 10 - 14, The test mutates the global process.env.HOME at module load (tmpDir and process.env.HOME) and then requires the module via createRequire(import.meta.url) + require("../bin/lib/onboard-session"), which can leak into other tests or cache the module with the wrong HOME; fix by saving the original HOME before changing it, set process.env.HOME to tmpDir only immediately before requiring the module (or inside a beforeEach), require the module (session) while HOME is modified, then restore the original HOME immediately after (or in an afterEach) so other tests see the original environment and the module cache doesn’t persist an unexpected HOME.
test/onboard.test.js (1)
372-411: Consider adding a clarifying comment for the provider mapping chain.
The test validates that the `"cloud"` provider maps to the `"nvidia-prod"` effective provider, but the two-stage mapping (via `getRequestedProviderHint` → `getEffectiveProviderName`) isn't immediately clear. A brief comment explaining this alias chain would improve maintainability.
📝 Suggested clarifying comment
 it("detects resume conflicts for explicit provider and model changes", () => {
   const previousProvider = process.env.NEMOCLAW_PROVIDER;
   const previousModel = process.env.NEMOCLAW_MODEL;
+  // "cloud" is an alias that maps through getRequestedProviderHint ("build")
+  // to getEffectiveProviderName ("nvidia-prod")
   process.env.NEMOCLAW_PROVIDER = "cloud";
   process.env.NEMOCLAW_MODEL = "nvidia/other-model";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.js` around lines 372 - 411, Add a brief clarifying comment in the test near the setup of process.env.NEMOCLAW_PROVIDER explaining the two-stage provider alias resolution used by the code under test: that getRequestedProviderHint maps the external alias ("cloud") to an internal requested hint and then getEffectiveProviderName resolves that hint to the effective provider name ("nvidia-prod"), so the test expects provider conflict between requested "nvidia-prod" and recorded "nvidia-nim"; place this comment adjacent to the getResumeConfigConflicts invocation (or the environment setup) to make the alias chain explicit for future readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/debug.sh`:
- Around line 110-112: The current derivation of SCRIPT_DIR, REPO_ROOT and
ONBOARD_SESSION_HELPER can resolve relative to the caller when the script is run
via "bash -s", so update the logic that sets SCRIPT_DIR and REPO_ROOT to first
verify that BASH_SOURCE[0] is non-empty and points to an existing file; if not,
leave ONBOARD_SESSION_HELPER empty (or unset) so it cannot point at a caller's
../bin/lib/onboard-session.js. Then gate the helper usage (the conditional that
uses ONBOARD_SESSION_HELPER / require('./../bin/lib/onboard-session.js')) on
ONBOARD_SESSION_HELPER being non-empty/real. Refer to the variables SCRIPT_DIR,
REPO_ROOT, and ONBOARD_SESSION_HELPER and the helper-require conditional to
implement these checks.
---
Outside diff comments:
In `@test/uninstall.test.js`:
- Around line 75-116: The tests invoking remove_nemoclaw_cli (via sourcing
UNINSTALL_SCRIPT in the spawnSync bash command) must stub npm on PATH so the
script's npm unlink/uninstall calls don't invoke the host npm; create a temp
"bin" directory, write a small fake "npm" executable (exit 0 or mimic expected
output) and make it executable, prepend that bin dir to PATH in the env passed
to spawnSync (as done in the --yes test), and use that env for both spawnSync
calls that source UNINSTALL_SCRIPT so the installer-managed shim logic is
exercised without touching the real global npm.
---
Nitpick comments:
In `@install.sh`:
- Around line 273-281: The detect_shell_profile function currently defaults
non-zsh shells to ~/.bashrc which can misidentify users of fish, tcsh, etc.;
update detect_shell_profile to check $SHELL basename for common shells like fish
(use ~/.config/fish/config.fish), tcsh/csh (use ~/.tcshrc or ~/.cshrc), and
fallback to ~/.profile or print a visible note when the shell is unrecognized;
reference the detect_shell_profile function and adjust its branching logic to
add these checks and/or emit a comment suggesting manual profile selection when
an unknown shell is detected.
In `@scripts/install.sh`:
- Around line 15-23: The deprecation banner currently writes to stdout via
warn_legacy_path (the heredoc using cat <<EOF) and then calls warn_legacy_path;
change the function so its heredoc writes to stderr (e.g., redirect the
heredoc/cat output to >&2 or use printf to >&2) and keep the existing
warn_legacy_path invocation so the banner is emitted to stderr before execing
the real installer; ensure the message text and variable ${ROOT_INSTALLER_URL}
remain unchanged.
In `@test/onboard-session.test.js`:
- Around line 10-14: The test mutates the global process.env.HOME at module load
(tmpDir and process.env.HOME) and then requires the module via
createRequire(import.meta.url) + require("../bin/lib/onboard-session"), which
can leak into other tests or cache the module with the wrong HOME; fix by saving
the original HOME before changing it, set process.env.HOME to tmpDir only
immediately before requiring the module (or inside a beforeEach), require the
module (session) while HOME is modified, then restore the original HOME
immediately after (or in an afterEach) so other tests see the original
environment and the module cache doesn’t persist an unexpected HOME.
In `@test/onboard.test.js`:
- Around line 372-411: Add a brief clarifying comment in the test near the setup
of process.env.NEMOCLAW_PROVIDER explaining the two-stage provider alias
resolution used by the code under test: that getRequestedProviderHint maps the
external alias ("cloud") to an internal requested hint and then
getEffectiveProviderName resolves that hint to the effective provider name
("nvidia-prod"), so the test expects provider conflict between requested
"nvidia-prod" and recorded "nvidia-nim"; place this comment adjacent to the
getResumeConfigConflicts invocation (or the environment setup) to make the alias
chain explicit for future readers.
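The HOME isolation note for `test/onboard-session.test.js` above boils down to a save, override, restore pattern. The following is a rough sketch of that pattern only; `withTempHome` is a hypothetical helper name, not something the test file is known to contain.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: run fn with HOME pointed at a temp dir, then restore
// the original value even if fn throws.
function withTempHome(fn) {
  const originalHome = process.env.HOME;
  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "home-demo-"));
  process.env.HOME = tmpDir;
  try {
    return fn(tmpDir);
  } finally {
    // Assigning undefined would store the string "undefined", so delete instead.
    if (originalHome === undefined) delete process.env.HOME;
    else process.env.HOME = originalHome;
    fs.rmSync(tmpDir, { recursive: true, force: true });
  }
}

const homeBefore = process.env.HOME;
const seenInside = withTempHome((tmp) => process.env.HOME === tmp);
const homeAfter = process.env.HOME;
```

A module that derives paths from HOME at load time would have to be required inside the callback for the override to take effect, and its cached copy would still hold the temporary path afterwards.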
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: db04f9b7-e31f-4bdb-822a-e593c19f4f0e
📒 Files selected for processing (17)
- bin/lib/onboard-session.js
- bin/lib/onboard.js
- bin/lib/runtime-recovery.js
- bin/nemoclaw.js
- install.sh
- scripts/debug.sh
- scripts/install.sh
- test/cli.test.js
- test/e2e/test-onboard-repair.sh
- test/e2e/test-onboard-resume.sh
- test/install-preflight.test.js
- test/nemoclaw-cli-recovery.test.js
- test/onboard-session.test.js
- test/onboard.test.js
- test/runtime-recovery.test.js
- test/uninstall.test.js
- uninstall.sh
🧹 Nitpick comments (1)
test/onboard.test.js (1)
927-932: Consider using `findLast()` for cleaner code.
The `.slice().reverse().find()` pattern works correctly, but can be simplified with `Array.prototype.findLast()` (available in Node.js 18+). The same pattern appears at lines 1013-1018.
✨ Suggested simplification
 const payloadLine = result.stdout
   .trim()
   .split("\n")
-  .slice()
-  .reverse()
-  .find((line) => line.startsWith("{") && line.endsWith("}"));
+  .findLast((line) => line.startsWith("{") && line.endsWith("}"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.js` around lines 927 - 932, Replace the ".slice().reverse().find()" arrays used to locate the JSON payload with the simpler Array.prototype.findLast() to improve readability: change the assignment to payloadLine (the expression starting from result.stdout.trim().split("\n").slice().reverse().find(...)) to use .findLast(line => line.startsWith("{") && line.endsWith("}")) instead, and make the same replacement for the other identical pattern later in the file (the second result.stdout processing around the other payload extraction). Ensure Node.js 18+ runtime compatibility is acceptable before applying.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/onboard.test.js`:
- Around line 927-932: Replace the ".slice().reverse().find()" arrays used to
locate the JSON payload with the simpler Array.prototype.findLast() to improve
readability: change the assignment to payloadLine (the expression starting from
result.stdout.trim().split("\n").slice().reverse().find(...)) to use
.findLast(line => line.startsWith("{") && line.endsWith("}")) instead, and make
the same replacement for the other identical pattern later in the file (the
second result.stdout processing around the other payload extraction). Ensure
Node.js 18+ runtime compatibility is acceptable before applying.
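To make the equivalence in this nitpick concrete, here is a small sketch that runs both forms over the same input. The sample `stdout` string is invented for illustration, and the `typeof` guard is an assumption added so the snippet also runs where `findLast` is unavailable.

```javascript
// Invented sample output: log noise around two JSON-looking lines.
const stdout = 'starting\n{"step":"gateway"}\nwarn: retry\n{"step":"sandbox"}\ndone\n';
const isJsonLine = (line) => line.startsWith("{") && line.endsWith("}");
const lines = stdout.trim().split("\n");

// Original pattern: copy, reverse, then take the first match.
const viaReverse = lines.slice().reverse().find(isJsonLine);

// Suggested pattern: search from the end without copying the array.
const viaFindLast =
  typeof lines.findLast === "function" ? lines.findLast(isJsonLine) : viaReverse;
```

Both expressions select the last line that looks like a JSON object, so the choice between them comes down to readability and the minimum runtime the tests must support.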
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2bb959e9-9326-4a12-ad90-0daaaca7e29a
📒 Files selected for processing (9)
- bin/lib/onboard-session.js
- bin/lib/onboard.js
- install.sh
- scripts/debug.sh
- scripts/install.sh
- test/install-preflight.test.js
- test/onboard-session.test.js
- test/onboard.test.js
- test/uninstall.test.js
✅ Files skipped from review due to trivial changes (3)
- scripts/debug.sh
- test/uninstall.test.js
- bin/lib/onboard.js
🚧 Files skipped from review as they are similar to previous changes (3)
- test/onboard-session.test.js
- test/install-preflight.test.js
- bin/lib/onboard-session.js
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard-session.js`:
- Line 320: The code currently assigns updates.endpointUrl directly to
safe.endpointUrl (and later returns it from summarizeForDebug()), which can
expose credentials; update the assignment path that sets safe.endpointUrl (and
any similar assignment around the other occurrences) to first run the URL
through a redaction helper (create a small helper used by summarizeForDebug()
like redactUrl(url)): parse the URL, remove or mask userinfo
(username:password@), and also strip or mask common sensitive query parameters
(e.g., signature, sig, token, auth, access_token) and then set safe.endpointUrl
to the redacted string so summarizeForDebug() never returns credential-bearing
URLs.
- Around line 305-310: completeSession currently applies filterSafeUpdates and
sets status/failure but never flips the session's resumable flag, so sessions
created with createSession (which defaults resumable to true) remain resumable
after completion; modify completeSession (the function that calls updateSession
and uses filterSafeUpdates) to explicitly set session.resumable = false after
applying safe updates so completed sessions are not considered resumable by
downstream recovery code.
- Around line 208-249: When a concurrent writer leaves the lock file unreadable
or briefly missing, the current loop treats parseLockFile(null) or readFileSync
ENOENT as stale and unlinks a still-live lock; change the logic in the
lock-acquisition loop (the block that does fs.openSync(LOCK_FILE, "wx", ...),
parseLockFile(...), and isProcessAlive(...)) so that if readFileSync throws
ENOENT or parseLockFile(...) returns null you DO NOT unlink the file but instead
treat it as a transient race and retry (i.e., continue to the next attempt after
a short backoff), and only unlink the LOCK_FILE when parseLockFile returned a
valid payload and isProcessAlive(...) is false; also update releaseOnboardLock
so it ignores ENOENT from readFileSync and does not unlink when parseLockFile
returns null (only unlink when parsed payload has matching pid).
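The lock-handling comment above describes three outcomes for an existing lock file: unreadable (retry), live owner (refuse), and dead owner (remove and retry). The sketch below is one way to express that under stated assumptions. It reuses the names `parseLockFile`-style parsing and `isProcessAlive` only loosely: `readLock`, `acquireLock`, `sleepMs`, the payload shape, and the retry count are hypothetical and are not taken from `bin/lib/onboard-session.js`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Synchronous sleep for the backoff between attempts.
function sleepMs(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function isProcessAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0 checks existence without signalling
    return true;
  } catch (err) {
    return err.code === "EPERM"; // exists, but owned by another user
  }
}

// Returns the parsed payload, or null when the file is missing or partial.
function readLock(lockFile) {
  try {
    const parsed = JSON.parse(fs.readFileSync(lockFile, "utf-8"));
    return Number.isInteger(parsed?.pid) ? parsed : null;
  } catch {
    return null;
  }
}

function acquireLock(lockFile, attempts = 5) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      const fd = fs.openSync(lockFile, "wx", 0o600); // EEXIST if already held
      try {
        fs.writeSync(fd, JSON.stringify({ pid: process.pid }));
      } finally {
        fs.closeSync(fd);
      }
      return { acquired: true };
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
    }
    const holder = readLock(lockFile);
    if (holder === null) {
      sleepMs(25); // transient race: back off and retry, never unlink
      continue;
    }
    if (isProcessAlive(holder.pid)) return { acquired: false, holder };
    fs.rmSync(lockFile, { force: true }); // valid payload with a dead owner
  }
  return { acquired: false, holder: null };
}

// Usage in a throwaway directory: the second call is refused because the
// recorded pid (this process) is still alive.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
const lockFile = path.join(lockDir, "onboard.lock");
const first = acquireLock(lockFile);
const second = acquireLock(lockFile);
```

The sketch does not clean up a lock file whose payload write fails partway; a fuller version would also remove the file in that error path.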
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5a038dc4-fa76-4a43-866e-10b03c3aa9c4
📒 Files selected for processing (1)
bin/lib/onboard-session.js
Actionable comments posted: 2
♻️ Duplicate comments (2)
bin/lib/onboard-session.js (2)
97-113: ⚠️ Potential issue | 🟠 Major
Redact fragments and provider-signed query params too.
Line 105 only masks exact key matches, so URLs like `...?X-Amz-Signature=...` / `...?X-Goog-Signature=...` and fragment tokens such as `#access_token=...` still survive into the saved session and `nemoclaw debug`. Strip `url.hash` and broaden the sensitive-key match here.
🧼 Suggested hardening
+function isSensitiveUrlParam(key) {
+  return (
+    /^(x-amz-|x-goog-)/i.test(key) ||
+    /(^|[-_])(sig|signature|token|auth|credential|key)([-_]|$)/i.test(key)
+  );
+}
+
 function redactUrl(value) {
   if (typeof value !== "string" || value.length === 0) return null;
   try {
     const url = new URL(value);
@@
-    for (const key of [...url.searchParams.keys()]) {
-      if (/^(signature|sig|token|auth|access_token)$/i.test(key)) {
+    url.hash = "";
+    for (const key of [...url.searchParams.keys()]) {
+      if (isSensitiveUrlParam(key)) {
         url.searchParams.set(key, "<REDACTED>");
       }
     }
     return url.toString();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard-session.js` around lines 97 - 113, The redactUrl function currently only masks exact query key matches and leaves URL fragments and provider-prefixed signature keys intact; update redactUrl to also clear url.hash (to remove fragment tokens) and broaden the search-key matching to include common prefixed variations like /(^|[-_])(?:signature|sig|token|auth|access_token)$/i or similar so keys such as X-Amz-Signature and X-Goog-Signature are caught, iterating url.searchParams to set matched keys to "<REDACTED>" (as it already does) and falling back to redactSensitiveText(value) on parse errors; reference redactUrl, url.hash, url.searchParams, and redactSensitiveText when making the change.
227-271: ⚠️ Potential issue | 🟠 Major
Keep the lock path fail-closed under partial writes.
Line 230 creates the lock path before its JSON payload is complete. If that write throws, Line 231 never runs, so the fd stays open and the half-written lock is left behind; and if a competing reader hits that window twice, this path still falls through to `{ stale: true }` even though the owner may still be alive. Please close and clean up the creator path in a `finally`, and only report `stale: true` after successfully parsing a dead-owner record.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard-session.js`:
- Around line 298-309: markStepStarted and markStepFailed currently leave an old
completedAt timestamp on the step which makes non-complete states appear
completed; update both functions (markStepStarted and markStepFailed) inside the
updateSession callback where you mutate session.steps[stepName] (the step
object) to explicitly clear step.completedAt (e.g., set to null or undefined)
whenever you transition the step.status away from "complete", keeping the
existing changes to status, startedAt, error, failure, and session.status as is.
- Around line 365-369: The helper currently sets safe.metadata = {} whenever
isObject(updates.metadata) is true, which causes existing metadata.gatewayName
to be dropped if updates.metadata lacks gatewayName; change the logic in the
metadata handling (the isObject(updates.metadata) block) to only set
safe.metadata when a whitelisted field is present—check for typeof
updates.metadata.gatewayName === "string" and only then initialize safe.metadata
and assign safe.metadata.gatewayName, avoiding creating an empty safe.metadata
that would be merged away by Object.assign.
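The metadata point above is easy to miss because `Object.assign` replaces nested objects wholesale rather than merging them. Here is a reduced sketch of the whitelist behaviour the comment asks for. `filterSafeUpdates` is simplified to two fields, and the session values (`demo-gateway`, the provider names used as sample data) are placeholders for illustration.

```javascript
// Simplified stand-in for the update filter: only whitelisted fields pass,
// and metadata is emitted only when a whitelisted metadata field is present.
function filterSafeUpdates(updates) {
  const safe = {};
  if (typeof updates.provider === "string") safe.provider = updates.provider;
  const metadata = updates.metadata;
  if (
    metadata !== null &&
    typeof metadata === "object" &&
    typeof metadata.gatewayName === "string"
  ) {
    safe.metadata = { gatewayName: metadata.gatewayName };
  }
  return safe;
}

// Placeholder session; the update carries metadata without gatewayName.
const session = { provider: "nvidia-nim", metadata: { gatewayName: "demo-gateway" } };
Object.assign(session, filterSafeUpdates({ provider: "nvidia-prod", metadata: { other: 1 } }));
```

Because the filter returns no `metadata` key for this update, the existing `gatewayName` survives the shallow merge; returning an empty `metadata` object instead would have overwritten it.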
---
Duplicate comments:
In `@bin/lib/onboard-session.js`:
- Around line 97-113: The redactUrl function currently only masks exact query
key matches and leaves URL fragments and provider-prefixed signature keys
intact; update redactUrl to also clear url.hash (to remove fragment tokens) and
broaden the search-key matching to include common prefixed variations like
/(^|[-_])(?:signature|sig|token|auth|access_token)$/i or similar so keys such as
X-Amz-Signature and X-Goog-Signature are caught, iterating url.searchParams to
set matched keys to "<REDACTED>" (as it already does) and falling back to
redactSensitiveText(value) on parse errors; reference redactUrl, url.hash,
url.searchParams, and redactSensitiveText when making the change.
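Putting the redaction suggestions together, a standalone version might look like the sketch below. The predicate is adapted from the suggested hardening in the review; clearing `username` and `password` follows the earlier userinfo comment. One deliberate difference is labeled in the code: on a parse failure this sketch masks the whole value, whereas the review refers to a `redactSensitiveText(value)` fallback whose behaviour is not shown on this page.

```javascript
// Adapted from the review suggestion: match provider-prefixed and
// delimiter-bounded sensitive parameter names.
function isSensitiveUrlParam(key) {
  return (
    /^(x-amz-|x-goog-)/i.test(key) ||
    /(^|[-_])(sig|signature|token|auth|credential|key)([-_]|$)/i.test(key)
  );
}

function redactUrl(value) {
  if (typeof value !== "string" || value.length === 0) return null;
  let url;
  try {
    url = new URL(value);
  } catch {
    return "<REDACTED>"; // assumption: unparseable input is masked wholesale
  }
  url.username = ""; // drop userinfo (username:password@)
  url.password = "";
  url.hash = ""; // fragments can carry tokens such as #access_token=...
  for (const key of [...url.searchParams.keys()]) {
    if (isSensitiveUrlParam(key)) url.searchParams.set(key, "<REDACTED>");
  }
  return url.toString();
}

const redacted = redactUrl(
  "https://user:hunter2pw@example.com/v1?X-Amz-Signature=sigvalue123&model=demo#access_token=tok456",
);
```

Note that `URLSearchParams` percent-encodes the placeholder when the URL is serialized, so the angle brackets will not appear literally in the output.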
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 786f3657-8be4-4a4c-a389-7f372633331e
📒 Files selected for processing (2)
- bin/lib/onboard-session.js
- test/onboard-session.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-session.test.js
* fix: improve gateway lifecycle recovery (NVIDIA#953) * fix: improve gateway lifecycle recovery * docs: fix readme markdown list spacing * fix: tighten gateway lifecycle review follow-ups * fix: simplify tokenized control ui output * fix: restore chat route in control ui urls * refactor: simplify ansi stripping in onboard * fix: shorten control ui url output * fix: move control ui below cli next steps * fix: swap hard/soft ulimit settings in start script (NVIDIA#951) Fixes NVIDIA#949 Co-authored-by: KJ <kejones@nvidia.com> * chore: add cyclomatic complexity lint rule (NVIDIA#875) * chore: add cyclomatic complexity rule (ratchet from 95) Add ESLint complexity rule to bin/ and scripts/ to prevent new functions from accumulating excessive branching. Starting threshold is 95 (current worst offender: setupNim in onboard.js). Ratchet plan: 95 → 40 → 25 → 15. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: ratchet complexity to 20, suppress existing violations Suppress 6 functions that exceed the threshold with eslint-disable comments so we can start enforcing at 20 instead of 95: - setupNim (95), setupPolicies (41), setupInference (22) in onboard.js - deploy (22), main IIFE (27) in nemoclaw.js - applyPreset (24) in policies.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: suppress complexity for 3 missed functions preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add host-side config and state file locations to README (NVIDIA#903) Signed-off-by: peteryuqin <peter.yuqin@gmail.com> * chore: add tsconfig.cli.json, root execa, TS coverage ratchet (NVIDIA#913) * chore: add tsconfig.cli.json, root execa, TS coverage ratchet Foundation for the CLI TypeScript migration (PR 0 of the shell consolidation plan). No runtime changes — config, tooling, and dependency only. 
- tsconfig.cli.json: strict TS type-checking for bin/ and scripts/ (noEmit, module: preserve — tsx handles the runtime) - scripts/check-coverage-ratchet.ts: pure TS replacement for the bash+python coverage ratchet script (same logic, same tolerance) - execa ^9.6.1 added to root devDependencies (used by PR 1+) - pr.yaml: coverage ratchet step now runs the TS version via tsx - .pre-commit-config.yaml: SPDX headers cover scripts/*.ts, new tsc-check-cli pre-push hook - CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook - Delete scripts/check-coverage-ratchet.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Apply suggestion from @brandonpelfrey * chore: address PR feedback — use types_or, add tsx devDep - Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli hook per @brandonpelfrey's suggestion. - Add `tsx` to devDependencies so CI doesn't re-fetch it on every run per CodeRabbit's suggestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ci): ignore GitHub "Apply suggestion" commits in commitlint * fix(ci): lint only PR title since repo is squash-merge only Reverts the commitlint ignores rule from the previous commit and instead removes the per-commit lint step entirely. Individual commit messages are discarded at merge time — only the squash-merged PR title lands in main and drives changelog generation. Drop the per-commit lint, keep the PR title check, and remove the now-unnecessary fetch-depth: 0. * Revert "fix(ci): lint only PR title since repo is squash-merge only" This reverts commit 1257a47. * Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint" This reverts commit c395657. 
* docs: fix markdownlint MD032 in README (blank line before list) * refactor: make coverage ratchet script idiomatic TypeScript - Wrap in main() with process.exitCode instead of scattered process.exit() - Replace mutable flags with .map()/.some() over typed MetricResult[] - Separate pure logic (checkMetrics) from formatting (formatReport) - Throw with { cause } chaining instead of exit-in-helpers - Derive CoverageThresholds from METRICS tuple (single source of truth) - Exhaustive switch on CheckStatus discriminated union * refactor: remove duplication in coverage ratchet script - Drop STATUS_LABELS map; inline labels in exhaustive switch - Extract common 'metric coverage is N%' preamble in formatResult - Simplify ratchetedThresholds: use results directly (already in METRICS order) instead of re-scanning with .find() per metric - Compute 'failed' once in main, pass into formatReport to avoid duplicate .some() scan * refactor: simplify coverage ratchet with FP patterns - Extract classify() as a named pure function (replaces nested ternary) - loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern) - Drop CoverageMetric/CoverageSummary interfaces (only pct is read); use structural type at the call site instead - Inline ratchetedThresholds (one-liner, used once) - formatReport derives fail/improved from results instead of taking a pre-computed boolean (let functions derive from data, don't thread derived state) - sections.join("\n\n") replaces manual empty-string pushing - Shorter type names (Thresholds, Status, Result) — no ambiguity in a single-purpose script * refactor: strip coverage ratchet to failure-only output prek hides output from commands that exit 0, so ok/improved reporting was dead code. Remove Status, Result, classify, formatResult, formatReport, and the ratcheted-thresholds suggestion block. The script now just filters for regressions and prints actionable errors on failure. 
--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com> * fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets (NVIDIA#438) * fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket connections when endpoints are configured with protocol:rest + tls:terminate. Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses HTTP-level timeouts entirely. Discord: - gateway.discord.gg → access:full (WebSocket gateway) - Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions) - Add media.discordapp.net for attachment access Slack: - Add wss-primary.slack.com and wss-backup.slack.com → access:full (Socket Mode WebSocket endpoints) Partially addresses NVIDIA#409 — the policy-level fix enables WebSocket connections to survive. The hardcoded 2-min timeout in openshell-sandbox still affects any protocol:rest endpoints with long-lived connections. Related: NVIDIA#361 (WhatsApp Web, same root cause) * fix: correct comment wording for media endpoint and YAML formatting * fix: standardize Node.js minimum version to 22.16 (NVIDIA#840) * fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh Shellcheck flagged it as unused after the min/recommended merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: enforce full semver >=22.16.0 in installer scripts The runtime checks only compared the major Node.js version, allowing 22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the version_gte() helper for full semver comparison in both installers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden version_gte and align fallback message Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1") that would crash bash arithmetic. 
  Also update the manual-install fallback message to reference MIN_NODE_VERSION instead of hardcoded "22".
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test
  - Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0
  - Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0
  - Add new test asserting Node.js 20 is rejected by ensure_supported_runtime
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: harden installer and onboard resiliency (NVIDIA#961)
* fix: harden installer and onboard resiliency
* fix: address installer and debug review follow-ups
* fix: harden onboard resume across later setup steps
* test: simplify payload extraction in onboard tests
* test: keep onboard payload extraction target-compatible
* chore: align onboard session lint with complexity rule
* fix: harden onboard session safety and lock handling
* fix: tighten onboard session redaction and metadata handling
* fix(security): strip credentials from migration snapshots and enforce blueprint digest (NVIDIA#769)
  Reconciles NVIDIA#156 and NVIDIA#743 into a single comprehensive solution:
  - Filter auth-profiles.json at copy time via cpSync filter (from NVIDIA#743)
  - Recursive stripCredentials() with pattern-based field detection for deep config sanitization (from NVIDIA#156: CREDENTIAL_FIELDS set + CREDENTIAL_FIELD_PATTERN regex)
  - Remove gateway config section (contains auth tokens) from sandbox openclaw.json
  - Blueprint digest verification (SHA-256): recorded at snapshot time, validated on restore, empty/missing digest is a hard failure
  - computeFileDigest() throws when blueprint file is missing instead of silently returning null
  - Sanitize both snapshot-level and sandbox-bundle openclaw.json copies
  - Backward compatible: old snapshots without blueprintDigest skip validation
  - Bump SNAPSHOT_VERSION 2 → 3
  Supersedes NVIDIA#156 and NVIDIA#743.
* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects (NVIDIA#1025)
* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects
  OpenShell injects NO_PROXY=127.0.0.1,localhost,::1 into the sandbox, missing inference.local and the gateway IP (10.200.0.1). This causes LLM inference requests to route through the egress proxy instead of going direct, and the proxy gateway IP itself gets proxied.
  Add proxy configuration block to nemoclaw-start.sh that:
  - Exports HTTP_PROXY, HTTPS_PROXY, and NO_PROXY with inference.local and the gateway IP included
  - Persists via /etc/profile.d/nemoclaw-proxy.sh (root) or ~/.profile (non-root fallback) so values survive OpenShell reconnect injection
  - Supports NEMOCLAW_PROXY_HOST / NEMOCLAW_PROXY_PORT overrides
  The non-root fallback ensures the fix works in environments like Brev where containers run without root privileges.
  Tested on DGX Spark (ARM64) and Brev VM (x86_64). Verified NO_PROXY contains inference.local and 10.200.0.1 inside the live sandbox after connect.
  Ref: NVIDIA#626, NVIDIA#704
  Ref: NVIDIA#704 (comment)
* fix(sandbox): write proxy config to ~/.bashrc for interactive reconnect sessions
  OpenShell's `sandbox connect` spawns `/bin/bash -i` (interactive, non-login), which sources ~/.bashrc, not ~/.profile or /etc/profile.d/*. The previous approach wrote to ~/.profile and /etc/profile.d/, neither of which is sourced by `bash -i`, so the narrow OpenShell-injected NO_PROXY persisted in live interactive sessions.
  Changes:
  - Write proxy snippet to ~/.bashrc (primary) and ~/.profile (login fallback)
  - Export both uppercase and lowercase proxy variants (NO_PROXY + no_proxy, HTTP_PROXY + http_proxy, etc.); Node.js undici prefers lowercase no_proxy over uppercase NO_PROXY when both are set
  - Add idempotency guard to prevent duplicate blocks on container restart
  - Update tests: verify .bashrc writing, idempotency, bash -i override behavior, and lowercase variant correctness
  Tested on DGX Spark (ARM64) and Brev VM (x86_64) with full destroy + re-onboard + live `env | grep proxy` verification inside the sandbox shell via `openshell sandbox connect`.
  Ref: NVIDIA#626
* fix(sandbox): replace stale proxy values on restart with begin/end markers
  Use begin/end markers in .bashrc/.profile proxy snippet so _write_proxy_snippet replaces the block when PROXY_HOST/PORT change instead of silently keeping stale values. Adds test coverage for the replacement path. Addresses CodeRabbit review feedback on idempotency gap.
* fix(sandbox): resolve sandbox user home dynamically when running as root
  When the entrypoint runs as root, $HOME is /root, so the proxy snippet was written to /root/.bashrc instead of the sandbox user's home. Use getent passwd to look up the sandbox user's home when running as UID 0; fall back to /sandbox if the user entry is missing. Addresses CodeRabbit review feedback on _SANDBOX_HOME resolution.
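The begin/end marker approach described in the proxy commits above can be sketched roughly as follows. This is an illustration only, not the code from nemoclaw-start.sh: the real logic is a shell function (`_write_proxy_snippet`), and the marker strings and the name `upsertMarkedBlock` here are hypothetical. The sketch works on the file contents as a string and leaves reading and writing the file to the caller.

```typescript
// Hypothetical marker strings; the actual markers used by the entrypoint
// script are not shown in this PR text.
const BEGIN = "# nemoclaw-proxy begin";
const END = "# nemoclaw-proxy end";

// Escape a literal string so it can be embedded in a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return fileText with the marked block replaced by `body`,
// or with a new marked block appended when none exists yet.
function upsertMarkedBlock(fileText: string, body: string): string {
  const block = `${BEGIN}\n${body}\n${END}`;
  const pattern = new RegExp(
    `${escapeRegExp(BEGIN)}\\n[\\s\\S]*?\\n${escapeRegExp(END)}`
  );
  if (pattern.test(fileText)) {
    // A replacer function avoids "$" in the body being read as a
    // special replacement pattern.
    return fileText.replace(pattern, () => block);
  }
  const sep = fileText === "" || fileText.endsWith("\n") ? "" : "\n";
  return `${fileText}${sep}${block}\n`;
}
```

Calling it twice with the same body leaves the text unchanged, and calling it with a new body swaps out the old values instead of appending a second block, which is the stale-value case the commit message describes.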
---------
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
* fix(policies): preset application for versionless policies (Fixes NVIDIA#35) (NVIDIA#101)
* fix(policies): allow preset application for versionless policies (Fixes NVIDIA#35)
  Fixes NVIDIA#35
  Signed-off-by: Deepak Jain <deepujain@gmail.com>
* fix: remove stale complexity suppression in policies
---------
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: Kevin Jones <kejones@nvidia.com>
* fix: restore routed inference and connect UX (NVIDIA#1037)
* fix: restore routed inference and connect UX
* fix: simplify detected local inference hint
* fix: remove stale local inference hint
* test: relax connect forward assertion
---------
Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: KJ <kejones@nvidia.com>
Co-authored-by: Emily Wilkins <80470879+epwilkins@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Peter <peter.yuqin@gmail.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Co-authored-by: Lucas Wang <lucas_wang@lucas-futures.com>
Co-authored-by: senthilr-nv <senthilr@nvidia.com>
Co-authored-by: Deepak Jain <deepujain@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
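For readers following the Node.js version commits in the log above, here is a minimal sketch of a full major.minor.patch comparison that tolerates prerelease suffixes such as "22.16.0-rc.1". It is an assumption-laden illustration, not the installer's `version_gte()` (which is a bash helper): the names `parseVersion` and `versionGte` are hypothetical, and dropping the suffix before comparing is just one possible guard. Note that this treats "22.16.0-rc.1" as equal to "22.16.0", which differs from strict semver precedence.

```typescript
// Parse "v22.16.0-rc.1" into [22, 16, 0]. Prerelease and build suffixes are
// dropped, and missing or non-numeric parts count as 0.
function parseVersion(raw: string): number[] {
  const core = raw.trim().replace(/^v/, "").split(/[-+]/)[0] ?? "";
  const parts = core.split(".").map((p) => Number.parseInt(p, 10));
  return [0, 1, 2].map((i) => {
    const n = parts[i];
    return n === undefined || Number.isNaN(n) ? 0 : n;
  });
}

// True when `actual` is at least `minimum`, comparing major, minor, patch in order.
function versionGte(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}
```

Under these assumptions, a major-only check would accept 22.15.x, while `versionGte("22.15.9", "22.16.0")` returns false, which is the gap the commit describes closing.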
Summary
Hardens NemoClaw install and onboarding resiliency across repeat runs, partial failures, and restart/runtime drift.
Supersedes #770 with a clean signed-commit branch.
Addresses:
What Changed
nemoclaw onboard --resume
nemoclaw debug
main
How To Test
Start from the PR branch:
Please focus on resiliency and recovery behavior:
Fresh onboard
./install.sh
nemoclaw onboard
nemoclaw <sandbox> status and nemoclaw <sandbox> connect
Interrupted onboard + resume
nemoclaw onboard --resume
Repeat onboard
Runtime/gateway drift
nemoclaw <sandbox> status and nemoclaw <sandbox> connect
onboard --resume guidance when needed
Debug visibility
nemoclaw debug --quick
Please include in any bug report:
nemoclaw debug --quick
openshell status
openshell sandbox list
openshell gateway info
Validation
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests