fix: harden installer and onboard resiliency by kjw3 · Pull Request #770 · NVIDIA/NemoClaw

kjw3 · 2026-03-24T05:13:29Z

Summary

Hardens NemoClaw install and onboarding resiliency across repeat runs, partial failures, and restart/runtime drift.

Addresses:

nemoclaw onboard` is not resumable — partial failures require full cleanup and restart #446 resumable onboarding
Onboarding fails with timeout during sandbox image push/import into OpenShell gateway #623 onboarding timeout during sandbox image push/import
failed to upload image tar into container --> Timeout error #427 image upload timeout during onboard
Sandbox image push fails with 'Connection reset by peer' at 2459 MiB during onboard #245 sandbox image push connection reset during onboard
Sandbox Not Available Anymore After Machine Restart #486 sandbox unavailable after machine restart
WSL2: restarting openshell-cluster breaks gateway/sandbox connectivity and forces re-onboarding #716 WSL2 gateway/sandbox connectivity drift after restart
[NeMoClaw][Ubuntu + Docker CE] Second installer/onboard run deletes previous sandbox session and leaves stale sandbox in nemoclaw list #532 second install/onboard run leaves stale state or damages prior session

What Changed

Added persisted onboarding session state and nemoclaw onboard --resume
Added sanitized onboarding state to nemoclaw debug
Made repeat onboard non-destructive by default when the sandbox is healthy
Improved recovery and messaging for gateway/runtime drift after restart
Improved sandbox image transfer failure classification and recovery guidance
Hardened installer/uninstall behavior and related shell/session handling
Tightened session-state redaction and secret handling

How To Test

Start from the PR branch:

git fetch origin
git switch fix/446-installer-resiliency
git pull --ff-only origin fix/446-installer-resiliency

Please focus on resiliency and recovery behavior:

Fresh onboard
- ./install.sh
- nemoclaw onboard
- Verify nemoclaw <sandbox> status and nemoclaw <sandbox> connect
Interrupted onboard + resume
- Interrupt or fail onboard mid-run
- Fix the underlying issue
- Run nemoclaw onboard --resume
- Verify completed steps are skipped when safe and onboarding completes without full cleanup
Repeat onboard
- Run onboard again with the same sandbox name
- Verify a healthy sandbox is reused, not silently recreated
- Run onboard with a different sandbox name
- Verify the old sandbox remains intact
Runtime/gateway drift
- After onboarding, restart the relevant runtime/gateway context if feasible
- Run nemoclaw <sandbox> status and nemoclaw <sandbox> connect
- Verify NemoClaw attempts recovery first and falls back to clear onboard --resume guidance when needed
Debug visibility
- nemoclaw debug --quick
- Verify onboarding session state is present and secrets are not exposed

Please include in any bug report:

platform/environment
exact command run
full terminal output
nemoclaw debug --quick
openshell status
openshell sandbox list
openshell gateway info

Validation

Focused Vitest coverage for onboard session state, resume logic, runtime recovery, installer/uninstall behavior, and sandbox image transfer recovery paths
E2E coverage for interrupted onboard resume and repair flows
Pre-push hook stack passing on the branch

coderabbitai · 2026-03-24T05:13:37Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds resumable per-user onboarding sessions with versioned JSON persistence, sensitive-data redaction, atomic writes, and lifecycle helpers; introduces resume/repair logic for gateways and sandboxes, runtime-recovery heuristics and CLI --resume; updates installer PATH/recovery messaging, debug/uninstall scripts, and extensive unit and E2E tests.

Changes

Cohort / File(s)	Summary
Session storage `bin/lib/onboard-session.js`	New module: per-user JSON session with versioned schema, tolerant load, normalization, sensitive-text redaction, atomic save (temp file + rename with 0600/0700 perms), clear/delete, lifecycle helpers (`markStepStarted`/`markStepComplete`/`markStepFailed`), `completeSession`, `updateSession`, and `summarizeForDebug`.
Onboard engine & integration `bin/lib/onboard.js`	Integrated session resume/save into onboarding flow; skip/resume completed steps; record failures on abnormal exit; gateway reuse classification and recovery (`getGatewayStartEnv`, `recoverGatewayRuntime`); sandbox reuse/repair and registry cleanup; build-context filtering/copying; sandbox-create failure classification and recovery hints; many exported helpers for resume/conflict handling.
Runtime recovery helpers `bin/lib/runtime-recovery.js`	New helpers to strip/parse CLI output, parse live sandbox names, classify sandbox/gateway lookup/status, decide when to attempt recovery, and choose recovery command (`onboard --resume` vs fresh).
CLI runtime & registry `bin/nemoclaw.js`	Accepts `--resume`, forwards to onboard entrypoint; async sandbox connect/status that classify `openshell` output; `reconcileRegistryWithGateway()` to prune stale registry entries; `restoreRuntimeIfNeeded()` to conditionally recover runtime and re-run classification.
Installer & wrapper `install.sh`, `scripts/install.sh`	`install.sh`: detect shell profile, set PATH-recovery flags, and emit state-driven completion messages; `scripts/install.sh`: converted to deprecated legacy wrapper that execs repo installer or prints deprecation guidance.
Debug & uninstall scripts `scripts/debug.sh`, `uninstall.sh`	`debug.sh`: adds "Onboard Session" diagnostic that prints `summarizeForDebug()` when available. `uninstall.sh`: only removes symlinked installer shim (preserves regular user-managed files) and documents `onboard-session.json` in cleanup.
Build/context helpers `bin/lib/onboard.js`	Adds helpers to filter and copy build-context paths (exclude caches, VCS, env-specific dirs) for sandbox staging.
Tests — unit & integration `test/onboard-session.test.js`, `test/runtime-recovery.test.js`, `test/onboard.test.js`, `test/cli.test.js`, `test/install-preflight.test.js`, `test/uninstall.test.js`	New/updated unit tests covering session lifecycle, redaction, sandbox/gateway classification, build-context filtering, resume conflict detection, `--resume` parsing, legacy-wrapper behavior, and uninstall session cleanup.
Tests — E2E `test/e2e/test-onboard-resume.sh`, `test/e2e/test-onboard-repair.sh`, `test/e2e/test-double-onboard.sh`	New/updated E2E scripts validating interrupted onboarding, non-interactive resume, sandbox repair/invalidation, and safe sandbox reuse across multi-phase runs.
CLI dispatch test `test/cli.test.js`	Added test asserting `--resume` is accepted and unknown flags still produce expected errors; debug output test updated to include "Onboard Session".

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "nemoclaw CLI"
    participant Session as "Onboard Session\n(persisted JSON)"
    participant Onboard as "Onboard Engine"
    participant Gateway as "OpenShell Gateway"
    participant Sandbox as "OpenShell Sandbox"

    rect rgba(200,150,255,0.5)
    Note over User,Onboard: Initial run — failure mid-step
    User->>CLI: nemoclaw onboard
    CLI->>Session: loadSession()
    Session-->>CLI: null
    CLI->>Onboard: start onboarding
    Onboard->>Onboard: run steps (preflight → gateway → sandbox ...)
    Onboard->>Gateway: start gateway
    Gateway-->>Onboard: ready
    Onboard->>Onboard: markStepStarted("sandbox")
    Onboard->>Sandbox: create/push image
    Sandbox-->>Onboard: error (OOM/timeout)
    Onboard->>Onboard: markStepFailed("sandbox")
    Onboard->>Session: saveSession(redacted failure)
    Session-->>Onboard: persisted
    end

    rect rgba(150,200,255,0.5)
    Note over User,Onboard: Resume run — skip completed steps
    User->>CLI: nemoclaw onboard --resume
    CLI->>Session: loadSession()
    Session-->>CLI: {steps, lastCompletedStep:"gateway", resumable:true}
    CLI->>Onboard: initialize(resume=true)
    Onboard->>Onboard: skip completed steps (preflight, gateway)
    Onboard->>Gateway: inspect gateway status
    Gateway-->>Onboard: healthy / reuse
    Onboard->>Sandbox: resume/create remaining steps
    Sandbox-->>Onboard: success
    Onboard->>Onboard: markStepComplete("sandbox")
    Onboard->>Session: completeSession()
    Session-->>Onboard: persisted complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hid each hop in tidy JSON rows,

I scrubbed the bits where secret danger grows.
If runs went bump and progress lost its thread,
I stitch the path so you resume ahead.
Hop on — your steps are safe inside my bed.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.12% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: harden installer and onboard resiliency' directly reflects the main changes: improvements to installer robustness, onboarding resumability, and runtime recovery.
Linked Issues check	✅ Passed	The PR comprehensively addresses all objectives from issue `#446`: implements resumable onboarding via session persistence and --resume flag, handles stale gateway state through recovery logic, and improves PATH readiness in the installer.
Out of Scope Changes check	✅ Passed	All changes are within scope: onboard-session for resumability, runtime-recovery for gateway state handling, installer PATH fixes, uninstall session cleanup, and supporting test/debug infrastructure.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/446-installer-resiliency

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

wscurran · 2026-03-24T14:54:19Z

Thanks for submitting this draft PR, it addresses issues with the installer and onboarding process, and the changes are well-documented and easy to follow, which could improve the user experience for new users. Let us know when you are ready for review.

…iliency # Conflicts: # bin/lib/onboard.js # test/onboard.test.js

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (3)

test/e2e/test-onboard-repair.sh (1)

253-255: Hardcoded model name may cause test fragility.

The assertion expects the exact model string nvidia/nemotron-3-super-120b-a12b. If the default model changes in the future, this test will fail even though the conflict detection logic is working correctly.

💡 Consider extracting expected model to a variable

+EXPECTED_DEFAULT_MODEL="${EXPECTED_DEFAULT_MODEL:-nvidia/nemotron-3-super-120b-a12b}"
+
 echo "$provider_conflict_output" | grep -q "Resumable state recorded model 'nvidia/nemotron-3-super-120b-a12b', not 'nvidia/conflicting-model'." \
+echo "$provider_conflict_output" | grep -q "Resumable state recorded model '${EXPECTED_DEFAULT_MODEL}', not 'nvidia/conflicting-model'." \
   && pass "Conflicting model message is explicit" \
   || fail "Conflicting model message missing or incorrect"

Alternatively, use a more relaxed pattern that just checks for the conflict structure without the exact model name.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-repair.sh` around lines 253 - 255, The test is brittle
because it hardcodes the exact model string in the grep assertion; update the
assertion in the test that checks provider_conflict_output to either use a
variable (e.g., define expected_model and substitute it into the check) or
switch to a relaxed regex that verifies the conflict message structure (e.g.,
checks for "Resumable state recorded model .* not .*" or "Resumable state
recorded model '.*', not '.*'") so the test verifies conflict semantics without
depending on a specific default model name.

test/e2e/test-onboard-resume.sh (1)

253-253: Consider adding a timeout guard for the E2E test.

Long-running E2E tests can hang indefinitely if external services become unresponsive. A top-level timeout would prevent CI from stalling.

💡 Optional: Add timeout wrapper

Add near the top of the script after set -uo pipefail:

# Fail if the entire test takes longer than 10 minutes
TIMEOUT_SECONDS=600
if command -v timeout >/dev/null 2>&1; then
  if [[ "${NEMOCLAW_E2E_NO_TIMEOUT:-}" != "1" ]]; then
    exec timeout --signal=TERM "$TIMEOUT_SECONDS" "$0" "$@" || exit $?
  fi
fi
export NEMOCLAW_E2E_NO_TIMEOUT=1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` at line 253, Add a top-level timeout guard
to the E2E script to prevent indefinite hangs: after the existing set -uo
pipefail line, wrap the whole script in a conditional re-exec that uses the
system timeout utility (respect the NEMOCLAW_E2E_NO_TIMEOUT env var to allow
disabling in local runs) and export that flag so the helper doesn’t reapply on
the re-exec; set a sane default (e.g., 600 seconds) for TIMEOUT_SECONDS and
ensure the re-exec uses TERM signal on timeout and returns the timeout exit code
so CI fails appropriately.

install.sh (1)

247-255: The ZSH_VERSION check is ineffective in this context.

Since the script runs under bash (per the shebang), ZSH_VERSION will never be set. The SHELL basename fallback handles the intended case, so this isn't a bug, but the first condition is dead code.

💡 Consider simplifying

 detect_shell_profile() {
   local profile="$HOME/.bashrc"
-  if [[ -n "${ZSH_VERSION:-}" ]] || [[ "$(basename "${SHELL:-}")" == "zsh" ]]; then
+  if [[ "$(basename "${SHELL:-}")" == "zsh" ]]; then
     profile="$HOME/.zshrc"
   elif [[ ! -f "$HOME/.bashrc" && -f "$HOME/.profile" ]]; then
     profile="$HOME/.profile"
   fi
   printf "%s" "$profile"
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@install.sh` around lines 247 - 255, The ZSH_VERSION check in
detect_shell_profile is dead code because the script runs under bash; remove the
redundant `[[ -n "${ZSH_VERSION:-}" ]]` clause and simplify the conditional to
rely on the SHELL basename check (i.e., keep the `[[ "$(basename "${SHELL:-}")"
== "zsh" ]]` branch), leaving the rest of detect_shell_profile (the fallback to
.profile and default to .bashrc) unchanged to preserve behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 2588-2590: The create step currently calls startRecordedStep then
awaits createSandbox and only persists sandboxName when createSandbox returns;
move persistence earlier so the chosen sandboxName is saved before the streaming
create begins: obtain or compute the sandboxName up-front (the value assigned to
sandboxName), call onboardSession.markStepComplete or another persistence method
to record the sandboxName (with provider, model, nimContainer as needed)
immediately after startRecordedStep and before awaiting createSandbox(), then
proceed to call createSandbox(gpu, model, provider, preferredInferenceApi) and
update the step result if needed; modify usages around startRecordedStep,
createSandbox, sandboxName, and onboardSession.markStepComplete accordingly.
- Around line 1239-1246: requestedProvider (from getRequestedProviderHint) is a
selection key and must be normalized to the effective provider name before
comparing with session.provider to avoid false conflicts; update the conflict
check in onboard.js to map/resolve requestedProvider to the runtime provider
name (use the same mapping/resolver used elsewhere in the codebase) and then
compare the resolved name to session.provider (i.e., resolve requestedProvider
-> effectiveProviderName, then compare effectiveProviderName !==
session.provider before pushing the conflicts entry).
- Around line 189-217: isGatewayHealthy currently treats any active gateway as
"healthy" which prevents getGatewayReuseState from ever returning
"active-unnamed"; change isGatewayHealthy(statusOutput, gwInfoOutput,
activeGatewayInfoOutput) to require the named gateway
(hasStaleGateway(gwInfoOutput) or a positive check for the named NemoClaw) in
addition to connected before returning true, so only a named NemoClaw gateway
yields "healthy"; keep the separate active-unnamed detection logic in
getGatewayReuseState (the existing activeGatewayInfoOutput checks) so
getGatewayReuseState can return "active-unnamed" when there's an active gateway
but no named NemoClaw.
- Around line 1562-1589: recoverGatewayRuntime currently returns true as soon as
any "Connected" status appears even if the attempt to select GATEWAY_NAME
failed; change it to verify the selection before signaling recovery by checking
the actually selected gateway after calling runOpenshell(["gateway","select",
GATEWAY_NAME]) (for example call runCaptureOpenshell(["gateway","current"]) or
parse an appropriate runCaptureOpenshell(["status"]) field) and only set
process.env.OPENSHELL_GATEWAY = GATEWAY_NAME and return true when the
current/active gateway equals GATEWAY_NAME; apply this check in both the initial
pre-loop check and inside the loop before treating the runtime as recovered
(references: recoverGatewayRuntime, runOpenshell, runCaptureOpenshell,
GATEWAY_NAME, process.env.OPENSHELL_GATEWAY).
- Around line 2504-2506: The three direct shell-outs that call openshell should
use the wrapper helper to respect resolved-binary behavior: replace the
runCapture invocations that populate gatewayStatus, gatewayInfo, and
activeGatewayInfo with calls to runCaptureOpenshell(...) (preserving the same
command string arguments and options), and apply the same change to any other
new resume-time openshell shell-outs in this file so they all use
runCaptureOpenshell instead of raw runCapture.
- Around line 2541-2560: When resumeProviderSelection is true the code restores
credentialEnv but does not populate the actual secret value into process.env,
causing setupInference (which reads process.env[credentialEnv]) to fail; update
the resume branch (the block guarded by resumeProviderSelection in onboard.js)
to hydrate the secret into process.env[credentialEnv] from the saved onboarding
state (e.g., the value stored on onboardSession or the persisted session object)
before proceeding to setupInference, ensuring credentialEnv and its secret are
available just like when setupNim() returns them.

---

Nitpick comments:
In `@install.sh`:
- Around line 247-255: The ZSH_VERSION check in detect_shell_profile is dead
code because the script runs under bash; remove the redundant `[[ -n
"${ZSH_VERSION:-}" ]]` clause and simplify the conditional to rely on the SHELL
basename check (i.e., keep the `[[ "$(basename "${SHELL:-}")" == "zsh" ]]`
branch), leaving the rest of detect_shell_profile (the fallback to .profile and
default to .bashrc) unchanged to preserve behavior.

In `@test/e2e/test-onboard-repair.sh`:
- Around line 253-255: The test is brittle because it hardcodes the exact model
string in the grep assertion; update the assertion in the test that checks
provider_conflict_output to either use a variable (e.g., define expected_model
and substitute it into the check) or switch to a relaxed regex that verifies the
conflict message structure (e.g., checks for "Resumable state recorded model .*
not .*" or "Resumable state recorded model '.*', not '.*'") so the test verifies
conflict semantics without depending on a specific default model name.

In `@test/e2e/test-onboard-resume.sh`:
- Line 253: Add a top-level timeout guard to the E2E script to prevent
indefinite hangs: after the existing set -uo pipefail line, wrap the whole
script in a conditional re-exec that uses the system timeout utility (respect
the NEMOCLAW_E2E_NO_TIMEOUT env var to allow disabling in local runs) and export
that flag so the helper doesn’t reapply on the re-exec; set a sane default
(e.g., 600 seconds) for TIMEOUT_SECONDS and ensure the re-exec uses TERM signal
on timeout and returns the timeout exit code so CI fails appropriately.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 897bdfbd-283f-4038-b3c2-af2d60b1dc16

📥 Commits

Reviewing files that changed from the base of the PR and between b2164e7 and a4e2821.

📒 Files selected for processing (17)

bin/lib/onboard-session.js
bin/lib/onboard.js
bin/lib/runtime-recovery.js
bin/nemoclaw.js
install.sh
scripts/debug.sh
scripts/install.sh
test/cli.test.js
test/e2e/test-double-onboard.sh
test/e2e/test-onboard-repair.sh
test/e2e/test-onboard-resume.sh
test/install-preflight.test.js
test/onboard-session.test.js
test/onboard.test.js
test/runtime-recovery.test.js
test/uninstall.test.js
uninstall.sh

bin/lib/onboard.js

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

test/e2e/test-onboard-resume.sh (1)

225-247: Relax hardcoded step-number assertions to reduce false failures.

The fixed [x/7] checks are fragile to benign workflow evolution. The semantic resume assertions above already validate the intended behavior.

💡 Suggested simplification

-if echo "$resume_output" | grep -q "\[1/7\] Preflight checks"; then
-  fail "Resume reran preflight unexpectedly"
-else
-  pass "Resume did not rerun preflight"
-fi
-
-if echo "$resume_output" | grep -q "\[2/7\] Starting OpenShell gateway"; then
-  fail "Resume reran gateway startup unexpectedly"
-else
-  pass "Resume did not rerun gateway startup"
-fi
-
-if echo "$resume_output" | grep -q "\[3/7\] Creating sandbox"; then
-  fail "Resume reran sandbox creation unexpectedly"
-else
-  pass "Resume did not rerun sandbox creation"
-fi
+# Keep semantic checks based on explicit resume markers:
+# - [resume] Skipping preflight (cached)
+# - [resume] Skipping gateway (running)
+# - [resume] Skipping sandbox (<name>)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` around lines 225 - 247, The tests currently
assert exact "[x/7]" step prefixes in the resume_output (the grep checks for
"\[1/7\] Preflight checks", "\[2/7\] Starting OpenShell gateway", "\[3/7\]
Creating sandbox", "\[5/7\] Setting up inference provider") which is fragile;
update each grep to match the semantic message without the hardcoded step
numbers (e.g., grep for "Preflight checks", "Starting OpenShell gateway",
"Creating sandbox", "Setting up inference provider" or use a regex that allows
an optional "[*/*]" prefix) so the assertions still verify the intended behavior
using resume_output but no longer fail on benign reordering/renumbering of
steps.

test/e2e/test-onboard-repair.sh (1)

267-277: Avoid hardcoding recorded provider/model in conflict checks.

These assertions depend on current defaults rather than the actual recorded session values, so they can fail after legitimate default changes.

💡 Suggested refactor (derive expectations from session file)

+RECORDED_PROVIDER="$(node -e 'const fs=require("fs");const s=JSON.parse(fs.readFileSync(process.argv[1],"utf8"));process.stdout.write(s.provider||"");' "$SESSION_FILE")"
+RECORDED_MODEL="$(node -e 'const fs=require("fs");const s=JSON.parse(fs.readFileSync(process.argv[1],"utf8"));process.stdout.write(s.model||"");' "$SESSION_FILE")"
+
-if echo "$provider_conflict_output" | grep -q "Resumable state recorded provider 'nvidia-nim', not 'cloud'."; then
+if echo "$provider_conflict_output" | grep -q "Resumable state recorded provider '${RECORDED_PROVIDER}', not 'cloud'."; then
   pass "Conflicting provider message is explicit"
 else
   fail "Conflicting provider message missing or incorrect"
 fi
 
-if echo "$provider_conflict_output" | grep -q "Resumable state recorded model 'nvidia/nemotron-3-super-120b-a12b', not 'nvidia/conflicting-model'."; then
+if echo "$provider_conflict_output" | grep -q "Resumable state recorded model '${RECORDED_MODEL}', not 'nvidia/conflicting-model'."; then
   pass "Conflicting model message is explicit"
 else
   fail "Conflicting model message missing or incorrect"
 fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-repair.sh` around lines 267 - 277, The test currently
hardcodes expected provider/model strings when asserting on
provider_conflict_output; instead parse the actual recorded values from the
session (e.g., read the session JSON used in the test) and build the expected
substrings dynamically, then assert that provider_conflict_output contains those
derived messages. Concretely, in test/e2e/test-onboard-repair.sh replace the
static grep literals with values extracted from the session file (the same file
used to record resumable state) and reference those variables when checking
provider_conflict_output and model conflict messages so the assertions follow
the recorded session values rather than fixed defaults.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-onboard-repair.sh`:
- Around line 63-80: Add a trap-backed guaranteed teardown so cleanup always
runs on early exits: at the top of the script (after
SANDBOX_NAME/OTHER_SANDBOX_NAME/SESSION_FILE are set) register a trap for EXIT
and ERR that invokes a cleanup function; implement that cleanup function to run
the same teardown commands (run_nemoclaw "$SANDBOX_NAME" destroy, run_nemoclaw
"$OTHER_SANDBOX_NAME" destroy, openshell sandbox delete for both names,
openshell forward stop 18789, openshell gateway destroy -g nemoclaw, and rm -f
"$SESSION_FILE") and ensure the trap is idempotent (safe to call multiple times)
so all exit paths no longer leak sandbox/gateway/session artifacts.

In `@test/e2e/test-onboard-resume.sh`:
- Around line 274-278: The test currently uses grep against $REGISTRY which can
match incidental substrings; change the assertion to parse the registry JSON and
verify a real sandbox entry exists. Replace the grep check with a jq-based check
that reads $REGISTRY and tests for an object with the sandbox name (e.g. jq
'any(.sandboxes[]; .name == env.SANDBOX_NAME)' or similar), and use that jq exit
status to call pass "Registry contains resumed sandbox entry" or fail "Registry
does not contain resumed sandbox entry" while keeping the same variables
($REGISTRY, $SANDBOX_NAME) and pass/fail calls.
- Around line 64-79: Define a reusable cleanup function (e.g., cleanup) that
runs the existing teardown steps (calling run_nemoclaw "$SANDBOX_NAME" destroy,
openshell sandbox delete "$SANDBOX_NAME", openshell forward stop 18789,
openshell gateway destroy -g nemoclaw and rm -f "$SESSION_FILE") and install it
as an EXIT trap near the top of the script so it always runs even if the script
aborts early; ensure the trap is registered immediately after initializing
SANDBOX_NAME/SESSION_FILE (and before any early-exit branches) and make the
cleanup function idempotent/silent on missing resources so repeated runs won’t
error.

---

Nitpick comments:
In `@test/e2e/test-onboard-repair.sh`:
- Around line 267-277: The test currently hardcodes expected provider/model
strings when asserting on provider_conflict_output; instead parse the actual
recorded values from the session (e.g., read the session JSON used in the test)
and build the expected substrings dynamically, then assert that
provider_conflict_output contains those derived messages. Concretely, in
test/e2e/test-onboard-repair.sh replace the static grep literals with values
extracted from the session file (the same file used to record resumable state)
and reference those variables when checking provider_conflict_output and model
conflict messages so the assertions follow the recorded session values rather
than fixed defaults.

In `@test/e2e/test-onboard-resume.sh`:
- Around line 225-247: The tests currently assert exact "[x/7]" step prefixes in
the resume_output (the grep checks for "\[1/7\] Preflight checks", "\[2/7\]
Starting OpenShell gateway", "\[3/7\] Creating sandbox", "\[5/7\] Setting up
inference provider") which is fragile; update each grep to match the semantic
message without the hardcoded step numbers (e.g., grep for "Preflight checks",
"Starting OpenShell gateway", "Creating sandbox", "Setting up inference
provider" or use a regex that allows an optional "[*/*]" prefix) so the
assertions still verify the intended behavior using resume_output but no longer
fail on benign reordering/renumbering of steps.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a4191510-d125-41e7-88d8-a61fe074532f

📥 Commits

Reviewing files that changed from the base of the PR and between a4e2821 and 9c73c7f.

📒 Files selected for processing (3)

scripts/debug.sh
test/e2e/test-onboard-repair.sh
test/e2e/test-onboard-resume.sh

coderabbitai · 2026-03-25T14:59:12Z

test/e2e/test-onboard-repair.sh

+SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"
+OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"
+SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
+RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
+
+# ══════════════════════════════════════════════════════════════════
+# Phase 0: Pre-cleanup
+# ══════════════════════════════════════════════════════════════════
+section "Phase 0: Pre-cleanup"
+info "Destroying any leftover sandbox/gateway from previous runs..."
+run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
+run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
+openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
+openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
+openshell forward stop 18789 2>/dev/null || true
+openshell gateway destroy -g nemoclaw 2>/dev/null || true
+rm -f "$SESSION_FILE"
+pass "Pre-cleanup complete"


⚠️ Potential issue | 🟠 Major

Add guaranteed teardown on early exit paths.

There are multiple exit 1 branches before final cleanup, so failed runs can leak sandbox/gateway/session artifacts and poison subsequent runs.

💡 Suggested hardening (trap-backed cleanup)

SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}" OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}" SESSION_FILE="$HOME/.nemoclaw/onboard-session.json" RESTORE_API_KEY="${NVIDIA_API_KEY:-}" + +cleanup() { + run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true + run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true + openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true + openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true + openshell forward stop 18789 2>/dev/null || true + openshell gateway destroy -g nemoclaw 2>/dev/null || true + rm -f "$SESSION_FILE" +} + +trap cleanup EXIT

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"

OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"

SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"

RESTORE_API_KEY="${NVIDIA_API_KEY:-}"

# ══════════════════════════════════════════════════════════════════

# Phase 0: Pre-cleanup

# ══════════════════════════════════════════════════════════════════

section "Phase 0: Pre-cleanup"

info "Destroying any leftover sandbox/gateway from previous runs..."

run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true

run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true

openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true

openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true

openshell forward stop 18789 2>/dev/null || true

openshell gateway destroy -g nemoclaw 2>/dev/null || true

rm -f "$SESSION_FILE"

pass "Pre-cleanup complete"

SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"

OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"

SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"

RESTORE_API_KEY="${NVIDIA_API_KEY:-}"

cleanup() {

run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true

run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true

openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true

openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true

openshell forward stop 18789 2>/dev/null || true

openshell gateway destroy -g nemoclaw 2>/dev/null || true

rm -f "$SESSION_FILE"

}

trap cleanup EXIT

# ══════════════════════════════════════════════════════════════════

# Phase 0: Pre-cleanup

# ══════════════════════════════════════════════════════════════════

section "Phase 0: Pre-cleanup"

info "Destroying any leftover sandbox/gateway from previous runs..."

run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true

run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true

openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true

openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true

openshell forward stop 18789 2>/dev/null || true

openshell gateway destroy -g nemoclaw 2>/dev/null || true

rm -f "$SESSION_FILE"

pass "Pre-cleanup complete"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-onboard-repair.sh` around lines 63 - 80, Add a trap-backed guaranteed teardown so cleanup always runs on early exits: at the top of the script (after SANDBOX_NAME/OTHER_SANDBOX_NAME/SESSION_FILE are set) register a trap for EXIT and ERR that invokes a cleanup function; implement that cleanup function to run the same teardown commands (run_nemoclaw "$SANDBOX_NAME" destroy, run_nemoclaw "$OTHER_SANDBOX_NAME" destroy, openshell sandbox delete for both names, openshell forward stop 18789, openshell gateway destroy -g nemoclaw, and rm -f "$SESSION_FILE") and ensure the trap is idempotent (safe to call multiple times) so all exit paths no longer leak sandbox/gateway/session artifacts.

coderabbitai · 2026-03-25T14:59:12Z

test/e2e/test-onboard-resume.sh

+SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-resume}"
+SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
+REGISTRY="$HOME/.nemoclaw/sandboxes.json"
+RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
+
+# ══════════════════════════════════════════════════════════════════
+# Phase 0: Pre-cleanup
+# ══════════════════════════════════════════════════════════════════
+section "Phase 0: Pre-cleanup"
+info "Destroying any leftover sandbox/gateway from previous runs..."
+run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
+openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
+openshell forward stop 18789 2>/dev/null || true
+openshell gateway destroy -g nemoclaw 2>/dev/null || true
+rm -f "$SESSION_FILE"
+pass "Pre-cleanup complete"


⚠️ Potential issue | 🟠 Major

Guarantee cleanup even when the test aborts early.

This script also exits before final cleanup in several branches, which can leave artifacts behind and destabilize later E2E runs.

💡 Suggested hardening (shared cleanup + EXIT trap)

SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-resume}" SESSION_FILE="$HOME/.nemoclaw/onboard-session.json" REGISTRY="$HOME/.nemoclaw/sandboxes.json" RESTORE_API_KEY="${NVIDIA_API_KEY:-}" + +cleanup() { + run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true + openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true + openshell forward stop 18789 2>/dev/null || true + openshell gateway destroy -g nemoclaw 2>/dev/null || true + rm -f "$SESSION_FILE" +} + +trap cleanup EXIT

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-onboard-resume.sh` around lines 64 - 79, Define a reusable cleanup function (e.g., cleanup) that runs the existing teardown steps (calling run_nemoclaw "$SANDBOX_NAME" destroy, openshell sandbox delete "$SANDBOX_NAME", openshell forward stop 18789, openshell gateway destroy -g nemoclaw and rm -f "$SESSION_FILE") and install it as an EXIT trap near the top of the script so it always runs even if the script aborts early; ensure the trap is registered immediately after initializing SANDBOX_NAME/SESSION_FILE (and before any early-exit branches) and make the cleanup function idempotent/silent on missing resources so repeated runs won’t error.

coderabbitai · 2026-03-25T14:59:12Z

test/e2e/test-onboard-resume.sh

+if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then
+  pass "Registry contains resumed sandbox entry"
+else
+  fail "Registry does not contain resumed sandbox entry"
+fi


⚠️ Potential issue | 🟡 Minor

Use structured JSON validation for registry assertions.

String grep on JSON can pass on incidental matches; parse the registry to verify an actual sandbox entry.

💡 Suggested assertion

-if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then +if [ -f "$REGISTRY" ] && node -e ' +const fs = require("fs"); +const name = process.argv[1]; +const file = process.argv[2]; +const data = JSON.parse(fs.readFileSync(file, "utf8")); +const entries = Array.isArray(data) ? data : Object.values(data || {}); +process.exit(entries.some((e) => e && e.name === name) ? 0 : 1); +' "$SANDBOX_NAME" "$REGISTRY"; then pass "Registry contains resumed sandbox entry" else fail "Registry does not contain resumed sandbox entry" fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then

pass "Registry contains resumed sandbox entry"

else

fail "Registry does not contain resumed sandbox entry"

fi

if [ -f "$REGISTRY" ] && node -e '

const fs = require("fs");

const name = process.argv[1];

const file = process.argv[2];

const data = JSON.parse(fs.readFileSync(file, "utf8"));

const entries = Array.isArray(data) ? data : Object.values(data || {});

process.exit(entries.some((e) => e && e.name === name) ? 0 : 1);

' "$SANDBOX_NAME" "$REGISTRY"; then

pass "Registry contains resumed sandbox entry"

else

fail "Registry does not contain resumed sandbox entry"

fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-onboard-resume.sh` around lines 274 - 278, The test currently uses grep against $REGISTRY which can match incidental substrings; change the assertion to parse the registry JSON and verify a real sandbox entry exists. Replace the grep check with a jq-based check that reads $REGISTRY and tests for an object with the sandbox name (e.g. jq 'any(.sandboxes[]; .name == env.SANDBOX_NAME)' or similar), and use that jq exit status to call pass "Registry contains resumed sandbox entry" or fail "Registry does not contain resumed sandbox entry" while keeping the same variables ($REGISTRY, $SANDBOX_NAME) and pass/fail calls.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@install.sh`:
- Around line 106-133: print_done() currently prints full PATH/profile recovery
instructions when NEMOCLAW_READY_NOW=false and
NEMOCLAW_RECOVERY_EXPORT_DIR/_PROFILE are set, duplicating the same instructions
later in post_install_message(); remove the duplication by either (A) deleting
the blocks in print_done() that print the "%s$%s export PATH=..." and "source
..." lines and keep the detailed recovery instructions only in
post_install_message(), or (B) replace those blocks in print_done() with a
single brief line (e.g., "See post-install instructions below for PATH/profile
recovery") so post_install_message() remains the sole source of detailed
recovery steps; edit the print_done() function accordingly and reference
NEMOCLAW_RECOVERY_EXPORT_DIR and NEMOCLAW_RECOVERY_PROFILE conditionals when
making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 78dc209c-7d60-4d1e-85f9-0b6d8544b9fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9c73c7f and 8552e01.

📒 Files selected for processing (7)

bin/lib/onboard-session.js
bin/lib/onboard.js
install.sh
test/e2e/test-onboard-repair.sh
test/e2e/test-onboard-resume.sh
test/onboard-session.test.js
test/onboard.test.js

✅ Files skipped from review due to trivial changes (3)

test/onboard-session.test.js
bin/lib/onboard-session.js
test/e2e/test-onboard-resume.sh

🚧 Files skipped from review as they are similar to previous changes (2)

test/e2e/test-onboard-repair.sh
bin/lib/onboard.js

coderabbitai · 2026-03-25T15:12:56Z

install.sh

+  if [[ "$ONBOARD_RAN" == true ]]; then
+    local sandbox_name
+    sandbox_name="$(resolve_default_sandbox_name)"
+    printf "  ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"
+    printf "  ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"
+    printf "\n"
+    printf "  ${C_GREEN}Next:${C_RESET}\n"
+    printf "  %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"
+    printf "  %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"
+  elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then
+    printf "  ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"
+    printf "  ${C_DIM}Onboarding has not run yet.${C_RESET}\n"
+    printf "\n"
+    printf "  ${C_GREEN}Next:${C_RESET}\n"
+    printf "  %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
+  else
+    printf "  ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"
+    printf "  ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"
+    printf "\n"
+    printf "  ${C_GREEN}Next:${C_RESET}\n"
+    if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then
+      printf "  %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR"
+    fi
+    if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then
+      printf "  %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE"
+    fi
+    printf "  %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
+  fi


⚠️ Potential issue | 🟡 Minor

Potential duplicate recovery instructions.

When NEMOCLAW_READY_NOW=false and NEMOCLAW_RECOVERY_EXPORT_DIR is set, both print_done() (lines 126-132) and post_install_message() (lines 581-596) will print PATH recovery instructions, resulting in duplicate output for the user.

Consider either:

Removing the recovery instructions from print_done() and relying solely on post_install_message(), or

Having print_done() only print a brief message directing users to the instructions below.

🛠️ Proposed fix to avoid duplication

else printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n" printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n" printf "\n" - printf " ${C_GREEN}Next:${C_RESET}\n" - if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then - printf " %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR" - fi - if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then - printf " %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE" - fi - printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET" + printf " ${C_DIM}See recovery instructions below.${C_RESET}\n" fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if [[ "$ONBOARD_RAN" == true ]]; then

local sandbox_name

sandbox_name="$(resolve_default_sandbox_name)"

printf " ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"

printf " ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"

printf "\n"

printf " ${C_GREEN}Next:${C_RESET}\n"

printf " %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"

printf " %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"

elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then

printf " ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"

printf " ${C_DIM}Onboarding has not run yet.${C_RESET}\n"

printf "\n"

printf " ${C_GREEN}Next:${C_RESET}\n"

printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"

else

printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"

printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"

printf "\n"

printf " ${C_GREEN}Next:${C_RESET}\n"

if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then

printf " %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR"

fi

if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then

printf " %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE"

fi

printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"

fi

if [[ "$ONBOARD_RAN" == true ]]; then

local sandbox_name

sandbox_name="$(resolve_default_sandbox_name)"

printf " ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"

printf " ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"

printf "\n"

printf " ${C_GREEN}Next:${C_RESET}\n"

printf " %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"

printf " %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"

elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then

printf " ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"

printf " ${C_DIM}Onboarding has not run yet.${C_RESET}\n"

printf "\n"

printf " ${C_GREEN}Next:${C_RESET}\n"

printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"

else

printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"

printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"

printf "\n"

printf " ${C_DIM}See recovery instructions below.${C_RESET}\n"

fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@install.sh` around lines 106 - 133, print_done() currently prints full PATH/profile recovery instructions when NEMOCLAW_READY_NOW=false and NEMOCLAW_RECOVERY_EXPORT_DIR/_PROFILE are set, duplicating the same instructions later in post_install_message(); remove the duplication by either (A) deleting the blocks in print_done() that print the "%s$%s export PATH=..." and "source ..." lines and keep the detailed recovery instructions only in post_install_message(), or (B) replace those blocks in print_done() with a single brief line (e.g., "See post-install instructions below for PATH/profile recovery") so post_install_message() remains the sole source of detailed recovery steps; edit the print_done() function accordingly and reference NEMOCLAW_RECOVERY_EXPORT_DIR and NEMOCLAW_RECOVERY_PROFILE conditionals when making the change.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

bin/lib/onboard.js (2)

2569-2594: ⚠️ Potential issue | 🟠 Major

Hydrate stored credentials when resuming provider selection.

When resumeProviderSelection is true, credentialEnv is restored from the session (e.g., "NVIDIA_API_KEY"), but the actual secret value is never loaded into process.env. setupInference() at Line 2598 builds provider env from process.env[credentialEnv], which will be undefined for resumed remote-provider sessions unless the user re-exported the key in their current shell.

Proposed fix

   if (resumeProviderSelection) {
+    if (credentialEnv && !process.env[credentialEnv]) {
+      const savedCredential = getCredential(credentialEnv);
+      if (savedCredential) {
+        process.env[credentialEnv] = savedCredential;
+      }
+    }
     resumeStepMessage("provider selection", `${provider} / ${model}`);
   } else {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2569 - 2594, When resumeProviderSelection is
true you restore credentialEnv (the env var name) but not the secret value
itself, so before calling setupInference / continuing the flow set
process.env[credentialEnv] to the stored secret from the session (e.g.
session.steps.provider_selection.<credentialValueField>); update the resume
branch that uses resumeStepMessage to also read the provider_selection step
payload (the stored credential value) and assign it into
process.env[credentialEnv] so setupInference sees the secret when using
process.env[credentialEnv].

240-246: ⚠️ Potential issue | 🟠 Major

Use runOpenshell() for sandbox repair commands.

Similar to the previous issue, repairRecordedSandbox uses raw openshell commands via run() instead of the resolved-binary wrapper.

Proposed fix

 function repairRecordedSandbox(sandboxName) {
   if (!sandboxName) return;
   note(`  [resume] Cleaning up recorded sandbox '${sandboxName}' before recreating it.`);
-  run(`openshell forward stop 18789 2>/dev/null || true`, { ignoreError: true });
-  run(`openshell sandbox delete "${sandboxName}" 2>/dev/null || true`, { ignoreError: true });
+  runOpenshell(["forward", "stop", "18789"], { ignoreError: true });
+  runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
   registry.removeSandbox(sandboxName);
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 240 - 246, The repairRecordedSandbox
function currently calls raw commands via run(...) — replace those calls with
the resolved-binary wrapper runOpenshell to ensure the correct openshell binary
is used: in repairRecordedSandbox, change run(`openshell forward stop 18789
2>/dev/null || true`, { ignoreError: true }) to use runOpenshell('forward',
['stop','18789'], { ignoreError: true }) (or equivalent API your runOpenshell
exposes), and change run(`openshell sandbox delete "${sandboxName}" 2>/dev/null
|| true`, { ignoreError: true }) to runOpenshell('sandbox', ['delete',
sandboxName], { ignoreError: true }); keep the note(...) and
registry.removeSandbox(sandboxName) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 233-238: The getSandboxReuseState function currently calls
runCapture to invoke openshell which can fail if openshell is in ~/.local/bin
and PATH wasn't reloaded; change both calls that set getOutput and listOutput to
use runCaptureOpenshell instead of runCapture so openshell is resolved reliably,
keeping the same arguments (including ignoreError: true) and then pass those
outputs into getSandboxStateFromOutputs as before; update references to
getOutput and listOutput in getSandboxReuseState accordingly.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 2569-2594: When resumeProviderSelection is true you restore
credentialEnv (the env var name) but not the secret value itself, so before
calling setupInference / continuing the flow set process.env[credentialEnv] to
the stored secret from the session (e.g.
session.steps.provider_selection.<credentialValueField>); update the resume
branch that uses resumeStepMessage to also read the provider_selection step
payload (the stored credential value) and assign it into
process.env[credentialEnv] so setupInference sees the secret when using
process.env[credentialEnv].
- Around line 240-246: The repairRecordedSandbox function currently calls raw
commands via run(...) — replace those calls with the resolved-binary wrapper
runOpenshell to ensure the correct openshell binary is used: in
repairRecordedSandbox, change run(`openshell forward stop 18789 2>/dev/null ||
true`, { ignoreError: true }) to use runOpenshell('forward', ['stop','18789'], {
ignoreError: true }) (or equivalent API your runOpenshell exposes), and change
run(`openshell sandbox delete "${sandboxName}" 2>/dev/null || true`, {
ignoreError: true }) to runOpenshell('sandbox', ['delete', sandboxName], {
ignoreError: true }); keep the note(...) and registry.removeSandbox(sandboxName)
unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0cbb97c6-c79d-4ab2-9330-5b2eb2d6e1b7

📥 Commits

Reviewing files that changed from the base of the PR and between 8552e01 and b5d4f85.

📒 Files selected for processing (1)

bin/lib/onboard.js

bin/lib/onboard.js

kjw3 · 2026-03-26T05:23:24Z

Superseded by the signed-commit replacement PR created from branch . Closing this one to keep review on the replacement branch.

cv · 2026-03-27T17:19:52Z

New PR: #961

fix: harden installer and onboard resiliency

36574ff

wscurran added bug Something isn't working Getting Started Use this label to identify setup, installation, or onboarding issues. good first issue Good for newcomers priority: medium Issue that should be addressed in upcoming releases labels Mar 24, 2026

kjw3 added 9 commits March 24, 2026 22:17

Merge origin/main into fix/446-installer-resiliency

c53bb20

fix: improve onboard recovery after stale sandbox state

7fe4389

Merge remote-tracking branch 'origin/main' into fix/446-installer-res…

2acf5a3

…iliency # Conflicts: # bin/lib/onboard.js # test/onboard.test.js

fix: recover gateway runtime after restart drift

8721b13

fix: reuse healthy sandboxes on repeat onboard

6ebb41d

fix: classify sandbox image transfer failures

beffc78

test: cover sandbox image transfer recovery hints

cb0428b

fix: redact onboard session failure details

0f4c0d3

Merge branch 'main' into fix/446-installer-resiliency

a4e2821

kjw3 marked this pull request as ready for review March 25, 2026 14:38

test: satisfy shellcheck for resiliency scripts

9c73c7f

coderabbitai bot reviewed Mar 25, 2026

View reviewed changes

kjw3 added 2 commits March 25, 2026 11:04

fix: address onboard review follow-ups

8552e01

fix: handle ansi gateway metadata in onboarding

22b2797

coderabbitai bot reviewed Mar 25, 2026

View reviewed changes

Merge branch 'main' into fix/446-installer-resiliency

b5d4f85

coderabbitai bot reviewed Mar 25, 2026

View reviewed changes

bin/lib/onboard.js Show resolved Hide resolved

kjw3 added 5 commits March 25, 2026 12:06

fix: use openshell wrappers in onboard recovery

f8f7b96

Merge branch 'main' into fix/446-installer-resiliency

09e2121

fix(onboard): restore resiliency recovery flows

a8c64c7

test(e2e): align resiliency flows with onboard order

11b311d

fix(onboard): reuse healthy gateways during reruns

f90c693

kjw3 added 3 commits March 25, 2026 15:49

fix(onboard): allow healthy gateway port reuse

359f7c9

fix(onboard): allow healthy dashboard forward reuse

44acebd

test(e2e): refresh resiliency expectations

98e9b6d

cv mentioned this pull request Mar 25, 2026

refactor(cli): shell consolidation, TypeScript migration & oclif #924

Open

kjw3 added 8 commits March 25, 2026 18:30

fix: serialize onboard runs and correct resume step order

c9ba0bb

Merge origin/main into fix/446-installer-resiliency

0152363

fix: classify foreign gateway reuse conservatively

0b3a598

fix: restore legacy installer wrapper behavior

aa05263

Merge origin/main into fix/446-installer-resiliency

b1dbe9d

fix: simplify printed OpenClaw UI URLs in onboard output

0125332

fix: restore gateway recovery coverage paths

e289249

fix: align onboard control ui output with main

fee98f2

kjw3 closed this Mar 26, 2026

kjw3 mentioned this pull request Mar 26, 2026

fix: harden installer and onboard resiliency #961

Merged

kjw3 deleted the fix/446-installer-resiliency branch March 26, 2026 05:25

-if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then
-  pass "Registry contains resumed sandbox entry"
-else
-  fail "Registry does not contain resumed sandbox entry"
-fi
+if [ -f "$REGISTRY" ] && node -e '
+const fs = require("fs");
+const name = process.argv[1];
+const file = process.argv[2];
+const data = JSON.parse(fs.readFileSync(file, "utf8"));
+const entries = Array.isArray(data) ? data : Object.values(data || {});
+process.exit(entries.some((e) => e && e.name === name) ? 0 : 1);
+' "$SANDBOX_NAME" "$REGISTRY"; then
+  pass "Registry contains resumed sandbox entry"
+else
+  fail "Registry does not contain resumed sandbox entry"
+fi

Conversation

kjw3 commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

How To Test

Validation

Uh oh!

coderabbitai bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

wscurran commented Mar 24, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kjw3 commented Mar 26, 2026

Uh oh!

cv commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kjw3 commented Mar 24, 2026 •

edited

Loading

coderabbitai bot commented Mar 24, 2026 •

edited

Loading