Skip to content

fix: harden installer and onboard resiliency#770

Closed
kjw3 wants to merge 30 commits intomainfrom
fix/446-installer-resiliency
Closed

fix: harden installer and onboard resiliency#770
kjw3 wants to merge 30 commits intomainfrom
fix/446-installer-resiliency

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Mar 24, 2026

Summary

Hardens NemoClaw install and onboarding resiliency across repeat runs, partial failures, and restart/runtime drift.

Addresses:

What Changed

  • Added persisted onboarding session state and nemoclaw onboard --resume
  • Added sanitized onboarding state to nemoclaw debug
  • Made repeat onboard non-destructive by default when the sandbox is healthy
  • Improved recovery and messaging for gateway/runtime drift after restart
  • Improved sandbox image transfer failure classification and recovery guidance
  • Hardened installer/uninstall behavior and related shell/session handling
  • Tightened session-state redaction and secret handling

How To Test

Start from the PR branch:

git fetch origin
git switch fix/446-installer-resiliency
git pull --ff-only origin fix/446-installer-resiliency

Please focus on resiliency and recovery behavior:

  1. Fresh onboard

    • ./install.sh
    • nemoclaw onboard
    • Verify nemoclaw <sandbox> status and nemoclaw <sandbox> connect
  2. Interrupted onboard + resume

    • Interrupt or fail onboard mid-run
    • Fix the underlying issue
    • Run nemoclaw onboard --resume
    • Verify completed steps are skipped when safe and onboarding completes without full cleanup
  3. Repeat onboard

    • Run onboard again with the same sandbox name
    • Verify a healthy sandbox is reused, not silently recreated
    • Run onboard with a different sandbox name
    • Verify the old sandbox remains intact
  4. Runtime/gateway drift

    • After onboarding, restart the relevant runtime/gateway context if feasible
    • Run nemoclaw <sandbox> status and nemoclaw <sandbox> connect
    • Verify NemoClaw attempts recovery first and falls back to clear onboard --resume guidance when needed
  5. Debug visibility

    • nemoclaw debug --quick
    • Verify onboarding session state is present and secrets are not exposed

Please include in any bug report:

  • platform/environment
  • exact command run
  • full terminal output
  • nemoclaw debug --quick
  • openshell status
  • openshell sandbox list
  • openshell gateway info

Validation

  • Focused Vitest coverage for onboard session state, resume logic, runtime recovery, installer/uninstall behavior, and sandbox image transfer recovery paths
  • E2E coverage for interrupted onboard resume and repair flows
  • Pre-push hook stack passing on the branch

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds resumable per-user onboarding sessions with versioned JSON persistence, sensitive-data redaction, atomic writes, and lifecycle helpers; introduces resume/repair logic for gateways and sandboxes, runtime-recovery heuristics and CLI --resume; updates installer PATH/recovery messaging, debug/uninstall scripts, and extensive unit and E2E tests.

Changes

Cohort / File(s) Summary
Session storage
bin/lib/onboard-session.js
New module: per-user JSON session with versioned schema, tolerant load, normalization, sensitive-text redaction, atomic save (temp file + rename with 0600/0700 perms), clear/delete, lifecycle helpers (markStepStarted/markStepComplete/markStepFailed), completeSession, updateSession, and summarizeForDebug.
Onboard engine & integration
bin/lib/onboard.js
Integrated session resume/save into onboarding flow; skip/resume completed steps; record failures on abnormal exit; gateway reuse classification and recovery (getGatewayStartEnv, recoverGatewayRuntime); sandbox reuse/repair and registry cleanup; build-context filtering/copying; sandbox-create failure classification and recovery hints; many exported helpers for resume/conflict handling.
Runtime recovery helpers
bin/lib/runtime-recovery.js
New helpers to strip/parse CLI output, parse live sandbox names, classify sandbox/gateway lookup/status, decide when to attempt recovery, and choose recovery command (onboard --resume vs fresh).
CLI runtime & registry
bin/nemoclaw.js
Accepts --resume, forwards to onboard entrypoint; async sandbox connect/status that classify openshell output; reconcileRegistryWithGateway() to prune stale registry entries; restoreRuntimeIfNeeded() to conditionally recover runtime and re-run classification.
Installer & wrapper
install.sh, scripts/install.sh
install.sh: detect shell profile, set PATH-recovery flags, and emit state-driven completion messages; scripts/install.sh: converted to deprecated legacy wrapper that execs repo installer or prints deprecation guidance.
Debug & uninstall scripts
scripts/debug.sh, uninstall.sh
debug.sh: adds "Onboard Session" diagnostic that prints summarizeForDebug() when available. uninstall.sh: only removes symlinked installer shim (preserves regular user-managed files) and documents onboard-session.json in cleanup.
Build/context helpers
bin/lib/onboard.js
Adds helpers to filter and copy build-context paths (exclude caches, VCS, env-specific dirs) for sandbox staging.
Tests — unit & integration
test/onboard-session.test.js, test/runtime-recovery.test.js, test/onboard.test.js, test/cli.test.js, test/install-preflight.test.js, test/uninstall.test.js
New/updated unit tests covering session lifecycle, redaction, sandbox/gateway classification, build-context filtering, resume conflict detection, --resume parsing, legacy-wrapper behavior, and uninstall session cleanup.
Tests — E2E
test/e2e/test-onboard-resume.sh, test/e2e/test-onboard-repair.sh, test/e2e/test-double-onboard.sh
New/updated E2E scripts validating interrupted onboarding, non-interactive resume, sandbox repair/invalidation, and safe sandbox reuse across multi-phase runs.
CLI dispatch test
test/cli.test.js
Added test asserting --resume is accepted and unknown flags still produce expected errors; debug output test updated to include "Onboard Session".

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "nemoclaw CLI"
    participant Session as "Onboard Session\n(persisted JSON)"
    participant Onboard as "Onboard Engine"
    participant Gateway as "OpenShell Gateway"
    participant Sandbox as "OpenShell Sandbox"

    rect rgba(200,150,255,0.5)
    Note over User,Onboard: Initial run — failure mid-step
    User->>CLI: nemoclaw onboard
    CLI->>Session: loadSession()
    Session-->>CLI: null
    CLI->>Onboard: start onboarding
    Onboard->>Onboard: run steps (preflight → gateway → sandbox ...)
    Onboard->>Gateway: start gateway
    Gateway-->>Onboard: ready
    Onboard->>Onboard: markStepStarted("sandbox")
    Onboard->>Sandbox: create/push image
    Sandbox-->>Onboard: error (OOM/timeout)
    Onboard->>Onboard: markStepFailed("sandbox")
    Onboard->>Session: saveSession(redacted failure)
    Session-->>Onboard: persisted
    end

    rect rgba(150,200,255,0.5)
    Note over User,Onboard: Resume run — skip completed steps
    User->>CLI: nemoclaw onboard --resume
    CLI->>Session: loadSession()
    Session-->>CLI: {steps, lastCompletedStep:"gateway", resumable:true}
    CLI->>Onboard: initialize(resume=true)
    Onboard->>Onboard: skip completed steps (preflight, gateway)
    Onboard->>Gateway: inspect gateway status
    Gateway-->>Onboard: healthy / reuse
    Onboard->>Sandbox: resume/create remaining steps
    Sandbox-->>Onboard: success
    Onboard->>Onboard: markStepComplete("sandbox")
    Onboard->>Session: completeSession()
    Session-->>Onboard: persisted complete
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hid each hop in tidy JSON rows,

I scrubbed the bits where secret danger grows.
If runs went bump and progress lost its thread,
I stitch the path so you resume ahead.
Hop on — your steps are safe inside my bed.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.12% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: harden installer and onboard resiliency' directly reflects the main changes: improvements to installer robustness, onboarding resumability, and runtime recovery.
Linked Issues check ✅ Passed The PR comprehensively addresses all objectives from issue #446: implements resumable onboarding via session persistence and --resume flag, handles stale gateway state through recovery logic, and improves PATH readiness in the installer.
Out of Scope Changes check ✅ Passed All changes are within scope: onboard-session for resumability, runtime-recovery for gateway state handling, installer PATH fixes, uninstall session cleanup, and supporting test/debug infrastructure.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/446-installer-resiliency

Comment @coderabbitai help to get the list of available commands and usage tips.

@wscurran wscurran added bug Something isn't working Getting Started Use this label to identify setup, installation, or onboarding issues. good first issue Good for newcomers priority: medium Issue that should be addressed in upcoming releases labels Mar 24, 2026
@wscurran
Copy link
Copy Markdown
Contributor

Thanks for submitting this draft PR, it addresses issues with the installer and onboarding process, and the changes are well-documented and easy to follow, which could improve the user experience for new users. Let us know when you are ready for review.

@kjw3 kjw3 marked this pull request as ready for review March 25, 2026 14:38
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (3)
test/e2e/test-onboard-repair.sh (1)

253-255: Hardcoded model name may cause test fragility.

The assertion expects the exact model string nvidia/nemotron-3-super-120b-a12b. If the default model changes in the future, this test will fail even though the conflict detection logic is working correctly.

💡 Consider extracting expected model to a variable
+EXPECTED_DEFAULT_MODEL="${EXPECTED_DEFAULT_MODEL:-nvidia/nemotron-3-super-120b-a12b}"
+
 echo "$provider_conflict_output" | grep -q "Resumable state recorded model 'nvidia/nemotron-3-super-120b-a12b', not 'nvidia/conflicting-model'." \
+echo "$provider_conflict_output" | grep -q "Resumable state recorded model '${EXPECTED_DEFAULT_MODEL}', not 'nvidia/conflicting-model'." \
   && pass "Conflicting model message is explicit" \
   || fail "Conflicting model message missing or incorrect"

Alternatively, use a more relaxed pattern that just checks for the conflict structure without the exact model name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-repair.sh` around lines 253 - 255, The test is brittle
because it hardcodes the exact model string in the grep assertion; update the
assertion in the test that checks provider_conflict_output to either use a
variable (e.g., define expected_model and substitute it into the check) or
switch to a relaxed regex that verifies the conflict message structure (e.g.,
checks for "Resumable state recorded model .* not .*" or "Resumable state
recorded model '.*', not '.*'") so the test verifies conflict semantics without
depending on a specific default model name.
test/e2e/test-onboard-resume.sh (1)

253-253: Consider adding a timeout guard for the E2E test.

Long-running E2E tests can hang indefinitely if external services become unresponsive. A top-level timeout would prevent CI from stalling.

💡 Optional: Add timeout wrapper

Add near the top of the script after set -uo pipefail:

# Fail if the entire test takes longer than 10 minutes
TIMEOUT_SECONDS=600
if command -v timeout >/dev/null 2>&1; then
  if [[ "${NEMOCLAW_E2E_NO_TIMEOUT:-}" != "1" ]]; then
    exec timeout --signal=TERM "$TIMEOUT_SECONDS" "$0" "$@" || exit $?
  fi
fi
export NEMOCLAW_E2E_NO_TIMEOUT=1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` at line 253, Add a top-level timeout guard
to the E2E script to prevent indefinite hangs: after the existing set -uo
pipefail line, wrap the whole script in a conditional re-exec that uses the
system timeout utility (respect the NEMOCLAW_E2E_NO_TIMEOUT env var to allow
disabling in local runs) and export that flag so the helper doesn’t reapply on
the re-exec; set a sane default (e.g., 600 seconds) for TIMEOUT_SECONDS and
ensure the re-exec uses TERM signal on timeout and returns the timeout exit code
so CI fails appropriately.
install.sh (1)

247-255: The ZSH_VERSION check is ineffective in this context.

Since the script runs under bash (per the shebang), ZSH_VERSION will never be set. The SHELL basename fallback handles the intended case, so this isn't a bug, but the first condition is dead code.

💡 Consider simplifying
 detect_shell_profile() {
   local profile="$HOME/.bashrc"
-  if [[ -n "${ZSH_VERSION:-}" ]] || [[ "$(basename "${SHELL:-}")" == "zsh" ]]; then
+  if [[ "$(basename "${SHELL:-}")" == "zsh" ]]; then
     profile="$HOME/.zshrc"
   elif [[ ! -f "$HOME/.bashrc" && -f "$HOME/.profile" ]]; then
     profile="$HOME/.profile"
   fi
   printf "%s" "$profile"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@install.sh` around lines 247 - 255, The ZSH_VERSION check in
detect_shell_profile is dead code because the script runs under bash; remove the
redundant `[[ -n "${ZSH_VERSION:-}" ]]` clause and simplify the conditional to
rely on the SHELL basename check (i.e., keep the `[[ "$(basename "${SHELL:-}")"
== "zsh" ]]` branch), leaving the rest of detect_shell_profile (the fallback to
.profile and default to .bashrc) unchanged to preserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 2588-2590: The create step currently calls startRecordedStep then
awaits createSandbox and only persists sandboxName when createSandbox returns;
move persistence earlier so the chosen sandboxName is saved before the streaming
create begins: obtain or compute the sandboxName up-front (the value assigned to
sandboxName), call onboardSession.markStepComplete or another persistence method
to record the sandboxName (with provider, model, nimContainer as needed)
immediately after startRecordedStep and before awaiting createSandbox(), then
proceed to call createSandbox(gpu, model, provider, preferredInferenceApi) and
update the step result if needed; modify usages around startRecordedStep,
createSandbox, sandboxName, and onboardSession.markStepComplete accordingly.
- Around line 1239-1246: requestedProvider (from getRequestedProviderHint) is a
selection key and must be normalized to the effective provider name before
comparing with session.provider to avoid false conflicts; update the conflict
check in onboard.js to map/resolve requestedProvider to the runtime provider
name (use the same mapping/resolver used elsewhere in the codebase) and then
compare the resolved name to session.provider (i.e., resolve requestedProvider
-> effectiveProviderName, then compare effectiveProviderName !==
session.provider before pushing the conflicts entry).
- Around line 189-217: isGatewayHealthy currently treats any active gateway as
"healthy" which prevents getGatewayReuseState from ever returning
"active-unnamed"; change isGatewayHealthy(statusOutput, gwInfoOutput,
activeGatewayInfoOutput) to require the named gateway
(hasStaleGateway(gwInfoOutput) or a positive check for the named NemoClaw) in
addition to connected before returning true, so only a named NemoClaw gateway
yields "healthy"; keep the separate active-unnamed detection logic in
getGatewayReuseState (the existing activeGatewayInfoOutput checks) so
getGatewayReuseState can return "active-unnamed" when there's an active gateway
but no named NemoClaw.
- Around line 1562-1589: recoverGatewayRuntime currently returns true as soon as
any "Connected" status appears even if the attempt to select GATEWAY_NAME
failed; change it to verify the selection before signaling recovery by checking
the actually selected gateway after calling runOpenshell(["gateway","select",
GATEWAY_NAME]) (for example call runCaptureOpenshell(["gateway","current"]) or
parse an appropriate runCaptureOpenshell(["status"]) field) and only set
process.env.OPENSHELL_GATEWAY = GATEWAY_NAME and return true when the
current/active gateway equals GATEWAY_NAME; apply this check in both the initial
pre-loop check and inside the loop before treating the runtime as recovered
(references: recoverGatewayRuntime, runOpenshell, runCaptureOpenshell,
GATEWAY_NAME, process.env.OPENSHELL_GATEWAY).
- Around line 2504-2506: The three direct shell-outs that call openshell should
use the wrapper helper to respect resolved-binary behavior: replace the
runCapture invocations that populate gatewayStatus, gatewayInfo, and
activeGatewayInfo with calls to runCaptureOpenshell(...) (preserving the same
command string arguments and options), and apply the same change to any other
new resume-time openshell shell-outs in this file so they all use
runCaptureOpenshell instead of raw runCapture.
- Around line 2541-2560: When resumeProviderSelection is true the code restores
credentialEnv but does not populate the actual secret value into process.env,
causing setupInference (which reads process.env[credentialEnv]) to fail; update
the resume branch (the block guarded by resumeProviderSelection in onboard.js)
to hydrate the secret into process.env[credentialEnv] from the saved onboarding
state (e.g., the value stored on onboardSession or the persisted session object)
before proceeding to setupInference, ensuring credentialEnv and its secret are
available just like when setupNim() returns them.

---

Nitpick comments:
In `@install.sh`:
- Around line 247-255: The ZSH_VERSION check in detect_shell_profile is dead
code because the script runs under bash; remove the redundant `[[ -n
"${ZSH_VERSION:-}" ]]` clause and simplify the conditional to rely on the SHELL
basename check (i.e., keep the `[[ "$(basename "${SHELL:-}")" == "zsh" ]]`
branch), leaving the rest of detect_shell_profile (the fallback to .profile and
default to .bashrc) unchanged to preserve behavior.

In `@test/e2e/test-onboard-repair.sh`:
- Around line 253-255: The test is brittle because it hardcodes the exact model
string in the grep assertion; update the assertion in the test that checks
provider_conflict_output to either use a variable (e.g., define expected_model
and substitute it into the check) or switch to a relaxed regex that verifies the
conflict message structure (e.g., checks for "Resumable state recorded model .*
not .*" or "Resumable state recorded model '.*', not '.*'") so the test verifies
conflict semantics without depending on a specific default model name.

In `@test/e2e/test-onboard-resume.sh`:
- Line 253: Add a top-level timeout guard to the E2E script to prevent
indefinite hangs: after the existing set -uo pipefail line, wrap the whole
script in a conditional re-exec that uses the system timeout utility (respect
the NEMOCLAW_E2E_NO_TIMEOUT env var to allow disabling in local runs) and export
that flag so the helper doesn’t reapply on the re-exec; set a sane default
(e.g., 600 seconds) for TIMEOUT_SECONDS and ensure the re-exec uses TERM signal
on timeout and returns the timeout exit code so CI fails appropriately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 897bdfbd-283f-4038-b3c2-af2d60b1dc16

📥 Commits

Reviewing files that changed from the base of the PR and between b2164e7 and a4e2821.

📒 Files selected for processing (17)
  • bin/lib/onboard-session.js
  • bin/lib/onboard.js
  • bin/lib/runtime-recovery.js
  • bin/nemoclaw.js
  • install.sh
  • scripts/debug.sh
  • scripts/install.sh
  • test/cli.test.js
  • test/e2e/test-double-onboard.sh
  • test/e2e/test-onboard-repair.sh
  • test/e2e/test-onboard-resume.sh
  • test/install-preflight.test.js
  • test/onboard-session.test.js
  • test/onboard.test.js
  • test/runtime-recovery.test.js
  • test/uninstall.test.js
  • uninstall.sh

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
test/e2e/test-onboard-resume.sh (1)

225-247: Relax hardcoded step-number assertions to reduce false failures.

The fixed [x/7] checks are fragile to benign workflow evolution. The semantic resume assertions above already validate the intended behavior.

💡 Suggested simplification
-if echo "$resume_output" | grep -q "\[1/7\] Preflight checks"; then
-  fail "Resume reran preflight unexpectedly"
-else
-  pass "Resume did not rerun preflight"
-fi
-
-if echo "$resume_output" | grep -q "\[2/7\] Starting OpenShell gateway"; then
-  fail "Resume reran gateway startup unexpectedly"
-else
-  pass "Resume did not rerun gateway startup"
-fi
-
-if echo "$resume_output" | grep -q "\[3/7\] Creating sandbox"; then
-  fail "Resume reran sandbox creation unexpectedly"
-else
-  pass "Resume did not rerun sandbox creation"
-fi
+# Keep semantic checks based on explicit resume markers:
+# - [resume] Skipping preflight (cached)
+# - [resume] Skipping gateway (running)
+# - [resume] Skipping sandbox (<name>)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` around lines 225 - 247, The tests currently
assert exact "[x/7]" step prefixes in the resume_output (the grep checks for
"\[1/7\] Preflight checks", "\[2/7\] Starting OpenShell gateway", "\[3/7\]
Creating sandbox", "\[5/7\] Setting up inference provider") which is fragile;
update each grep to match the semantic message without the hardcoded step
numbers (e.g., grep for "Preflight checks", "Starting OpenShell gateway",
"Creating sandbox", "Setting up inference provider" or use a regex that allows
an optional "[*/*]" prefix) so the assertions still verify the intended behavior
using resume_output but no longer fail on benign reordering/renumbering of
steps.
test/e2e/test-onboard-repair.sh (1)

267-277: Avoid hardcoding recorded provider/model in conflict checks.

These assertions depend on current defaults rather than the actual recorded session values, so they can fail after legitimate default changes.

💡 Suggested refactor (derive expectations from session file)
+RECORDED_PROVIDER="$(node -e 'const fs=require("fs");const s=JSON.parse(fs.readFileSync(process.argv[1],"utf8"));process.stdout.write(s.provider||"");' "$SESSION_FILE")"
+RECORDED_MODEL="$(node -e 'const fs=require("fs");const s=JSON.parse(fs.readFileSync(process.argv[1],"utf8"));process.stdout.write(s.model||"");' "$SESSION_FILE")"
+
-if echo "$provider_conflict_output" | grep -q "Resumable state recorded provider 'nvidia-nim', not 'cloud'."; then
+if echo "$provider_conflict_output" | grep -q "Resumable state recorded provider '${RECORDED_PROVIDER}', not 'cloud'."; then
   pass "Conflicting provider message is explicit"
 else
   fail "Conflicting provider message missing or incorrect"
 fi
 
-if echo "$provider_conflict_output" | grep -q "Resumable state recorded model 'nvidia/nemotron-3-super-120b-a12b', not 'nvidia/conflicting-model'."; then
+if echo "$provider_conflict_output" | grep -q "Resumable state recorded model '${RECORDED_MODEL}', not 'nvidia/conflicting-model'."; then
   pass "Conflicting model message is explicit"
 else
   fail "Conflicting model message missing or incorrect"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-repair.sh` around lines 267 - 277, The test currently
hardcodes expected provider/model strings when asserting on
provider_conflict_output; instead parse the actual recorded values from the
session (e.g., read the session JSON used in the test) and build the expected
substrings dynamically, then assert that provider_conflict_output contains those
derived messages. Concretely, in test/e2e/test-onboard-repair.sh replace the
static grep literals with values extracted from the session file (the same file
used to record resumable state) and reference those variables when checking
provider_conflict_output and model conflict messages so the assertions follow
the recorded session values rather than fixed defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-onboard-repair.sh`:
- Around line 63-80: Add a trap-backed guaranteed teardown so cleanup always
runs on early exits: at the top of the script (after
SANDBOX_NAME/OTHER_SANDBOX_NAME/SESSION_FILE are set) register a trap for EXIT
and ERR that invokes a cleanup function; implement that cleanup function to run
the same teardown commands (run_nemoclaw "$SANDBOX_NAME" destroy, run_nemoclaw
"$OTHER_SANDBOX_NAME" destroy, openshell sandbox delete for both names,
openshell forward stop 18789, openshell gateway destroy -g nemoclaw, and rm -f
"$SESSION_FILE") and ensure the trap is idempotent (safe to call multiple times)
so all exit paths no longer leak sandbox/gateway/session artifacts.

In `@test/e2e/test-onboard-resume.sh`:
- Around line 274-278: The test currently uses grep against $REGISTRY which can
match incidental substrings; change the assertion to parse the registry JSON and
verify a real sandbox entry exists. Replace the grep check with a jq-based check
that reads $REGISTRY and tests for an object with the sandbox name (e.g. jq
'any(.sandboxes[]; .name == env.SANDBOX_NAME)' or similar), and use that jq exit
status to call pass "Registry contains resumed sandbox entry" or fail "Registry
does not contain resumed sandbox entry" while keeping the same variables
($REGISTRY, $SANDBOX_NAME) and pass/fail calls.
- Around line 64-79: Define a reusable cleanup function (e.g., cleanup) that
runs the existing teardown steps (calling run_nemoclaw "$SANDBOX_NAME" destroy,
openshell sandbox delete "$SANDBOX_NAME", openshell forward stop 18789,
openshell gateway destroy -g nemoclaw and rm -f "$SESSION_FILE") and install it
as an EXIT trap near the top of the script so it always runs even if the script
aborts early; ensure the trap is registered immediately after initializing
SANDBOX_NAME/SESSION_FILE (and before any early-exit branches) and make the
cleanup function idempotent/silent on missing resources so repeated runs won’t
error.

---

Nitpick comments:
In `@test/e2e/test-onboard-repair.sh`:
- Around line 267-277: The test currently hardcodes expected provider/model
strings when asserting on provider_conflict_output; instead parse the actual
recorded values from the session (e.g., read the session JSON used in the test)
and build the expected substrings dynamically, then assert that
provider_conflict_output contains those derived messages. Concretely, in
test/e2e/test-onboard-repair.sh replace the static grep literals with values
extracted from the session file (the same file used to record resumable state)
and reference those variables when checking provider_conflict_output and model
conflict messages so the assertions follow the recorded session values rather
than fixed defaults.

In `@test/e2e/test-onboard-resume.sh`:
- Around line 225-247: The tests currently assert exact "[x/7]" step prefixes in
the resume_output (the grep checks for "\[1/7\] Preflight checks", "\[2/7\]
Starting OpenShell gateway", "\[3/7\] Creating sandbox", "\[5/7\] Setting up
inference provider") which is fragile; update each grep to match the semantic
message without the hardcoded step numbers (e.g., grep for "Preflight checks",
"Starting OpenShell gateway", "Creating sandbox", "Setting up inference
provider" or use a regex that allows an optional "[*/*]" prefix) so the
assertions still verify the intended behavior using resume_output but no longer
fail on benign reordering/renumbering of steps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a4191510-d125-41e7-88d8-a61fe074532f

📥 Commits

Reviewing files that changed from the base of the PR and between a4e2821 and 9c73c7f.

📒 Files selected for processing (3)
  • scripts/debug.sh
  • test/e2e/test-onboard-repair.sh
  • test/e2e/test-onboard-resume.sh

Comment on lines +63 to +80
SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"
OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"
SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
RESTORE_API_KEY="${NVIDIA_API_KEY:-}"

# ══════════════════════════════════════════════════════════════════
# Phase 0: Pre-cleanup
# ══════════════════════════════════════════════════════════════════
section "Phase 0: Pre-cleanup"
info "Destroying any leftover sandbox/gateway from previous runs..."
run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
openshell forward stop 18789 2>/dev/null || true
openshell gateway destroy -g nemoclaw 2>/dev/null || true
rm -f "$SESSION_FILE"
pass "Pre-cleanup complete"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add guaranteed teardown on early exit paths.

There are multiple exit 1 branches before final cleanup, so failed runs can leak sandbox/gateway/session artifacts and poison subsequent runs.

💡 Suggested hardening (trap-backed cleanup)
 SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"
 OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"
 SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
 RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
+
+cleanup() {
+  run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
+  run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
+  openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
+  openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
+  openshell forward stop 18789 2>/dev/null || true
+  openshell gateway destroy -g nemoclaw 2>/dev/null || true
+  rm -f "$SESSION_FILE"
+}
+
+trap cleanup EXIT
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"
OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"
SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
# ══════════════════════════════════════════════════════════════════
# Phase 0: Pre-cleanup
# ══════════════════════════════════════════════════════════════════
section "Phase 0: Pre-cleanup"
info "Destroying any leftover sandbox/gateway from previous runs..."
run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
openshell forward stop 18789 2>/dev/null || true
openshell gateway destroy -g nemoclaw 2>/dev/null || true
rm -f "$SESSION_FILE"
pass "Pre-cleanup complete"
SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-repair}"
OTHER_SANDBOX_NAME="${NEMOCLAW_OTHER_SANDBOX_NAME:-e2e-other}"
SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
cleanup() {
run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
openshell forward stop 18789 2>/dev/null || true
openshell gateway destroy -g nemoclaw 2>/dev/null || true
rm -f "$SESSION_FILE"
}
trap cleanup EXIT
# ══════════════════════════════════════════════════════════════════
# Phase 0: Pre-cleanup
# ══════════════════════════════════════════════════════════════════
section "Phase 0: Pre-cleanup"
info "Destroying any leftover sandbox/gateway from previous runs..."
run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
run_nemoclaw "$OTHER_SANDBOX_NAME" destroy 2>/dev/null || true
openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
openshell sandbox delete "$OTHER_SANDBOX_NAME" 2>/dev/null || true
openshell forward stop 18789 2>/dev/null || true
openshell gateway destroy -g nemoclaw 2>/dev/null || true
rm -f "$SESSION_FILE"
pass "Pre-cleanup complete"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-repair.sh` around lines 63 - 80, Add a trap-backed
guaranteed teardown so cleanup always runs on early exits: at the top of the
script (after SANDBOX_NAME/OTHER_SANDBOX_NAME/SESSION_FILE are set) register a
trap for EXIT and ERR that invokes a cleanup function; implement that cleanup
function to run the same teardown commands (run_nemoclaw "$SANDBOX_NAME"
destroy, run_nemoclaw "$OTHER_SANDBOX_NAME" destroy, openshell sandbox delete
for both names, openshell forward stop 18789, openshell gateway destroy -g
nemoclaw, and rm -f "$SESSION_FILE") and ensure the trap is idempotent (safe to
call multiple times) so all exit paths no longer leak sandbox/gateway/session
artifacts.

Comment on lines +64 to +79
SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-resume}"
SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
REGISTRY="$HOME/.nemoclaw/sandboxes.json"
RESTORE_API_KEY="${NVIDIA_API_KEY:-}"

# ══════════════════════════════════════════════════════════════════
# Phase 0: Pre-cleanup
# ══════════════════════════════════════════════════════════════════
section "Phase 0: Pre-cleanup"
info "Destroying any leftover sandbox/gateway from previous runs..."
run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
openshell forward stop 18789 2>/dev/null || true
openshell gateway destroy -g nemoclaw 2>/dev/null || true
rm -f "$SESSION_FILE"
pass "Pre-cleanup complete"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guarantee cleanup even when the test aborts early.

This script also exits before final cleanup in several branches, which can leave artifacts behind and destabilize later E2E runs.

💡 Suggested hardening (shared cleanup + EXIT trap)
 SANDBOX_NAME="${NEMOCLAW_SANDBOX_NAME:-e2e-resume}"
 SESSION_FILE="$HOME/.nemoclaw/onboard-session.json"
 REGISTRY="$HOME/.nemoclaw/sandboxes.json"
 RESTORE_API_KEY="${NVIDIA_API_KEY:-}"
+
+cleanup() {
+  run_nemoclaw "$SANDBOX_NAME" destroy 2>/dev/null || true
+  openshell sandbox delete "$SANDBOX_NAME" 2>/dev/null || true
+  openshell forward stop 18789 2>/dev/null || true
+  openshell gateway destroy -g nemoclaw 2>/dev/null || true
+  rm -f "$SESSION_FILE"
+}
+
+trap cleanup EXIT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` around lines 64 - 79, Define a reusable
cleanup function (e.g., cleanup) that runs the existing teardown steps (calling
run_nemoclaw "$SANDBOX_NAME" destroy, openshell sandbox delete "$SANDBOX_NAME",
openshell forward stop 18789, openshell gateway destroy -g nemoclaw and rm -f
"$SESSION_FILE") and install it as an EXIT trap near the top of the script so it
always runs even if the script aborts early; ensure the trap is registered
immediately after initializing SANDBOX_NAME/SESSION_FILE (and before any
early-exit branches) and make the cleanup function idempotent/silent on missing
resources so repeated runs won’t error.

Comment on lines +274 to +278
if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then
pass "Registry contains resumed sandbox entry"
else
fail "Registry does not contain resumed sandbox entry"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use structured JSON validation for registry assertions.

String grep on JSON can pass on incidental matches; parse the registry to verify an actual sandbox entry.

💡 Suggested assertion
-if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then
+if [ -f "$REGISTRY" ] && node -e '
+const fs = require("fs");
+const name = process.argv[1];
+const file = process.argv[2];
+const data = JSON.parse(fs.readFileSync(file, "utf8"));
+const entries = Array.isArray(data) ? data : Object.values(data || {});
+process.exit(entries.some((e) => e && e.name === name) ? 0 : 1);
+' "$SANDBOX_NAME" "$REGISTRY"; then
   pass "Registry contains resumed sandbox entry"
 else
   fail "Registry does not contain resumed sandbox entry"
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [ -f "$REGISTRY" ] && grep -q "$SANDBOX_NAME" "$REGISTRY"; then
pass "Registry contains resumed sandbox entry"
else
fail "Registry does not contain resumed sandbox entry"
fi
if [ -f "$REGISTRY" ] && node -e '
const fs = require("fs");
const name = process.argv[1];
const file = process.argv[2];
const data = JSON.parse(fs.readFileSync(file, "utf8"));
const entries = Array.isArray(data) ? data : Object.values(data || {});
process.exit(entries.some((e) => e && e.name === name) ? 0 : 1);
' "$SANDBOX_NAME" "$REGISTRY"; then
pass "Registry contains resumed sandbox entry"
else
fail "Registry does not contain resumed sandbox entry"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-onboard-resume.sh` around lines 274 - 278, The test currently
uses grep against $REGISTRY which can match incidental substrings; change the
assertion to parse the registry JSON and verify a real sandbox entry exists.
Replace the grep check with a jq-based check that reads $REGISTRY and tests for
an object with the sandbox name (e.g. jq 'any(.sandboxes[]; .name ==
env.SANDBOX_NAME)' or similar), and use that jq exit status to call pass
"Registry contains resumed sandbox entry" or fail "Registry does not contain
resumed sandbox entry" while keeping the same variables ($REGISTRY,
$SANDBOX_NAME) and pass/fail calls.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@install.sh`:
- Around line 106-133: print_done() currently prints full PATH/profile recovery
instructions when NEMOCLAW_READY_NOW=false and
NEMOCLAW_RECOVERY_EXPORT_DIR/_PROFILE are set, duplicating the same instructions
later in post_install_message(); remove the duplication by either (A) deleting
the blocks in print_done() that print the "%s$%s export PATH=..." and "source
..." lines and keep the detailed recovery instructions only in
post_install_message(), or (B) replace those blocks in print_done() with a
single brief line (e.g., "See post-install instructions below for PATH/profile
recovery") so post_install_message() remains the sole source of detailed
recovery steps; edit the print_done() function accordingly and reference
NEMOCLAW_RECOVERY_EXPORT_DIR and NEMOCLAW_RECOVERY_PROFILE conditionals when
making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 78dc209c-7d60-4d1e-85f9-0b6d8544b9fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9c73c7f and 8552e01.

📒 Files selected for processing (7)
  • bin/lib/onboard-session.js
  • bin/lib/onboard.js
  • install.sh
  • test/e2e/test-onboard-repair.sh
  • test/e2e/test-onboard-resume.sh
  • test/onboard-session.test.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (3)
  • test/onboard-session.test.js
  • bin/lib/onboard-session.js
  • test/e2e/test-onboard-resume.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/test-onboard-repair.sh
  • bin/lib/onboard.js

Comment on lines +106 to +133
if [[ "$ONBOARD_RAN" == true ]]; then
local sandbox_name
sandbox_name="$(resolve_default_sandbox_name)"
printf " ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"
printf " ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"
printf " %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"
elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then
printf " ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"
printf " ${C_DIM}Onboarding has not run yet.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
else
printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"
printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then
printf " %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR"
fi
if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then
printf " %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE"
fi
printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential duplicate recovery instructions.

When NEMOCLAW_READY_NOW=false and NEMOCLAW_RECOVERY_EXPORT_DIR is set, both print_done() (lines 126-132) and post_install_message() (lines 581-596) will print PATH recovery instructions, resulting in duplicate output for the user.

Consider either:

  1. Removing the recovery instructions from print_done() and relying solely on post_install_message(), or
  2. Having print_done() only print a brief message directing users to the instructions below.
🛠️ Proposed fix to avoid duplication
   else
     printf "  ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"
     printf "  ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"
     printf "\n"
-    printf "  ${C_GREEN}Next:${C_RESET}\n"
-    if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then
-      printf "  %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR"
-    fi
-    if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then
-      printf "  %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE"
-    fi
-    printf "  %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
+    printf "  ${C_DIM}See recovery instructions below.${C_RESET}\n"
   fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ "$ONBOARD_RAN" == true ]]; then
local sandbox_name
sandbox_name="$(resolve_default_sandbox_name)"
printf " ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"
printf " ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"
printf " %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"
elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then
printf " ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"
printf " ${C_DIM}Onboarding has not run yet.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
else
printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"
printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
if [[ -n "$NEMOCLAW_RECOVERY_EXPORT_DIR" ]]; then
printf " %s$%s export PATH=\"%s:\$PATH\"\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_EXPORT_DIR"
fi
if [[ -n "$NEMOCLAW_RECOVERY_PROFILE" ]]; then
printf " %s$%s source %s\n" "$C_GREEN" "$C_RESET" "$NEMOCLAW_RECOVERY_PROFILE"
fi
printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
fi
if [[ "$ONBOARD_RAN" == true ]]; then
local sandbox_name
sandbox_name="$(resolve_default_sandbox_name)"
printf " ${C_GREEN}Your OpenClaw Sandbox is live.${C_RESET}\n"
printf " ${C_DIM}Sandbox in, break things, and tell us what you find.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw %s connect\n" "$C_GREEN" "$C_RESET" "$sandbox_name"
printf " %ssandbox@%s$%s openclaw tui\n" "$C_GREEN" "$sandbox_name" "$C_RESET"
elif [[ "$NEMOCLAW_READY_NOW" == true ]]; then
printf " ${C_GREEN}NemoClaw CLI is ready in this shell.${C_RESET}\n"
printf " ${C_DIM}Onboarding has not run yet.${C_RESET}\n"
printf "\n"
printf " ${C_GREEN}Next:${C_RESET}\n"
printf " %s$%s nemoclaw onboard\n" "$C_GREEN" "$C_RESET"
else
printf " ${C_GREEN}NemoClaw CLI is installed.${C_RESET}\n"
printf " ${C_DIM}Onboarding did not run because this shell cannot resolve 'nemoclaw' yet.${C_RESET}\n"
printf "\n"
printf " ${C_DIM}See recovery instructions below.${C_RESET}\n"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@install.sh` around lines 106 - 133, print_done() currently prints full
PATH/profile recovery instructions when NEMOCLAW_READY_NOW=false and
NEMOCLAW_RECOVERY_EXPORT_DIR/_PROFILE are set, duplicating the same instructions
later in post_install_message(); remove the duplication by either (A) deleting
the blocks in print_done() that print the "%s$%s export PATH=..." and "source
..." lines and keep the detailed recovery instructions only in
post_install_message(), or (B) replace those blocks in print_done() with a
single brief line (e.g., "See post-install instructions below for PATH/profile
recovery") so post_install_message() remains the sole source of detailed
recovery steps; edit the print_done() function accordingly and reference
NEMOCLAW_RECOVERY_EXPORT_DIR and NEMOCLAW_RECOVERY_PROFILE conditionals when
making the change.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
bin/lib/onboard.js (2)

2569-2594: ⚠️ Potential issue | 🟠 Major

Hydrate stored credentials when resuming provider selection.

When resumeProviderSelection is true, credentialEnv is restored from the session (e.g., "NVIDIA_API_KEY"), but the actual secret value is never loaded into process.env. setupInference() at Line 2598 builds provider env from process.env[credentialEnv], which will be undefined for resumed remote-provider sessions unless the user re-exported the key in their current shell.

Proposed fix
   if (resumeProviderSelection) {
+    if (credentialEnv && !process.env[credentialEnv]) {
+      const savedCredential = getCredential(credentialEnv);
+      if (savedCredential) {
+        process.env[credentialEnv] = savedCredential;
+      }
+    }
     resumeStepMessage("provider selection", `${provider} / ${model}`);
   } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2569 - 2594, When resumeProviderSelection is
true you restore credentialEnv (the env var name) but not the secret value
itself, so before calling setupInference / continuing the flow set
process.env[credentialEnv] to the stored secret from the session (e.g.
session.steps.provider_selection.<credentialValueField>); update the resume
branch that uses resumeStepMessage to also read the provider_selection step
payload (the stored credential value) and assign it into
process.env[credentialEnv] so setupInference sees the secret when using
process.env[credentialEnv].

240-246: ⚠️ Potential issue | 🟠 Major

Use runOpenshell() for sandbox repair commands.

Similar to the previous issue, repairRecordedSandbox uses raw openshell commands via run() instead of the resolved-binary wrapper.

Proposed fix
 function repairRecordedSandbox(sandboxName) {
   if (!sandboxName) return;
   note(`  [resume] Cleaning up recorded sandbox '${sandboxName}' before recreating it.`);
-  run(`openshell forward stop 18789 2>/dev/null || true`, { ignoreError: true });
-  run(`openshell sandbox delete "${sandboxName}" 2>/dev/null || true`, { ignoreError: true });
+  runOpenshell(["forward", "stop", "18789"], { ignoreError: true });
+  runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
   registry.removeSandbox(sandboxName);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 240 - 246, The repairRecordedSandbox
function currently calls raw commands via run(...) — replace those calls with
the resolved-binary wrapper runOpenshell to ensure the correct openshell binary
is used: in repairRecordedSandbox, change run(`openshell forward stop 18789
2>/dev/null || true`, { ignoreError: true }) to use runOpenshell('forward',
['stop','18789'], { ignoreError: true }) (or equivalent API your runOpenshell
exposes), and change run(`openshell sandbox delete "${sandboxName}" 2>/dev/null
|| true`, { ignoreError: true }) to runOpenshell('sandbox', ['delete',
sandboxName], { ignoreError: true }); keep the note(...) and
registry.removeSandbox(sandboxName) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 233-238: The getSandboxReuseState function currently calls
runCapture to invoke openshell which can fail if openshell is in ~/.local/bin
and PATH wasn't reloaded; change both calls that set getOutput and listOutput to
use runCaptureOpenshell instead of runCapture so openshell is resolved reliably,
keeping the same arguments (including ignoreError: true) and then pass those
outputs into getSandboxStateFromOutputs as before; update references to
getOutput and listOutput in getSandboxReuseState accordingly.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 2569-2594: When resumeProviderSelection is true you restore
credentialEnv (the env var name) but not the secret value itself, so before
calling setupInference / continuing the flow set process.env[credentialEnv] to
the stored secret from the session (e.g.
session.steps.provider_selection.<credentialValueField>); update the resume
branch that uses resumeStepMessage to also read the provider_selection step
payload (the stored credential value) and assign it into
process.env[credentialEnv] so setupInference sees the secret when using
process.env[credentialEnv].
- Around line 240-246: The repairRecordedSandbox function currently calls raw
commands via run(...) — replace those calls with the resolved-binary wrapper
runOpenshell to ensure the correct openshell binary is used: in
repairRecordedSandbox, change run(`openshell forward stop 18789 2>/dev/null ||
true`, { ignoreError: true }) to use runOpenshell('forward', ['stop','18789'], {
ignoreError: true }) (or equivalent API your runOpenshell exposes), and change
run(`openshell sandbox delete "${sandboxName}" 2>/dev/null || true`, {
ignoreError: true }) to runOpenshell('sandbox', ['delete', sandboxName], {
ignoreError: true }); keep the note(...) and registry.removeSandbox(sandboxName)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0cbb97c6-c79d-4ab2-9330-5b2eb2d6e1b7

📥 Commits

Reviewing files that changed from the base of the PR and between 8552e01 and b5d4f85.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

@kjw3
Copy link
Copy Markdown
Contributor Author

kjw3 commented Mar 26, 2026

Superseded by the signed-commit replacement PR created from branch . Closing this one to keep review on the replacement branch.

@kjw3 kjw3 closed this Mar 26, 2026
@kjw3 kjw3 deleted the fix/446-installer-resiliency branch March 26, 2026 05:25
@cv
Copy link
Copy Markdown
Contributor

cv commented Mar 27, 2026

New PR: #961

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Getting Started Use this label to identify setup, installation, or onboarding issues. good first issue Good for newcomers priority: medium Issue that should be addressed in upcoming releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants