* fix: improve gateway lifecycle recovery
* docs: fix readme markdown list spacing
* fix: tighten gateway lifecycle review follow-ups
* fix: simplify tokenized control ui output
* fix: restore chat route in control ui urls
* refactor: simplify ansi stripping in onboard
* fix: shorten control ui url output
* fix: move control ui below cli next steps
Fixes NVIDIA#949
Co-authored-by: KJ <kejones@nvidia.com>
📝 Walkthrough
The changes introduce gateway health checking and intelligent reuse logic to optimize NemoClaw onboarding. New helper functions analyze gateway connectivity and metadata to decide whether to reuse an existing gateway or rebuild it. Sandbox commands now validate gateway liveness before executing, supported by reconciliation logic that classifies gateway and sandbox states and drives recovery flows.
Changes
Sequence Diagrams
sequenceDiagram
participant Client as CLI Client
participant Preflight as preflight()
participant Status as openshell status
participant GwInfo as openshell gateway info
participant Selector as openshell gateway select
participant Dashboard as Dashboard Server
Client->>Preflight: startGateway(_gpu)
Preflight->>Status: Query gateway status
Status-->>Preflight: statusOutput
Preflight->>GwInfo: Query gateway metadata
GwInfo-->>Preflight: gwInfoOutput
Preflight->>Preflight: isGatewayHealthy(statusOutput, gwInfoOutput)
alt Gateway is healthy
Preflight->>Selector: select nemoclaw
Selector-->>Preflight: ✓ selected
Preflight->>Preflight: Set OPENSHELL_GATEWAY env
Preflight-->>Client: ✓ Reuse existing gateway
else Gateway is stale/missing
Preflight->>Preflight: Destroy stale state
Preflight->>Dashboard: Stop port forward :18789
Preflight->>Preflight: startGateway() with full init
Preflight-->>Client: ✓ New gateway started
end
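The reuse-or-rebuild branch in the diagram above can be sketched as a pure function. This is a minimal illustration, not the code in bin/lib/onboard.js: `decideGatewayAction`, the `Gateway: <name>` status line format, and the returned action labels are assumptions made for the example.

```javascript
// Hypothetical sketch of the reuse-vs-rebuild decision.
// The output formats and action labels below are assumptions.
const GATEWAY_NAME = "nemoclaw";

// Remove CSI escape sequences (colours, cursor movement) before matching.
function stripAnsi(text = "") {
  return String(text).replace(/\x1b\[[0-9;]*[A-Za-z]/g, "");
}

// Assumed status format: a line such as "Gateway: nemoclaw".
function getActiveGatewayName(statusOutput = "") {
  const match = stripAnsi(statusOutput).match(/^\s*Gateway:\s+(\S+)/m);
  return match ? match[1] : "";
}

function decideGatewayAction(statusOutput = "", gwInfoOutput = "") {
  const status = stripAnsi(statusOutput);
  // "named" means gateway metadata for the named gateway exists at all.
  const named = stripAnsi(gwInfoOutput).includes(GATEWAY_NAME);
  const healthy =
    status.includes("Connected") &&
    getActiveGatewayName(statusOutput) === GATEWAY_NAME &&
    named;
  if (healthy) return "reuse";
  return named ? "destroy-and-recreate" : "create";
}

console.log(decideGatewayAction("Gateway: nemoclaw\nStatus: Connected", "Gateway: nemoclaw"));
```

Keeping the decision free of side effects makes it easy to unit-test with captured command output, separately from the code that actually runs `openshell`.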
sequenceDiagram
participant Client as CLI Client
participant SandboxCmd as sandboxConnect/sandboxStatus
participant Ensure as ensureLiveSandboxOrExit()
participant OpenshellGW as openshell gateway get
participant Registry as NemoClaw Registry
participant Reconcile as Gateway Reconciliation
participant Recover as Recovery Flow
Client->>SandboxCmd: Run sandbox command
SandboxCmd->>Ensure: Validate sandbox target
Ensure->>OpenshellGW: Query active gateway
OpenshellGW-->>Ensure: currentGateway
Ensure->>Registry: Look up registeredGateway
Registry-->>Ensure: gwState from registry
Ensure->>Reconcile: Classify state (healthy/unhealthy/missing)
alt Gateway healthy & matches target
Reconcile-->>Ensure: healthy_named ✓
Ensure-->>SandboxCmd: Proceed
else Gateway unreachable/mismatched
Reconcile->>Recover: Attempt recovery
Recover->>Recover: startGatewayForRecovery()
Recover-->>Ensure: recovery attempted
Ensure->>Reconcile: Re-classify after recovery
alt Recovery successful
Reconcile-->>Ensure: healthy_named ✓
else Recovery failed
Ensure-->>Client: ✗ Exit with guidance
end
else Sandbox missing from gateway
Reconcile->>Registry: Remove stale entry
Registry-->>Ensure: ✓ cleaned
Ensure-->>Client: ✗ Sandbox missing
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Review Summary by Qodo
Improve gateway lifecycle recovery and sandbox state reconciliation
Walkthroughs
Description
- Improve gateway lifecycle recovery with health checks and reuse logic
- Add comprehensive gateway state reconciliation for sandbox operations
- Refactor CLI sandbox commands to handle gateway errors gracefully
- Clean up staged build artifacts and fix ulimit ordering in startup script
Diagram
flowchart LR
A["Gateway Status Check"] -->|healthy| B["Reuse Existing Gateway"]
A -->|stale| C["Destroy & Recreate"]
A -->|missing| D["Create New Gateway"]
B --> E["Skip Port Check"]
C --> F["Verify Health"]
D --> F
F -->|healthy| G["Sandbox Ready"]
F -->|failed| H["Error Recovery"]
H -->|recoverable| I["Retry Gateway Start"]
H -->|unrecoverable| J["Exit with Guidance"]
File Changes
1. bin/lib/onboard.js
Code Review by Qodo
1. Stale gateway skips volume cleanup
-if command -v nemoclaw >/dev/null 2>&1; then
-  nemoclaw "$SANDBOX_A" destroy --yes 2>/dev/null || true
-  nemoclaw "$SANDBOX_B" destroy --yes 2>/dev/null || true
+if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
🟡 Medium e2e/test-double-onboard.sh:179
The pre-cleanup guard at line 179 uses `[ -x "$REPO_ROOT/bin/nemoclaw.js" ]` (executable), but the `NEMOCLAW_CMD` assignment at line 60 only checks `[ -f "$REPO_ROOT/bin/nemoclaw.js" ]` (exists). When `node` is available and the `.js` file exists without execute permission, `NEMOCLAW_CMD` is set correctly but the cleanup condition fails, skipping destruction of stale sandboxes and causing "Sandbox already exists" errors on subsequent runs. Change line 179 from `-x` to `-f` to match the condition at line 60.
-if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
+if [ -f "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
Greptile Summary
This upstream sync introduces gateway lifecycle awareness across the NemoClaw CLI. The primary change teaches
Key changes:
Confidence Score: 4/5
Safe to merge after fixing the ANSI stripping inconsistency in `isGatewayHealthy`; all other changes are well-tested and low-risk.
The PR is substantive and well covered by tests: 9 new unit tests, an improved e2e suite, and a corrected `--follow` flag. One targeted fix is needed: the `connected` check in `isGatewayHealthy()` must call `stripAnsi()` to match the approach used everywhere else in this PR. Without it, a coloured openshell status line can silently cause healthy-gateway teardown on every onboard run. The duplicate `stripAnsi` implementations are a maintenance concern but not a blocker.
bin/lib/onboard.js: specifically the `isGatewayHealthy` function's `connected` check around line 25.
| Filename | Overview |
|---|---|
| bin/lib/onboard.js | Adds gateway-reuse logic (isGatewayHealthy, getActiveGatewayName, stripAnsi), refactors startGateway into startGatewayWithOptions; connected check in isGatewayHealthy skips ANSI stripping — P1 logic issue that can cause healthy gateway teardown. |
| bin/nemoclaw.js | Adds rich gateway lifecycle state machine (recoverNamedGatewayRuntime, getReconciledSandboxGatewayState, ensureLiveSandboxOrExit); sandbox connect/status made async; duplicates stripAnsi/getActiveGatewayName with a narrower regex than onboard.js. |
| test/cli.test.js | Adds 9 new integration tests covering gateway lifecycle states, ANSI-decorated errors, stale registry cleanup, and --follow log streaming; thorough and well-structured. |
| test/e2e/test-double-onboard.sh | Rewrites e2e test to use a local fake OpenAI endpoint, adds phases for gateway reuse, multi-sandbox coexistence, stale registry reconciliation, and lifecycle response after gateway stop. |
| test/onboard.test.js | Adds unit tests for isGatewayHealthy and gateway-reuse path in startGateway; no ANSI test cases for isGatewayHealthy, leaving the onboard.js ANSI inconsistency undetected. |
| scripts/clean-staged-tree.sh | New helper script that removes Python build artefacts (.venv, .pytest_cache, pycache) from a staged blueprint directory before sandbox build. |
| scripts/nemoclaw-start.sh | Swaps ulimit order to set soft limit before hard limit; functionally equivalent for the 512-process cap use-case. |
| scripts/setup.sh | Calls clean-staged-tree.sh after copying the nemoclaw-blueprint to the build context; straightforward and low-risk. |
| test/gateway-cleanup.test.js | Updates assertion to match startGatewayWithOptions refactor; lowers expected destroyGateway call count from ≥3 to ≥2, consistent with the new conditional stale-cleanup logic. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[nemoclaw onboard / connect / status] --> B[getReconciledSandboxGatewayState]
B --> C[getSandboxGatewayState\nopenshell sandbox get]
C -->|present| D[✅ Return present]
C -->|missing\nNotFound| E[✅ Return missing\nremove stale registry]
C -->|gateway_error\nConnection refused /\nhandshake failed /\nauth token missing| F[recoverNamedGatewayRuntime]
F --> G[getNamedGatewayLifecycleState]
G -->|healthy_named| H[Return recovered=true]
G -->|named_unhealthy\nnamed_unreachable\nconnected_other| I[startGatewayForRecovery\nexitOnFailure=false]
G -->|missing_named| J[gateway select nemoclaw\nre-check state]
I -->|success or throws| K[gateway select nemoclaw\nre-check state]
K -->|healthy_named| H
K -->|still unhealthy| L[Recovery failed]
J -->|still missing| L
H --> M[Retry getSandboxGatewayState]
M -->|present / missing| N[✅ Return with recoveredGateway=true]
M -->|handshake failed| O[Return identity_drift]
L --> Q[getNamedGatewayLifecycleState\nclassify final state]
Q -->|No gateway configured| R[gateway_missing_after_restart]
Q -->|Connection refused| S[gateway_unreachable_after_restart]
Q -->|other| T[Return gateway_error\ngatewayRecoveryFailed=true]
D & E & N & O & R & S & T --> U[sandboxConnect / sandboxStatus\nprint guidance or proceed]
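The final classification step at the bottom of the flowchart (nodes Q, R, S, T) can be sketched as follows. This is an illustration only: `classifyFailedRecovery` is a hypothetical name, and the matched phrases are the ones quoted in the flowchart labels, not verified openshell output.

```javascript
// Hypothetical helper mirroring the flowchart's final-state labels.
function stripAnsi(text = "") {
  return String(text).replace(/\x1b\[[0-9;]*[A-Za-z]/g, "");
}

// Map the status text observed after a failed recovery to a final state.
function classifyFailedRecovery(statusOutput = "") {
  const clean = stripAnsi(statusOutput);
  if (/No gateway configured/i.test(clean)) {
    return { state: "gateway_missing_after_restart" };
  }
  if (/Connection refused/i.test(clean)) {
    return { state: "gateway_unreachable_after_restart" };
  }
  // Anything else stays a generic gateway error, flagged as a failed recovery.
  return { state: "gateway_error", gatewayRecoveryFailed: true };
}

console.log(classifyFailedRecovery("Error: Connection refused").state);
```

Distinct final states let `sandboxConnect` and `sandboxStatus` print guidance specific to each failure instead of one generic error.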
Comments Outside Diff (2)
- bin/lib/onboard.js, line 25-29
  `isGatewayHealthy` skips ANSI stripping on the `connected` check
  `getActiveGatewayName()` correctly calls `stripAnsi()` before pattern-matching, but the `connected` flag reads the raw `statusOutput` without stripping. If openshell emits a coloured status line like `Status: \x1b[32mConnected\x1b[0m`, `statusOutput.includes("Connected")` returns `false` even though the gateway is live.
  This matters because `isGatewayHealthy()` gates the "reuse vs destroy" decision in both `preflight()` and `startGatewayWithOptions()`. A false negative funnels execution into the `else if (hasStaleGateway(gwInfo))` branch, which destroys a perfectly healthy gateway.
  The companion implementation in `nemoclaw.js` (`getNamedGatewayLifecycleState()`) already applies `stripAnsi()` before the Connected check; this one should do the same.
- bin/nemoclaw.js, line 301-313
  Divergent `stripAnsi` / `getActiveGatewayName` duplicates across modules
  `stripAnsi()` and `getActiveGatewayName()` are now defined in both `bin/lib/onboard.js` and `bin/nemoclaw.js` with subtly different regex patterns:
  - `onboard.js`: `/\x1b\[[0-9;]*[A-Za-z]/g` strips any CSI sequence (cursor movement, erase, etc.)
  - `nemoclaw.js`: `/\x1b\[[0-9;]*m/g` strips only SGR (colour/style) sequences
  This means gateway-state matching in the two files can diverge if openshell emits non-SGR escape sequences. Consider extracting both helpers to `bin/lib/ansi.js` (or `bin/lib/openshell-util.js`) and importing from both call-sites to keep the behaviour consistent.
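To make the divergence concrete, the sketch below applies the two quoted patterns to the same input. The sample string is invented for illustration (an erase-line sequence followed by a bold label); whether openshell ever emits such output is an assumption, as is the `Gateway:` line format.

```javascript
// The two patterns quoted in the review comment.
const CSI_PATTERN = /\x1b\[[0-9;]*[A-Za-z]/g; // onboard.js: any CSI sequence
const SGR_PATTERN = /\x1b\[[0-9;]*m/g; // nemoclaw.js: colour/style only

const stripCsi = (text) => String(text).replace(CSI_PATTERN, "");
const stripSgr = (text) => String(text).replace(SGR_PATTERN, "");

// Hypothetical output: erase-line (ESC[2K), then a bold "Gateway:" label.
const sample = "\x1b[2K\x1b[1mGateway:\x1b[0m nemoclaw";

// Assumed matcher for the active gateway line.
const gatewayLine = /^Gateway:\s+(\S+)/m;

console.log(gatewayLine.test(stripCsi(sample))); // true
console.log(gatewayLine.test(stripSgr(sample))); // false: ESC[2K is left in place
```

A single shared helper, as the comment suggests, would remove the possibility of the two call-sites disagreeing on the same output.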
Prompt To Fix All With AI
This is a comment left during a code review.
Path: bin/lib/onboard.js
Line: 25-29
Comment:
**`isGatewayHealthy` skips ANSI stripping on the `connected` check**
`getActiveGatewayName()` correctly calls `stripAnsi()` before pattern-matching, but the `connected` flag reads the raw `statusOutput` without stripping. If openshell emits a coloured status line like `Status: \x1b[32mConnected\x1b[0m`, `statusOutput.includes("Connected")` returns `false` even though the gateway is live.
This matters because `isGatewayHealthy()` gates the "reuse vs destroy" decision in both `preflight()` and `startGatewayWithOptions()`. A false `false` funnels execution into the `else if (hasStaleGateway(gwInfo))` branch, which destroys a perfectly healthy gateway.
The companion implementation in `nemoclaw.js` (`getNamedGatewayLifecycleState()`) already applies `stripAnsi()` before the Connected check — this one should do the same.
```suggestion
function isGatewayHealthy(statusOutput = "", gwInfoOutput = "") {
const cleanStatus = stripAnsi(statusOutput);
const connected = typeof cleanStatus === "string" && cleanStatus.includes("Connected");
const activeGateway = getActiveGatewayName(statusOutput);
return connected && activeGateway === GATEWAY_NAME && hasStaleGateway(gwInfoOutput);
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: bin/nemoclaw.js
Line: 301-313
Comment:
**Divergent `stripAnsi` / `getActiveGatewayName` duplicates across modules**
`stripAnsi()` and `getActiveGatewayName()` are now defined in both `bin/lib/onboard.js` and `bin/nemoclaw.js` with subtly different regex patterns:
- `onboard.js`: `/\x1b\[[0-9;]*[A-Za-z]/g` — strips any CSI sequence (cursor movement, erase, etc.)
- `nemoclaw.js`: `/\x1b\[[0-9;]*m/g` — strips only SGR (colour/style) sequences
This means gateway-state matching in the two files can diverge if openshell emits non-SGR escape sequences. Consider extracting both helpers to `bin/lib/ansi.js` (or `bin/lib/openshell-util.js`) and importing from both call-sites to keep the behaviour consistent.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 158-170: The regex checks in getSandboxGatewayState are running
against raw, ANSI-decorated output so matches like "NotFound" or "Missing
gateway auth token" can be missed; update getSandboxGatewayState (which calls
captureOpenshell) to strip ANSI escape sequences (and optionally normalize case)
before running the regex tests—e.g., create a cleanedOutput by removing ANSI
control sequences from result.output (keep original output for return) and run
the three regex tests against cleanedOutput so gateway_error/missing/present are
correctly classified.
- Around line 209-210: The code prematurely returns when lookup.state ===
"missing"; change the logic so NotFound/“missing” is only treated as
authoritative after confirming NemoClaw is the selected and healthy gateway
(e.g., check the active gateway via the selection function you use—such as
getActiveGateway()/selectGateway()—and/or an isGatewayHealthy() check) before
returning lookup; in practice, wrap the existing "if (lookup.state ===
'missing') return lookup;" with a condition that currentGateway === 'nemoclaw'
&& isNemoClawHealthy() (or move the missing-state handling to after gateway
selection/health verification) so other gateways' legitimate "not found"
responses don't cause deletion by status/connect.
In `@scripts/clean-staged-tree.sh`:
- Around line 14-15: The current cleanup only removes top-level .venv and
.pytest_cache; update the cleanup to recursively find and remove any directories
named ".venv" or ".pytest_cache" under the staged tree by replacing the
non-recursive rm invocation with a find-based removal that searches
"$target_dir" for directories matching those names and deletes them (preserve
the existing 2>/dev/null || true behavior to ignore errors); reference the
variables and patterns in the script (target_dir, ".venv", ".pytest_cache", and
the existing find usage for "__pycache__") to locate and change the commands.
In `@test/e2e/test-double-onboard.sh`:
- Around line 179-181: The CLI availability check uses "-x" which requires the
exec bit but NEMOCLAW_CMD treats a present file as usable; change the predicate
to match NEMOCLAW_CMD by checking file existence (-f) or the `command -v` path.
Replace occurrences of [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v
nemoclaw >/dev/null 2>&1 with [ -f "$REPO_ROOT/bin/nemoclaw.js" ] || command -v
nemoclaw >/dev/null 2>&1 in both the Phase 0 block (where run_nemoclaw
"$SANDBOX_A" destroy … and run_nemoclaw "$SANDBOX_B" destroy … are called) and
the similar checks around the later lines (208-213) so the same predicate is
used everywhere.
In `@test/onboard.test.js`:
- Around line 521-526: The test currently asserts commands.length === 1 which is
brittle; change the assertions around the parsed commands (the commands
variable) to stop requiring exact length and instead assert that at least one
command matches the select for "nemoclaw" and that none match gateway 'destroy'
or 'start'. Concretely, replace or remove assert.equal(commands.length, 1) and
use an existence check (e.g., assert.ok(commands.some(c => /gateway' 'select'
'nemoclaw'/.test(c)))) while keeping the two assert.doesNotMatch checks to
ensure no 'destroy'/'start' commands are present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5b328ffc-b5a0-4d0f-8af1-171c47fa2aca
📒 Files selected for processing (9)
- bin/lib/onboard.js
- bin/nemoclaw.js
- scripts/clean-staged-tree.sh
- scripts/nemoclaw-start.sh
- scripts/setup.sh
- test/cli.test.js
- test/e2e/test-double-onboard.sh
- test/gateway-cleanup.test.js
- test/onboard.test.js
function getSandboxGatewayState(sandboxName) {
  const result = captureOpenshell(["sandbox", "get", sandboxName]);
  const output = result.output;
  if (result.status === 0) {
    return { state: "present", output };
  }
  if (/NotFound|sandbox not found/i.test(output)) {
    return { state: "missing", output };
  }
  if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(output)) {
    return { state: "gateway_error", output };
  }
  return { state: "unknown_error", output };
Classify sandbox get failures on stripped output.
These regexes inspect raw output. ANSI-decorated NotFound, transport error, or Missing gateway auth token currently fall through to unknown_error, which skips both recovery and stale-entry reconciliation.
Suggested fix
function getSandboxGatewayState(sandboxName) {
const result = captureOpenshell(["sandbox", "get", sandboxName]);
const output = result.output;
+ const cleanOutput = stripAnsi(output);
if (result.status === 0) {
return { state: "present", output };
}
- if (/NotFound|sandbox not found/i.test(output)) {
+ if (/NotFound|sandbox not found/i.test(cleanOutput)) {
return { state: "missing", output };
}
- if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(output)) {
+ if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(cleanOutput)) {
return { state: "gateway_error", output };
}
return { state: "unknown_error", output };
}
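A self-contained version of the stripped-output classification is sketched below. It assumes a pure helper, here called `classifySandboxGet`, that takes the exit status and captured output instead of calling `captureOpenshell`. The decorated sample is invented: it places a style code inside the phrase, which is the case where matching against raw output fails.

```javascript
// Remove CSI escape sequences before pattern-matching.
function stripAnsi(text = "") {
  return String(text).replace(/\x1b\[[0-9;]*[A-Za-z]/g, "");
}

// Hypothetical pure classifier: exit status and captured output in, state out.
function classifySandboxGet(status, output = "") {
  const clean = stripAnsi(output);
  if (status === 0) return "present";
  if (/NotFound|sandbox not found/i.test(clean)) return "missing";
  if (
    /transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(
      clean
    )
  ) {
    return "gateway_error";
  }
  return "unknown_error";
}

// Hypothetical decorated output: a bold code splits "Connection refused".
const decorated = "\x1b[31mError:\x1b[0m Connection \x1b[1mrefused\x1b[0m";

console.log(/Connection refused/i.test(decorated)); // false on raw output
console.log(classifySandboxGet(1, decorated)); // "gateway_error"
```

Returning the original `output` alongside the state, as the suggested fix does, keeps the coloured text available for display while the stripped copy is used only for matching.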
if (lookup.state === "missing") {
  return lookup;
Don’t treat NotFound as authoritative before selecting NemoClaw.
This returns missing before you verify that nemoclaw is the active/healthy gateway. If another gateway is selected, openshell sandbox get <name> can legitimately say “not found”, and status/connect will delete a sandbox that still exists on NemoClaw.
Suggested fix
- if (lookup.state === "missing") {
- return lookup;
- }
+ if (lookup.state === "missing") {
+ const lifecycle = getNamedGatewayLifecycleState();
+ if (lifecycle.state !== "healthy_named") {
+ const recovery = await recoverNamedGatewayRuntime();
+ const retried = getSandboxGatewayState(sandboxName);
+ if (retried.state !== "missing") {
+ return { ...retried, recoveredGateway: recovery.recovered, recoveryVia: recovery.via || null };
+ }
+ }
+ return lookup;
+ }
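One way to express the guard described in this comment is a small decision helper. The function name and the returned labels are hypothetical; only the state names `missing` and `healthy_named` come from the thread. It is stricter than the suggested diff in one respect: it never treats `missing` as final until the named gateway has been confirmed healthy.

```javascript
// Hypothetical sketch: decide what to do with a sandbox lookup result,
// given the lifecycle state of the named gateway.
function nextStepForLookup(lookupState, lifecycleState) {
  // Anything other than "missing" is handled by the existing code paths.
  if (lookupState !== "missing") return "use_lookup";
  // "missing" is only trusted when the named gateway is selected and healthy.
  return lifecycleState === "healthy_named"
    ? "remove_stale_entry"
    : "recover_then_retry";
}

console.log(nextStepForLookup("missing", "connected_other"));
```

With this shape, a "not found" reply from some other selected gateway leads to a recovery attempt and a retry, not to deleting the registry entry.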
rm -rf "$target_dir/.venv" "$target_dir/.pytest_cache"
find "$target_dir" -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
Make .venv and .pytest_cache cleanup recursive too.
Right now only the root-level copies are removed. Nested virtualenvs and pytest caches under the staged tree will still get copied into the Docker context.
Suggested fix
-rm -rf "$target_dir/.venv" "$target_dir/.pytest_cache"
-find "$target_dir" -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
+find "$target_dir" -type d \( -name .venv -o -name .pytest_cache -o -name __pycache__ \) -prune -exec rm -rf -- {} + 2>/dev/null || true
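For readers more at home in Node than in `find`, the same recursive pruning can be sketched in JavaScript. This is an illustration of the technique only; the script under review is shell, and `cleanStagedTree` here is a hypothetical function, not part of the repository.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Directory names to prune anywhere under the staged tree.
const PRUNE_NAMES = new Set([".venv", ".pytest_cache", "__pycache__"]);

// Walk targetDir depth-first. Matching directories are deleted without
// descending into them (the equivalent of find's -prune); other
// directories are searched recursively. Returns the removed paths.
function cleanStagedTree(targetDir) {
  const removed = [];
  for (const entry of fs.readdirSync(targetDir, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue; // regular files and symlinks are skipped
    const fullPath = path.join(targetDir, entry.name);
    if (PRUNE_NAMES.has(entry.name)) {
      fs.rmSync(fullPath, { recursive: true, force: true });
      removed.push(fullPath);
    } else {
      removed.push(...cleanStagedTree(fullPath));
    }
  }
  return removed;
}
```

Calling `cleanStagedTree(stagedDir)` on a staged copy removes nested virtualenvs and caches as well as the root-level ones, which is the behaviour the comment asks for.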
if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
  run_nemoclaw "$SANDBOX_A" destroy --yes 2>/dev/null || true
  run_nemoclaw "$SANDBOX_B" destroy --yes 2>/dev/null || true
Align the CLI availability checks with NEMOCLAW_CMD.
NEMOCLAW_CMD runs bin/nemoclaw.js via node when the file merely exists, but these checks require the exec bit. In a repo checkout without that bit, Phase 1 fails early and Phase 0 skips cleanup even though the command is usable.
Suggested fix
-if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
+if { command -v node >/dev/null 2>&1 && [ -f "$REPO_ROOT/bin/nemoclaw.js" ]; } || command -v nemoclaw >/dev/null 2>&1; then
Apply the same predicate in both Phase 0 and Phase 1.
Also applies to: 208-213
assert.equal(result.status, 0, result.stderr);
const commands = JSON.parse(result.stdout.trim().split("\n").pop());
assert.equal(commands.length, 1);
assert.match(commands[0], /gateway' 'select' 'nemoclaw'/);
assert.doesNotMatch(commands[0], /gateway' 'destroy'/);
assert.doesNotMatch(commands[0], /gateway' 'start'/);
Relax the one-command assertion in the gateway-reuse test.
The healthy reuse path currently stops the 18789 forward before selecting nemoclaw, so commands.length === 1 will fail even when reuse is working.
Suggested fix
- assert.equal(commands.length, 1);
- assert.match(commands[0], /gateway' 'select' 'nemoclaw'/);
- assert.doesNotMatch(commands[0], /gateway' 'destroy'/);
- assert.doesNotMatch(commands[0], /gateway' 'start'/);
+ assert.ok(commands.some((command) => /gateway' 'select' 'nemoclaw'/.test(command)));
+ assert.doesNotMatch(commands.join("\n"), /gateway' 'destroy'/);
+ assert.doesNotMatch(commands.join("\n"), /gateway' 'start'/);
TSavo left a comment
Upstream sync. Unit tests pass. Qodo findings are upstream edge cases — would conflict on next sync if fixed here. No branch protection.
Automated upstream sync
Rebased our WOPR sidecar commits onto upstream/main (NVIDIA/NemoClaw).
What this does
Verify
Note
Add gateway lifecycle recovery and sandbox reconciliation to NemoClaw CLI
- Adds `isGatewayHealthy`, `startGatewayWithOptions`, and `startGatewayForRecovery` to bin/lib/onboard.js so preflight and gateway start reuse an already-healthy named gateway instead of always recreating it.
- Adds `getNamedGatewayLifecycleState`, `recoverNamedGatewayRuntime`, and `getReconciledSandboxGatewayState` to bin/nemoclaw.js to classify gateway state, attempt recovery, and reconcile stale registry entries before connecting or reporting status.
- `nemoclaw <name> connect` now validates live sandbox presence via `ensureLiveSandboxOrExit` and starts the port-forward for 18789 reliably before connecting.
- `nemoclaw <name> status` reflects gateway lifecycle conditions and removes stale registry entries only when the sandbox is confirmed missing.
- Fixes `--follow` flag passthrough in `nemoclaw <name> logs` (was incorrectly passed as `--tail`).
- Removes `.venv`, `.pytest_cache`, and `__pycache__` from the staged build context.

📊 Macroscope summarized f61092b. 9 files reviewed, 2 issues evaluated, 1 issue filtered, 1 comment posted
🗂️ Filtered Issues
bin/lib/onboard.js — 0 comments posted, 1 evaluated, 1 filtered
When `selected.key === "nim-local"` and `models.length === 0`, the code logs `"No NIM models fit your GPU VRAM. Falling back to cloud API."` but then immediately executes `break` at line 1869 without actually setting up the cloud API fallback. The function returns with `model = null`, `preferredInferenceApi = null`, and the default provider settings, but no API key validation or proper cloud setup occurs. The message promises a fallback that never happens. [ Out of scope ]
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests