Skip to content

sync: rebase on upstream (2026-03-26)#7

Merged
TSavo merged 3 commits intomainfrom
sync/upstream-2026-03-26
Mar 26, 2026
Merged

sync: rebase on upstream (2026-03-26)#7
TSavo merged 3 commits intomainfrom
sync/upstream-2026-03-26

Conversation

@AgentWOPR
Copy link

@AgentWOPR AgentWOPR commented Mar 26, 2026

Automated upstream sync

Rebased our WOPR sidecar commits onto upstream/main (NVIDIA/NemoClaw).

What this does

  • Pulls in latest upstream changes (security fixes, features, CI improvements)
  • Resolves any rebase conflicts (preserving wopr/ sidecar)
  • Verifies sidecar + Dockerfile integrity

Verify

  • Build passes
  • wopr/sidecar.js intact
  • Dockerfile includes sidecar setup

Open with Devin

Note

Add gateway lifecycle recovery and sandbox reconciliation to NemoClaw CLI

  • Adds isGatewayHealthy, startGatewayWithOptions, and startGatewayForRecovery to bin/lib/onboard.js so preflight and gateway start reuse an already-healthy named gateway instead of always recreating it.
  • Adds getNamedGatewayLifecycleState, recoverNamedGatewayRuntime, and getReconciledSandboxGatewayState to bin/nemoclaw.js to classify gateway state, attempt recovery, and reconcile stale registry entries before connecting or reporting status.
  • nemoclaw <name> connect now validates live sandbox presence via ensureLiveSandboxOrExit and starts the port-forward for 18789 reliably before connecting.
  • nemoclaw <name> status reflects gateway lifecycle conditions and removes stale registry entries only when the sandbox is confirmed missing.
  • Fixes --follow flag passthrough in nemoclaw <name> logs (was incorrectly passed as --tail).
  • Adds scripts/clean-staged-tree.sh and calls it during onboarding and setup to strip .venv, .pytest_cache, and __pycache__ from the staged build context.
📊 Macroscope summarized f61092b. 9 files reviewed, 2 issues evaluated, 1 issue filtered, 1 comment posted

🗂️ Filtered Issues

bin/lib/onboard.js — 0 comments posted, 1 evaluated, 1 filtered
  • line 1869: When selected.key === "nim-local" and models.length === 0, the code logs "No NIM models fit your GPU VRAM. Falling back to cloud API." but then immediately executes break at line 1869 without actually setting up the cloud API fallback. The function returns with model = null, preferredInferenceApi = null, and the default provider settings, but no API key validation or proper cloud setup occurs. The message promises a fallback that never happens. [ Out of scope ]

Summary by CodeRabbit

Release Notes

  • New Features

    • Gateway health detection now reuses existing healthy gateways instead of creating new ones, improving startup efficiency.
    • Enhanced sandbox status command with detailed reconciliation and recovery indicators for connection issues.
  • Bug Fixes

    • Improved stale gateway cleanup and recovery flow for more reliable reconnection after failures.
    • Fixed dashboard URL structure for more consistent access to the control interface.
  • Tests

    • Added comprehensive test coverage for gateway reuse, sandbox recovery, and lifecycle management scenarios.

kjw3 and others added 3 commits March 25, 2026 23:23
* fix: improve gateway lifecycle recovery

* docs: fix readme markdown list spacing

* fix: tighten gateway lifecycle review follow-ups

* fix: simplify tokenized control ui output

* fix: restore chat route in control ui urls

* refactor: simplify ansi stripping in onboard

* fix: shorten control ui url output

* fix: move control ui below cli next steps
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @AgentWOPR, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai
Copy link

coderabbitai bot commented Mar 26, 2026

📝 Walkthrough

Walkthrough

The changes introduce gateway health checking and intelligent reuse logic to optimize NemoClaw onboarding. New helper functions analyze gateway connectivity and metadata to decide whether to reuse existing gateways or rebuild them. Sandbox commands now validate gateway liveness before executing, supported by reconciliation logic that classifies gateway and sandbox states and drives recovery flows.

Changes

Cohort / File(s) Summary
Gateway health and reuse infrastructure
bin/lib/onboard.js
Added ANSI-stripping utilities and gateway health verification functions (getActiveGatewayName, isGatewayHealthy). Refactored preflight() to query gateway status and reuse healthy gateways. Introduced startGatewayWithOptions(_gpu, { exitOnFailure } = {}) to support reuse/failure handling options, with startGateway and startGatewayForRecovery as wrappers. Simplified NVIDIA Endpoints label and refactored dashboard URL generation to use single base URL with CONTROL_UI_PATH. Exported new public functions: isGatewayHealthy, preflight, startGateway, startGatewayForRecovery.
OpenShell binary resolution and sandbox validation
bin/nemoclaw.js
Added OpenShell binary resolution with getOpenshellBinary(), runOpenshell(), and captureOpenshell() wrappers. Implemented gateway/sandbox reconciliation using named-gateway lifecycle classification and live-state inference. Made sandbox commands async with gateway validation: sandboxConnect and sandboxStatus now check gateway liveness before executing. Replaced shell-command strings with argument-vector OpenShell calls. Added recovery flow via recoverNamedGatewayRuntime() and detailed lifecycle classifications in status output.
Build and runtime script utilities
scripts/clean-staged-tree.sh, scripts/nemoclaw-start.sh, scripts/setup.sh
New script clean-staged-tree.sh removes Python environment and cache directories (.venv, .pytest_cache, __pycache__) from staged build artifacts. Reordered ulimit configuration in nemoclaw-start.sh (hard limit moved after soft limit). Integrated clean-staged-tree.sh call into setup.sh after blueprint copy, with error suppression.
Test infrastructure and gateway reuse validation
test/onboard.test.js
Added import of isGatewayHealthy function. Introduced unit test for isGatewayHealthy validating Connected status, gateway name matching, and metadata presence. Added integration test for startGateway(null) verifying gateway reuse by asserting single gateway select command without destroy/start.
Enhanced CLI test framework
test/cli.test.js
Added fs and os imports. Introduced runWithEnv(args, env, timeout) helper with configurable environment and timeout. Extended test coverage with multiple scenarios: forward --follow flag, stale registry cleanup, gateway transport failures, runtime reattachment recovery, mismatched gateway detection, ANSI-decorated error text matching, and lifecycle guidance messages.
Gateway cleanup test alignment
test/gateway-cleanup.test.js
Updated test target from startGateway to startGatewayWithOptions. Changed assertions to verify conditional stale-gateway destruction and Docker-based destroy invocation. Lowered minimum destroyGateway() call count expectation from ≥3 to ≥2.
End-to-end lifecycle and recovery validation
test/e2e/test-double-onboard.sh
Shifted test model from failure validation (invalid API key) to success/recovery validation using local fake OpenAI endpoint. Added helper functions (registry_has, run_nemoclaw, run_onboard). Expanded test phases to verify gateway reuse ("Reusing existing NemoClaw gateway"), sandbox recreation, multi-sandbox coexistence, stale registry reconciliation, and explicit lifecycle guidance after gateway stop. Updated success criteria and cleanup logic.

Sequence Diagrams

sequenceDiagram
    participant Client as CLI Client
    participant Preflight as preflight()
    participant Status as openshell status
    participant GwInfo as openshell gateway info
    participant Selector as openshell gateway select
    participant Dashboard as Dashboard Server

    Client->>Preflight: startGateway(_gpu)
    Preflight->>Status: Query gateway status
    Status-->>Preflight: statusOutput
    Preflight->>GwInfo: Query gateway metadata
    GwInfo-->>Preflight: gwInfoOutput
    Preflight->>Preflight: isGatewayHealthy(statusOutput, gwInfoOutput)
    alt Gateway is healthy
        Preflight->>Selector: select nemoclaw
        Selector-->>Preflight: ✓ selected
        Preflight->>Preflight: Set OPENSHELL_GATEWAY env
        Preflight-->>Client: ✓ Reuse existing gateway
    else Gateway is stale/missing
        Preflight->>Preflight: Destroy stale state
        Preflight->>Dashboard: Stop port forward :18789
        Preflight->>Preflight: startGateway() with full init
        Preflight-->>Client: ✓ New gateway started
    end
Loading
sequenceDiagram
    participant Client as CLI Client
    participant SandboxCmd as sandboxConnect/sandboxStatus
    participant Ensure as ensureLiveSandboxOrExit()
    participant OpenshellGW as openshell gateway get
    participant Registry as NemoClaw Registry
    participant Reconcile as Gateway Reconciliation
    participant Recover as Recovery Flow

    Client->>SandboxCmd: Run sandbox command
    SandboxCmd->>Ensure: Validate sandbox target
    Ensure->>OpenshellGW: Query active gateway
    OpenshellGW-->>Ensure: currentGateway
    Ensure->>Registry: Look up registeredGateway
    Registry-->>Ensure: gwState from registry
    Ensure->>Reconcile: Classify state (healthy/unhealthy/missing)
    alt Gateway healthy & matches target
        Reconcile-->>Ensure: healthy_named ✓
        Ensure-->>SandboxCmd: Proceed
    else Gateway unreachable/mismatched
        Reconcile->>Recover: Attempt recovery
        Recover->>Recover: startGatewayForRecovery()
        Recover-->>Ensure: recovery attempted
        Ensure->>Reconcile: Re-classify after recovery
        alt Recovery successful
            Reconcile-->>Ensure: healthy_named ✓
        else Recovery failed
            Ensure-->>Client: ✗ Exit with guidance
        end
    else Sandbox missing from gateway
        Reconcile->>Registry: Remove stale entry
        Registry-->>Ensure: ✓ cleaned
        Ensure-->>Client: ✗ Sandbox missing
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hop hop, gateways now reuse with care,
No more rebuilding what's already there!
Health checks and recovery make the path clear,
From stale state to thriving, we engineer!
With tests to confirm each clever new way. 🎯

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'sync: rebase on upstream (2026-03-26)' is vague and generic, using non-descriptive terms that don't convey meaningful information about the actual changeset. Consider a more specific title that highlights the main technical changes (e.g., 'Add gateway health verification and sandbox reconciliation logic' or similar) instead of just indicating a sync/rebase operation.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/upstream-2026-03-26

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Copy link

Review Summary by Qodo

Improve gateway lifecycle recovery and sandbox state reconciliation

✨ Enhancement 🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Improve gateway lifecycle recovery with health checks and reuse logic
• Add comprehensive gateway state reconciliation for sandbox operations
• Refactor CLI sandbox commands to handle gateway errors gracefully
• Clean up staged build artifacts and fix ulimit ordering in startup script
Diagram
flowchart LR
  A["Gateway Status Check"] -->|healthy| B["Reuse Existing Gateway"]
  A -->|stale| C["Destroy & Recreate"]
  A -->|missing| D["Create New Gateway"]
  B --> E["Skip Port Check"]
  C --> F["Verify Health"]
  D --> F
  F -->|healthy| G["Sandbox Ready"]
  F -->|failed| H["Error Recovery"]
  H -->|recoverable| I["Retry Gateway Start"]
  H -->|unrecoverable| J["Exit with Guidance"]
Loading

Grey Divider

File Changes

1. bin/lib/onboard.js ✨ Enhancement +93/-31

Add gateway health checks and reuse logic

bin/lib/onboard.js


2. bin/nemoclaw.js ✨ Enhancement +322/-14

Implement gateway lifecycle recovery and reconciliation

bin/nemoclaw.js


3. test/cli.test.js 🧪 Tests +592/-2

Add comprehensive gateway lifecycle and recovery tests

test/cli.test.js


View more (6)
4. test/gateway-cleanup.test.js 🧪 Tests +9/-8

Update gateway cleanup test expectations

test/gateway-cleanup.test.js


5. test/onboard.test.js 🧪 Tests +90/-0

Add gateway health check and reuse scenario tests

test/onboard.test.js


6. test/e2e/test-double-onboard.sh 🧪 Tests +288/-109

Refactor E2E test with fake endpoint and lifecycle phases

test/e2e/test-double-onboard.sh


7. scripts/clean-staged-tree.sh ✨ Enhancement +15/-0

New script to clean Python cache from staged build

scripts/clean-staged-tree.sh


8. scripts/nemoclaw-start.sh 🐞 Bug fix +3/-3

Fix ulimit ordering - soft before hard

scripts/nemoclaw-start.sh


9. scripts/setup.sh ✨ Enhancement +1/-0

Call clean-staged-tree during sandbox build

scripts/setup.sh


Grey Divider

Qodo Logo

@qodo-code-review
Copy link

qodo-code-review bot commented Mar 26, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

Grey Divider


Remediation recommended

1. Stale gateway skips volume cleanup 🐞 Bug ⛯ Reliability
Description
In startGatewayWithOptions(), stale gateways are destroyed via openshell gateway destroy without
calling destroyGateway(), even though destroyGateway() documents that volume cleanup is required to
avoid corrupted cluster state breaking the next gateway start. This can cause the first
gateway-start attempt after a stale gateway to fail (then clean up volumes) and force the user to
rerun onboarding.
Code

bin/lib/onboard.js[R1354-1356]

+  if (hasStaleGateway(gwInfo)) {
+    runOpenshell(["gateway", "destroy", "-g", GATEWAY_NAME], { ignoreError: true });
+  }
Evidence
destroyGateway() explicitly states that openshell gateway destroy does not remove Docker volumes
and that leftover volumes can corrupt cluster state and break the next gateway start; however, the
stale-gateway cleanup path calls openshell gateway destroy directly and does not run the volume
cleanup step.

bin/lib/onboard.js[1333-1338]
bin/lib/onboard.js[1354-1356]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`startGatewayWithOptions()` currently handles a stale NemoClaw gateway by running `openshell gateway destroy -g nemoclaw` directly. The repo’s own `destroyGateway()` function documents that this does **not** remove the openshell-cluster Docker volumes and that leftover volumes can corrupt cluster state and break the next gateway start.

### Issue Context
This means the first start attempt after a stale gateway can still fail due to leftover volumes, and only then does the code call `destroyGateway()` (on failure), forcing users into a rerun loop.

### Fix Focus Areas
- bin/lib/onboard.js[1354-1356]
- bin/lib/onboard.js[1333-1338]

### Suggested change
Replace the direct `runOpenshell(["gateway","destroy",...])` stale cleanup with a call to `destroyGateway()` (or factor a helper that runs both gateway destroy and volume cleanup) so stale cleanup reliably removes volumes before attempting `gateway start`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. nproc ulimit not retried 🐞 Bug ⛯ Reliability
Description
The entrypoint sets the soft nproc limit before attempting to set the hard limit and never retries
the soft limit if the hard-limit change succeeds afterward. This can leave the container with an
unnecessarily low soft process limit (relative to the intended 512), causing avoidable process-spawn
failures under load.
Code

scripts/nemoclaw-start.sh[R21-26]

if ! ulimit -Su 512 2>/dev/null; then
  echo "[SECURITY] Could not set soft nproc limit (container runtime may restrict ulimit)" >&2
fi
+if ! ulimit -Hu 512 2>/dev/null; then
+  echo "[SECURITY] Could not set hard nproc limit (container runtime may restrict ulimit)" >&2
+fi
Evidence
The hardening block intends to set both soft and hard nproc limits to 512, but the script attempts
ulimit -Su 512 first and does not reattempt it after the hard limit is set, even though the log
message implies a best-effort hardening step rather than leaving a degraded configuration.

scripts/nemoclaw-start.sh[18-26]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/nemoclaw-start.sh` sets `ulimit -Su 512` before `ulimit -Hu 512` and does not retry the soft limit after the hard limit is updated. If raising the hard limit enables raising the soft limit, the current order can leave the runtime with a lower-than-intended soft nproc limit.

### Issue Context
The comments indicate this is a security hardening step that should be best-effort and robust across container runtimes.

### Fix Focus Areas
- scripts/nemoclaw-start.sh[18-26]

### Suggested change
Either:
1) Set hard first, then soft (preferred), or
2) If soft fails, attempt hard, then retry soft once.

Ensure the final effective soft limit is as close as possible to the intended value when the runtime permits it, while keeping the current non-blocking behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

if command -v nemoclaw >/dev/null 2>&1; then
nemoclaw "$SANDBOX_A" destroy --yes 2>/dev/null || true
nemoclaw "$SANDBOX_B" destroy --yes 2>/dev/null || true
if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium e2e/test-double-onboard.sh:179

The pre-cleanup guard at line 179 uses [ -x "$REPO_ROOT/bin/nemoclaw.js" ] (executable), but the NEMOCLAW_CMD assignment at line 60 only checks [ -f "$REPO_ROOT/bin/nemoclaw.js" ] (exists). When node is available and the .js file exists without execute permission, NEMOCLAW_CMD is set correctly but the cleanup condition fails, skipping destruction of stale sandboxes and causing "Sandbox already exists" errors on subsequent runs. Change line 179 from -x to -f to match the condition at line 60.

-if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
+if [ -f "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file test/e2e/test-double-onboard.sh around line 179:

The pre-cleanup guard at line 179 uses `[ -x "$REPO_ROOT/bin/nemoclaw.js" ]` (executable), but the `NEMOCLAW_CMD` assignment at line 60 only checks `[ -f "$REPO_ROOT/bin/nemoclaw.js" ]` (exists). When `node` is available and the `.js` file exists without execute permission, `NEMOCLAW_CMD` is set correctly but the cleanup condition fails, skipping destruction of stale sandboxes and causing "Sandbox already exists" errors on subsequent runs. Change line 179 from `-x` to `-f` to match the condition at line 60.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@greptile-apps
Copy link

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This upstream sync introduces gateway lifecycle awareness across the NemoClaw CLI. The primary change teaches onboard, connect, and status to detect whether the named NemoClaw gateway is already healthy before deciding to reuse, repair, or rebuild it. A new multi-stage recovery path (recoverNamedGatewayRuntime → getReconciledSandboxGatewayState) classifies gateway errors into distinct states (identity_drift, gateway_unreachable_after_restart, gateway_missing_after_restart) and surfaces targeted remediation guidance instead of a generic failure. The e2e test is also substantially improved — it now drives a local fake OpenAI-compatible endpoint instead of relying on a missing NVIDIA_API_KEY to force a late failure.

Key changes:

  • bin/lib/onboard.js: new isGatewayHealthy, getActiveGatewayName, stripAnsi helpers; startGateway refactored into startGatewayWithOptions({ exitOnFailure }) + thin startGateway / startGatewayForRecovery wrappers; preflight reuses healthy gateways.
  • bin/nemoclaw.js: new gateway lifecycle state machine (getNamedGatewayLifecycleState, recoverNamedGatewayRuntime, ensureLiveSandboxOrExit); sandboxConnect and sandboxStatus made async; sandboxLogs corrects --tail--follow.
  • scripts/clean-staged-tree.sh: new helper to strip Python artefacts from blueprint before sandbox build.
  • P1 issue found: isGatewayHealthy() in onboard.js checks statusOutput.includes("Connected") without ANSI stripping, while its sibling getActiveGatewayName() calls stripAnsi(). The equivalent implementation in nemoclaw.js correctly strips ANSI before the Connected check. If openshell emits a coloured status line, isGatewayHealthy() returns false for a live gateway, funnelling execution into the else if (hasStaleGateway) branch and destroying a healthy gateway unnecessarily.
  • P2 issue found: stripAnsi and getActiveGatewayName are duplicated between onboard.js and nemoclaw.js with divergent regex patterns — onboard.js handles all CSI sequences while nemoclaw.js only handles SGR colour sequences.

Confidence Score: 4/5

Safe to merge after fixing the ANSI stripping inconsistency in isGatewayHealthy; all other changes are well-tested and low-risk.

The PR is substantive and well-covered by tests — 9 new unit tests, an improved e2e suite, and a corrected --follow flag. One targeted fix is needed: the connected check in isGatewayHealthy() must call stripAnsi() to match the approach used everywhere else in this PR. Without it, a coloured openshell status line can silently cause healthy-gateway teardown on every onboard run. The duplicate stripAnsi implementations are a maintenance concern but not a blocker.

bin/lib/onboard.js — specifically the isGatewayHealthy function's connected check around line 25.

Important Files Changed

Filename Overview
bin/lib/onboard.js Adds gateway-reuse logic (isGatewayHealthy, getActiveGatewayName, stripAnsi), refactors startGateway into startGatewayWithOptions; connected check in isGatewayHealthy skips ANSI stripping — P1 logic issue that can cause healthy gateway teardown.
bin/nemoclaw.js Adds rich gateway lifecycle state machine (recoverNamedGatewayRuntime, getReconciledSandboxGatewayState, ensureLiveSandboxOrExit); sandbox connect/status made async; duplicates stripAnsi/getActiveGatewayName with a narrower regex than onboard.js.
test/cli.test.js Adds 9 new integration tests covering gateway lifecycle states, ANSI-decorated errors, stale registry cleanup, and --follow log streaming; thorough and well-structured.
test/e2e/test-double-onboard.sh Rewrites e2e test to use a local fake OpenAI endpoint, adds phases for gateway reuse, multi-sandbox coexistence, stale registry reconciliation, and lifecycle response after gateway stop.
test/onboard.test.js Adds unit tests for isGatewayHealthy and gateway-reuse path in startGateway; no ANSI test cases for isGatewayHealthy, leaving the onboard.js ANSI inconsistency undetected.
scripts/clean-staged-tree.sh New helper script that removes Python build artefacts (.venv, .pytest_cache, pycache) from a staged blueprint directory before sandbox build.
scripts/nemoclaw-start.sh Swaps ulimit order to set soft limit before hard limit; functionally equivalent for the 512-process cap use-case.
scripts/setup.sh Calls clean-staged-tree.sh after copying the nemoclaw-blueprint to the build context; straightforward and low-risk.
test/gateway-cleanup.test.js Updates assertion to match startGatewayWithOptions refactor; lowers expected destroyGateway call count from ≥3 to ≥2, consistent with the new conditional stale-cleanup logic.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nemoclaw onboard / connect / status] --> B[getReconciledSandboxGatewayState]
    B --> C[getSandboxGatewayState\nopenshell sandbox get]
    C -->|present| D[✅ Return present]
    C -->|missing\nNotFound| E[✅ Return missing\nremove stale registry]
    C -->|gateway_error\nConnection refused /\nhandshake failed /\nauth token missing| F[recoverNamedGatewayRuntime]
    F --> G[getNamedGatewayLifecycleState]
    G -->|healthy_named| H[Return recovered=true]
    G -->|named_unhealthy\nnamed_unreachable\nconnected_other| I[startGatewayForRecovery\nexitOnFailure=false]
    G -->|missing_named| J[gateway select nemoclaw\nre-check state]
    I -->|success or throws| K[gateway select nemoclaw\nre-check state]
    K -->|healthy_named| H
    K -->|still unhealthy| L[Recovery failed]
    J -->|still missing| L
    H --> M[Retry getSandboxGatewayState]
    M -->|present / missing| N[✅ Return with recoveredGateway=true]
    M -->|handshake failed| O[Return identity_drift]
    L --> Q[getNamedGatewayLifecycleState\nclassify final state]
    Q -->|No gateway configured| R[gateway_missing_after_restart]
    Q -->|Connection refused| S[gateway_unreachable_after_restart]
    Q -->|other| T[Return gateway_error\ngatewayRecoveryFailed=true]
    D & E & N & O & R & S & T --> U[sandboxConnect / sandboxStatus\nprint guidance or proceed]
Loading

Comments Outside Diff (2)

  1. bin/lib/onboard.js, line 25-29 (link)

    P1 isGatewayHealthy skips ANSI stripping on the connected check

    getActiveGatewayName() correctly calls stripAnsi() before pattern-matching, but the connected flag reads the raw statusOutput without stripping. If openshell emits a coloured status line like Status: \x1b[32mConnected\x1b[0m, statusOutput.includes("Connected") returns false even though the gateway is live.

    This matters because isGatewayHealthy() gates the "reuse vs destroy" decision in both preflight() and startGatewayWithOptions(). A false false funnels execution into the else if (hasStaleGateway(gwInfo)) branch, which destroys a perfectly healthy gateway.

    The companion implementation in nemoclaw.js (getNamedGatewayLifecycleState()) already applies stripAnsi() before the Connected check — this one should do the same.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: bin/lib/onboard.js
    Line: 25-29
    
    Comment:
    **`isGatewayHealthy` skips ANSI stripping on the `connected` check**
    
    `getActiveGatewayName()` correctly calls `stripAnsi()` before pattern-matching, but the `connected` flag reads the raw `statusOutput` without stripping. If openshell emits a coloured status line like `Status: \x1b[32mConnected\x1b[0m`, `statusOutput.includes("Connected")` returns `false` even though the gateway is live.
    
    This matters because `isGatewayHealthy()` gates the "reuse vs destroy" decision in both `preflight()` and `startGatewayWithOptions()`. A false `false` funnels execution into the `else if (hasStaleGateway(gwInfo))` branch, which destroys a perfectly healthy gateway.
    
    The companion implementation in `nemoclaw.js` (`getNamedGatewayLifecycleState()`) already applies `stripAnsi()` before the Connected check — this one should do the same.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Claude Code

  2. bin/nemoclaw.js, line 301-313 (link)

    P2 Divergent stripAnsi / getActiveGatewayName duplicates across modules

    stripAnsi() and getActiveGatewayName() are now defined in both bin/lib/onboard.js and bin/nemoclaw.js with subtly different regex patterns:

    • onboard.js: /\x1b\[[0-9;]*[A-Za-z]/g — strips any CSI sequence (cursor movement, erase, etc.)
    • nemoclaw.js: /\x1b\[[0-9;]*m/g — strips only SGR (colour/style) sequences

    This means gateway-state matching in the two files can diverge if openshell emits non-SGR escape sequences. Consider extracting both helpers to bin/lib/ansi.js (or bin/lib/openshell-util.js) and importing from both call-sites to keep the behaviour consistent.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: bin/nemoclaw.js
    Line: 301-313
    
    Comment:
    **Divergent `stripAnsi` / `getActiveGatewayName` duplicates across modules**
    
    `stripAnsi()` and `getActiveGatewayName()` are now defined in both `bin/lib/onboard.js` and `bin/nemoclaw.js` with subtly different regex patterns:
    
    - `onboard.js`: `/\x1b\[[0-9;]*[A-Za-z]/g` — strips any CSI sequence (cursor movement, erase, etc.)
    - `nemoclaw.js`: `/\x1b\[[0-9;]*m/g` — strips only SGR (colour/style) sequences
    
    This means gateway-state matching in the two files can diverge if openshell emits non-SGR escape sequences. Consider extracting both helpers to `bin/lib/ansi.js` (or `bin/lib/openshell-util.js`) and importing from both call-sites to keep the behaviour consistent.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Claude Code

Fix All in Claude Code

Prompt To Fix All With AI
This is a comment left during a code review.
Path: bin/lib/onboard.js
Line: 25-29

Comment:
**`isGatewayHealthy` skips ANSI stripping on the `connected` check**

`getActiveGatewayName()` correctly calls `stripAnsi()` before pattern-matching, but the `connected` flag reads the raw `statusOutput` without stripping. If openshell emits a coloured status line like `Status: \x1b[32mConnected\x1b[0m`, `statusOutput.includes("Connected")` returns `false` even though the gateway is live.

This matters because `isGatewayHealthy()` gates the "reuse vs destroy" decision in both `preflight()` and `startGatewayWithOptions()`. A false `false` funnels execution into the `else if (hasStaleGateway(gwInfo))` branch, which destroys a perfectly healthy gateway.

The companion implementation in `nemoclaw.js` (`getNamedGatewayLifecycleState()`) already applies `stripAnsi()` before the Connected check — this one should do the same.

```suggestion
function isGatewayHealthy(statusOutput = "", gwInfoOutput = "") {
  const cleanStatus = stripAnsi(statusOutput);
  const connected = typeof cleanStatus === "string" && cleanStatus.includes("Connected");
  const activeGateway = getActiveGatewayName(statusOutput);
  return connected && activeGateway === GATEWAY_NAME && hasStaleGateway(gwInfoOutput);
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: bin/nemoclaw.js
Line: 301-313

Comment:
**Divergent `stripAnsi` / `getActiveGatewayName` duplicates across modules**

`stripAnsi()` and `getActiveGatewayName()` are now defined in both `bin/lib/onboard.js` and `bin/nemoclaw.js` with subtly different regex patterns:

- `onboard.js`: `/\x1b\[[0-9;]*[A-Za-z]/g` — strips any CSI sequence (cursor movement, erase, etc.)
- `nemoclaw.js`: `/\x1b\[[0-9;]*m/g` — strips only SGR (colour/style) sequences

This means gateway-state matching in the two files can diverge if openshell emits non-SGR escape sequences. Consider extracting both helpers to `bin/lib/ansi.js` (or `bin/lib/openshell-util.js`) and importing from both call-sites to keep the behaviour consistent.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..." | Re-trigger Greptile

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 158-170: The regex checks in getSandboxGatewayState are running
against raw, ANSI-decorated output so matches like "NotFound" or "Missing
gateway auth token" can be missed; update getSandboxGatewayState (which calls
captureOpenshell) to strip ANSI escape sequences (and optionally normalize case)
before running the regex tests—e.g., create a cleanedOutput by removing ANSI
control sequences from result.output (keep original output for return) and run
the three regex tests against cleanedOutput so gateway_error/missing/present are
correctly classified.
- Around line 209-210: The code prematurely returns when lookup.state ===
"missing"; change the logic so NotFound/“missing” is only treated as
authoritative after confirming NemoClaw is the selected and healthy gateway
(e.g., check the active gateway via the selection function you use—such as
getActiveGateway()/selectGateway()—and/or an isGatewayHealthy() check) before
returning lookup; in practice, wrap the existing "if (lookup.state ===
'missing') return lookup;" with a condition that currentGateway === 'nemoclaw'
&& isNemoClawHealthy() (or move the missing-state handling to after gateway
selection/health verification) so other gateways' legitimate "not found"
responses don't cause deletion by status/connect.

In `@scripts/clean-staged-tree.sh`:
- Around line 14-15: The current cleanup only removes top-level .venv and
.pytest_cache; update the cleanup to recursively find and remove any directories
named ".venv" or ".pytest_cache" under the staged tree by replacing the
non-recursive rm invocation with a find-based removal that searches
"$target_dir" for directories matching those names and deletes them (preserve
the existing 2>/dev/null || true behavior to ignore errors); reference the
variables and patterns in the script (target_dir, ".venv", ".pytest_cache", and
the existing find usage for "__pycache__") to locate and change the commands.

In `@test/e2e/test-double-onboard.sh`:
- Around line 179-181: The CLI availability check uses "-x" which requires the
exec bit but NEMOCLAW_CMD treats a present file as usable; change the predicate
to match NEMOCLAW_CMD by checking file existence (-f) or the `command -v` path.
Replace occurrences of [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v
nemoclaw >/dev/null 2>&1 with [ -f "$REPO_ROOT/bin/nemoclaw.js" ] || command -v
nemoclaw >/dev/null 2>&1 in both the Phase 0 block (where run_nemoclaw
"$SANDBOX_A" destroy … and run_nemoclaw "$SANDBOX_B" destroy … are called) and
the similar checks around the later lines (208-213) so the same predicate is
used everywhere.

In `@test/onboard.test.js`:
- Around line 521-526: The test currently asserts commands.length === 1 which is
brittle; change the assertions around the parsed commands (the commands
variable) to stop requiring exact length and instead assert that at least one
command matches the select for "nemoclaw" and that none match gateway 'destroy'
or 'start'. Concretely, replace or remove assert.equal(commands.length, 1) and
use an existence check (e.g., assert.ok(commands.some(c => /gateway' 'select'
'nemoclaw'/.test(c)))) while keeping the two assert.doesNotMatch checks to
ensure no 'destroy'/'start' commands are present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5b328ffc-b5a0-4d0f-8af1-171c47fa2aca

📥 Commits

Reviewing files that changed from the base of the PR and between d124a6d and f61092b.

📒 Files selected for processing (9)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • scripts/clean-staged-tree.sh
  • scripts/nemoclaw-start.sh
  • scripts/setup.sh
  • test/cli.test.js
  • test/e2e/test-double-onboard.sh
  • test/gateway-cleanup.test.js
  • test/onboard.test.js

Comment on lines +158 to +170
function getSandboxGatewayState(sandboxName) {
const result = captureOpenshell(["sandbox", "get", sandboxName]);
const output = result.output;
if (result.status === 0) {
return { state: "present", output };
}
if (/NotFound|sandbox not found/i.test(output)) {
return { state: "missing", output };
}
if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(output)) {
return { state: "gateway_error", output };
}
return { state: "unknown_error", output };
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Classify sandbox get failures on stripped output.

These regexes inspect raw output. ANSI-decorated NotFound, transport error, or Missing gateway auth token currently fall through to unknown_error, which skips both recovery and stale-entry reconciliation.

Suggested fix
 function getSandboxGatewayState(sandboxName) {
   const result = captureOpenshell(["sandbox", "get", sandboxName]);
   const output = result.output;
+  const cleanOutput = stripAnsi(output);
   if (result.status === 0) {
     return { state: "present", output };
   }
-  if (/NotFound|sandbox not found/i.test(output)) {
+  if (/NotFound|sandbox not found/i.test(cleanOutput)) {
     return { state: "missing", output };
   }
-  if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(output)) {
+  if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(cleanOutput)) {
     return { state: "gateway_error", output };
   }
   return { state: "unknown_error", output };
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function getSandboxGatewayState(sandboxName) {
const result = captureOpenshell(["sandbox", "get", sandboxName]);
const output = result.output;
if (result.status === 0) {
return { state: "present", output };
}
if (/NotFound|sandbox not found/i.test(output)) {
return { state: "missing", output };
}
if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(output)) {
return { state: "gateway_error", output };
}
return { state: "unknown_error", output };
function getSandboxGatewayState(sandboxName) {
const result = captureOpenshell(["sandbox", "get", sandboxName]);
const output = result.output;
const cleanOutput = stripAnsi(output);
if (result.status === 0) {
return { state: "present", output };
}
if (/NotFound|sandbox not found/i.test(cleanOutput)) {
return { state: "missing", output };
}
if (/transport error|Connection refused|handshake verification failed|Missing gateway auth token|device identity required/i.test(cleanOutput)) {
return { state: "gateway_error", output };
}
return { state: "unknown_error", output };
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 158 - 170, The regex checks in
getSandboxGatewayState are running against raw, ANSI-decorated output so matches
like "NotFound" or "Missing gateway auth token" can be missed; update
getSandboxGatewayState (which calls captureOpenshell) to strip ANSI escape
sequences (and optionally normalize case) before running the regex tests—e.g.,
create a cleanedOutput by removing ANSI control sequences from result.output
(keep original output for return) and run the three regex tests against
cleanedOutput so gateway_error/missing/present are correctly classified.

Comment on lines +209 to +210
if (lookup.state === "missing") {
return lookup;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t treat NotFound as authoritative before selecting NemoClaw.

This returns missing before you verify that nemoclaw is the active/healthy gateway. If another gateway is selected, openshell sandbox get <name> can legitimately say “not found”, and status/connect will delete a sandbox that still exists on NemoClaw.

Suggested fix
-  if (lookup.state === "missing") {
-    return lookup;
-  }
+  if (lookup.state === "missing") {
+    const lifecycle = getNamedGatewayLifecycleState();
+    if (lifecycle.state !== "healthy_named") {
+      const recovery = await recoverNamedGatewayRuntime();
+      const retried = getSandboxGatewayState(sandboxName);
+      if (retried.state !== "missing") {
+        return { ...retried, recoveredGateway: recovery.recovered, recoveryVia: recovery.via || null };
+      }
+    }
+    return lookup;
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 209 - 210, The code prematurely returns when
lookup.state === "missing"; change the logic so NotFound/“missing” is only
treated as authoritative after confirming NemoClaw is the selected and healthy
gateway (e.g., check the active gateway via the selection function you use—such
as getActiveGateway()/selectGateway()—and/or an isGatewayHealthy() check) before
returning lookup; in practice, wrap the existing "if (lookup.state ===
'missing') return lookup;" with a condition that currentGateway === 'nemoclaw'
&& isNemoClawHealthy() (or move the missing-state handling to after gateway
selection/health verification) so other gateways' legitimate "not found"
responses don't cause deletion by status/connect.

Comment on lines +14 to +15
rm -rf "$target_dir/.venv" "$target_dir/.pytest_cache"
find "$target_dir" -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make .venv and .pytest_cache cleanup recursive too.

Right now only the root-level copies are removed. Nested virtualenvs and pytest caches under the staged tree will still get copied into the Docker context.

Suggested fix
-rm -rf "$target_dir/.venv" "$target_dir/.pytest_cache"
-find "$target_dir" -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
+find "$target_dir" -type d \( -name .venv -o -name .pytest_cache -o -name __pycache__ \) -prune -exec rm -rf -- {} + 2>/dev/null || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf "$target_dir/.venv" "$target_dir/.pytest_cache"
find "$target_dir" -type d -name __pycache__ -prune -exec rm -rf {} + 2>/dev/null || true
find "$target_dir" -type d \( -name .venv -o -name .pytest_cache -o -name __pycache__ \) -prune -exec rm -rf -- {} + 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/clean-staged-tree.sh` around lines 14 - 15, The current cleanup only
removes top-level .venv and .pytest_cache; update the cleanup to recursively
find and remove any directories named ".venv" or ".pytest_cache" under the
staged tree by replacing the non-recursive rm invocation with a find-based
removal that searches "$target_dir" for directories matching those names and
deletes them (preserve the existing 2>/dev/null || true behavior to ignore
errors); reference the variables and patterns in the script (target_dir,
".venv", ".pytest_cache", and the existing find usage for "__pycache__") to
locate and change the commands.

Comment on lines +179 to +181
if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
run_nemoclaw "$SANDBOX_A" destroy --yes 2>/dev/null || true
run_nemoclaw "$SANDBOX_B" destroy --yes 2>/dev/null || true
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align the CLI availability checks with NEMOCLAW_CMD.

NEMOCLAW_CMD runs bin/nemoclaw.js via node when the file merely exists, but these checks require the exec bit. In a repo checkout without that bit, Phase 1 fails early and Phase 0 skips cleanup even though the command is usable.

Suggested fix
-if [ -x "$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1; then
+if { command -v node >/dev/null 2>&1 && [ -f "$REPO_ROOT/bin/nemoclaw.js" ]; } || command -v nemoclaw >/dev/null 2>&1; then

Apply the same predicate in both Phase 0 and Phase 1.

Also applies to: 208-213

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-double-onboard.sh` around lines 179 - 181, The CLI availability
check uses "-x" which requires the exec bit but NEMOCLAW_CMD treats a present
file as usable; change the predicate to match NEMOCLAW_CMD by checking file
existence (-f) or the `command -v` path. Replace occurrences of [ -x
"$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1 with [ -f
"$REPO_ROOT/bin/nemoclaw.js" ] || command -v nemoclaw >/dev/null 2>&1 in both
the Phase 0 block (where run_nemoclaw "$SANDBOX_A" destroy … and run_nemoclaw
"$SANDBOX_B" destroy … are called) and the similar checks around the later lines
(208-213) so the same predicate is used everywhere.

Comment on lines +521 to +526
assert.equal(result.status, 0, result.stderr);
const commands = JSON.parse(result.stdout.trim().split("\n").pop());
assert.equal(commands.length, 1);
assert.match(commands[0], /gateway' 'select' 'nemoclaw'/);
assert.doesNotMatch(commands[0], /gateway' 'destroy'/);
assert.doesNotMatch(commands[0], /gateway' 'start'/);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Relax the one-command assertion in the gateway-reuse test.

The healthy reuse path currently stops the 18789 forward before selecting nemoclaw, so commands.length === 1 will fail even when reuse is working.

Suggested fix
-    assert.equal(commands.length, 1);
-    assert.match(commands[0], /gateway' 'select' 'nemoclaw'/);
-    assert.doesNotMatch(commands[0], /gateway' 'destroy'/);
-    assert.doesNotMatch(commands[0], /gateway' 'start'/);
+    assert.ok(commands.some((command) => /gateway' 'select' 'nemoclaw'/.test(command)));
+    assert.doesNotMatch(commands.join("\n"), /gateway' 'destroy'/);
+    assert.doesNotMatch(commands.join("\n"), /gateway' 'start'/);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 521 - 526, The test currently asserts
commands.length === 1 which is brittle; change the assertions around the parsed
commands (the commands variable) to stop requiring exact length and instead
assert that at least one command matches the select for "nemoclaw" and that none
match gateway 'destroy' or 'start'. Concretely, replace or remove
assert.equal(commands.length, 1) and use an existence check (e.g.,
assert.ok(commands.some(c => /gateway' 'select' 'nemoclaw'/.test(c)))) while
keeping the two assert.doesNotMatch checks to ensure no 'destroy'/'start'
commands are present.

Copy link

@TSavo TSavo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream sync. Unit tests pass. Qodo findings are upstream edge cases — would conflict on next sync if fixed here. No branch protection.

@TSavo TSavo merged commit ac1e79a into main Mar 26, 2026
13 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants