fix: address core blocker lifecycle regressions#1208
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds unified-memory NVIDIA GPU detection with an exported memory-based NIM capability helper, tightens policy YAML extraction/normalization, performs OpenShell gateway cleanup when the last sandbox is destroyed, and relaxes startup failure for missing NVIDIA_API_KEY while requiring both TELEGRAM_BOT_TOKEN and NVIDIA_API_KEY to start the Telegram bridge. Changes
Sequence Diagram(s)sequenceDiagram
participant Detector as nim.detectGpu
participant Runner as runner.runCapture
participant NVSMI as "nvidia-smi"
participant SysMem as "free -m"
participant Logic as canRunNimWithMemory
Detector->>Runner: runCapture("nvidia-smi --query-gpu=name --format=csv,noheader")
Runner-->>Detector: GPU names list
Detector->>Runner: runCapture("free -m")
Runner-->>Detector: total memory MB
Detector->>Logic: call canRunNimWithMemory(totalMemoryMB)
Logic-->>Detector: nimCapable (true/false)
Detector-->>Detector: set unifiedMemory, spark, name, count, perGpuMB
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/policies.js`:
- Around line 78-86: The candidate extraction currently uses regex-only checks
and can allow truncated/invalid YAML (e.g., "version: 1\nnetwork_policies") to
be treated as valid; instead, after computing candidate in the same block (the
variable named candidate), attempt to parse it as YAML (using the project's YAML
parser used elsewhere) and if parsing fails or the result is not an
object/expected structure, return "" so mergePresetIntoPolicy() won't reuse
malformed YAML; keep the existing regex guards but add this strict syntactic
validation step before returning candidate.
In `@bin/nemoclaw.js`:
- Around line 1100-1105: The current cleanup calls
cleanupGatewayAfterLastSandbox based only on registry.removeSandbox and
registry.listSandboxes; instead, ensure the openshell deletion actually
succeeded and confirm there are no live sandboxes reported by openshell before
tearing down the shared gateway: capture the result/exit status of
runOpenshell(["sandbox","delete", sandboxName]) and only proceed if it indicates
success, then run the equivalent of openshell sandbox list (e.g., via
runOpenshell(["sandbox","list"]) or the helper that lists live sandboxes) and
verify its output shows zero live sandboxes; only when both the delete succeeded
and the live openshell list is empty should you call
cleanupGatewayAfterLastSandbox (keep using
registry.removeSandbox/registry.listSandboxes as local updates but do not gate
teardown on them alone).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fb24de21-f522-42b8-bafb-0e77ec27eba0
📒 Files selected for processing (8)
bin/lib/nim.jsbin/lib/policies.jsbin/nemoclaw.jsscripts/start-services.shtest/cli.test.jstest/nim.test.jstest/policies.test.jstest/service-env.test.js
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bin/nemoclaw.js (1)
1109-1120:⚠️ Potential issue | 🔴 CriticalOnly remove the local sandbox after a successful OpenShell delete.
Line 1111 still drops the registry entry when
openshell sandbox deletefails, and Line 1120 still reports success. A transient delete failure will leave the sandbox running but make it disappear locally, which also blocks a clean retry.💡 Suggested fix
const deleteResult = runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true }); + if (deleteResult.status !== 0) { + exitWithSpawnResult(deleteResult); + } const removed = registry.removeSandbox(sandboxName); if ( - deleteResult.status === 0 && removed && registry.listSandboxes().sandboxes.length === 0 && hasNoLiveSandboxes() ) { cleanupGatewayAfterLastSandbox(); }
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/cli.test.js`:
- Around line 357-360: The test mock for the "sandbox list" command is missing
the table header so hasNoLiveSandboxes() (which calls parseLiveSandboxNames())
skips the only line and finds no sandboxes; update the mocked output to include
the same header parseLiveSandboxNames() expects (e.g., the "NAME STATUS" or
exact header string used by parseLiveSandboxNames()) as the first printed line
followed by the live sandbox row ("beta Ready") so the parser will recognize a
live sandbox in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 98d219de-6863-429b-8f68-0608dc862207
📒 Files selected for processing (4)
bin/lib/policies.jsbin/nemoclaw.jstest/cli.test.jstest/policies.test.js
✅ Files skipped from review due to trivial changes (1)
- test/policies.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- bin/lib/policies.js
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/cli.test.js (1)
357-360:⚠️ Potential issue | 🟡 MinorEmit the real
sandbox listheader in this mock.
hasNoLiveSandboxes()parsessandbox listoutput, and this fixture only printsbeta Ready. If the parser still treats the first row as the table header, this case models zero live sandboxes and can drop into the cleanup path instead of the “keep runtime” branch.💡 Suggested fix
'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then', - ' printf "beta Ready\\n" >> "$log_file"', - ' printf "beta Ready\\n"', + ' printf "NAME STATUS\\nbeta Ready\\n" >> "$log_file"', + ' printf "NAME STATUS\\nbeta Ready\\n"', " exit 0",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/cli.test.js` around lines 357 - 360, The test mock for the `sandbox list` command only prints "beta Ready", which causes hasNoLiveSandboxes() to misinterpret that row as the table header; update the mocked output in the test/cli.test.js snippet (the shell block that prints to "$log_file") to emit the real `sandbox list` header line expected by hasNoLiveSandboxes() (e.g., the same column header string used by the parser) before printing the "beta Ready" row so the parser treats it as data and the test follows the “keep runtime” branch.
🧹 Nitpick comments (1)
test/cli.test.js (1)
321-325: Assert the local-registry short-circuit explicitly.This case only proves cleanup was skipped. It would still pass if
sandboxDestroy()queriedsandbox listeven when local sandboxes still remained, so it doesn’t actually lock in the branch split added aroundregistry.listSandboxes().sandboxes.length === 0.💡 Suggested assertion
expect(r.code).toBe(0); expect(fs.readFileSync(openshellLog, "utf8")).toContain("sandbox delete alpha"); + expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("NAME STATUS"); expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("forward stop 18789"); expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("gateway destroy -g nemoclaw"); expect(fs.readFileSync(bashLog, "utf8")).not.toContain("docker volume ls -q --filter");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/cli.test.js` around lines 321 - 325, The test currently only checks cleanup was skipped but doesn't ensure the local-registry short-circuit branch in sandboxDestroy was taken; add an explicit assertion that registry.listSandboxes() was not invoked (or was not awaited) to prove the short-circuit around registry.listSandboxes().sandboxes.length === 0 was used. Instrument the test to spy/mock the registry object's listSandboxes method (the same registry used by sandboxDestroy) before calling the CLI path, then after execution assert the spy was not called (e.g., expect(listSandboxesSpy).not.toHaveBeenCalled()) to lock in the branch behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/cli.test.js`:
- Around line 357-360: The test mock for the `sandbox list` command only prints
"beta Ready", which causes hasNoLiveSandboxes() to misinterpret that row as the
table header; update the mocked output in the test/cli.test.js snippet (the
shell block that prints to "$log_file") to emit the real `sandbox list` header
line expected by hasNoLiveSandboxes() (e.g., the same column header string used
by the parser) before printing the "beta Ready" row so the parser treats it as
data and the test follows the “keep runtime” branch.
---
Nitpick comments:
In `@test/cli.test.js`:
- Around line 321-325: The test currently only checks cleanup was skipped but
doesn't ensure the local-registry short-circuit branch in sandboxDestroy was
taken; add an explicit assertion that registry.listSandboxes() was not invoked
(or was not awaited) to prove the short-circuit around
registry.listSandboxes().sandboxes.length === 0 was used. Instrument the test to
spy/mock the registry object's listSandboxes method (the same registry used by
sandboxDestroy) before calling the CLI path, then after execution assert the spy
was not called (e.g., expect(listSandboxesSpy).not.toHaveBeenCalled()) to lock
in the branch behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: acbb22b6-c4c9-4023-9a5b-28a7dd0ad0b8
📒 Files selected for processing (1)
test/cli.test.js
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/cli.test.js (1)
295-298: Isolate the local-registry guard in this case.This mock also makes
sandbox listreturnbeta Ready, so cleanup stays skipped even if theregistry.listSandboxes().sandboxes.length === 0check insandboxDestroy()regresses. Returning an empty live list here would make this test actually pin the “other sandboxes still exist locally” branch.♻️ Suggested tweak
'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then', - ' printf "NAME STATUS\\nbeta Ready\\n" >> "$log_file"', - ' printf "NAME STATUS\\nbeta Ready\\n"', + ' printf "NAME STATUS\\n" >> "$log_file"', + ' printf "NAME STATUS\\n"', " exit 0", "fi",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/cli.test.js` around lines 295 - 298, The mock for the shell case 'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]' currently returns "beta Ready" to both the log file and stdout, which masks regressions in sandboxDestroy() because registry.listSandboxes().sandboxes will appear non-empty; change the mock so it still writes "beta Ready" to the "$log_file" but prints an empty list (no sandbox lines) to stdout so registry.listSandboxes() yields an empty array and the test exercises the "other sandboxes exist locally" branch; locate the case block in test/cli.test.js and adjust the two printf calls accordingly (keep the printf to "$log_file" but make the stdout printf produce an empty/blank output).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/cli.test.js`:
- Around line 321-325: The test currently unconditionally reads bashLog (and
similar files) to assert negative conditions, which causes ENOENT when the
helper never ran; change these assertions to first check file existence (e.g.,
fs.existsSync(bashLog)) and only read-and-assert when the file exists; for files
that should be absent you can instead assert that the file does not exist or
guard the negative contains check behind an existence check; apply the same
pattern to the other occurrences mentioned (the block around
cleanupGatewayAfterLastSandbox and the similar assertions at the 383-388 region)
and reference the variables openshellLog and bashLog and the helper
cleanupGatewayAfterLastSandbox when locating the code to update.
---
Nitpick comments:
In `@test/cli.test.js`:
- Around line 295-298: The mock for the shell case 'if [ "$1" = "sandbox" ] && [
"$2" = "list" ]' currently returns "beta Ready" to both the log file and stdout,
which masks regressions in sandboxDestroy() because
registry.listSandboxes().sandboxes will appear non-empty; change the mock so it
still writes "beta Ready" to the "$log_file" but prints an empty list (no
sandbox lines) to stdout so registry.listSandboxes() yields an empty array and
the test exercises the "other sandboxes exist locally" branch; locate the case
block in test/cli.test.js and adjust the two printf calls accordingly (keep the
printf to "$log_file" but make the stdout printf produce an empty/blank output).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1610c86f-ad47-4ae7-9db9-261c602590a5
📒 Files selected for processing (1)
test/cli.test.js
Merge origin/main to resolve conflicts from recent changes: - NVIDIA#1208 core blocker lifecycle regressions - NVIDIA#1200 Prettier formatting - NVIDIA#836 GPU VRAM checks Jetson detection now leverages main's UNIFIED_MEMORY_GPU_TAGS (Orin/Thor/Xavier) with added jetson flag and /proc/device-tree fallback. All 118 tests pass.
Merge origin/main into feat/jetson-orin-nano-support to resolve conflicts from recent changes (NVIDIA#1208, NVIDIA#1200, NVIDIA#836, NVIDIA#1221, NVIDIA#1223). Jetson detection now leverages main's UNIFIED_MEMORY_GPU_TAGS with added jetson flag and /proc/device-tree fallback. All 116 tests pass.
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes NVIDIA#1191 Fixes NVIDIA#1160 Fixes NVIDIA#1182 Fixes NVIDIA#1162 Related NVIDIA#991 ## Notes - `NVIDIA#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - NVIDIA#1106 - NVIDIA#308 - NVIDIA#95 - NVIDIA#770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
NVIDIA_API_KEYfor local-onlynemoclaw startand only gate the Telegram bridge when that bridge actually needs the keynemoclawgateway, andopenshell-cluster-nemoclawDocker volumes when the last sandbox is destroyedGB10while keepingspark: truespecific to GB10version: 1scaffold instead of producing malformed YAMLIssue Mapping
Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991
Notes
#1188was investigated but is not included in this PR.Validation
References Reviewed
Signed-off-by: Kevin Jones kejones@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests