Skip to content

fix: address core blocker lifecycle regressions#1208

Merged
kjw3 merged 9 commits intomainfrom
fix/core-blockers-gpu-policy-lifecycle
Mar 31, 2026
Merged

fix: address core blocker lifecycle regressions#1208
kjw3 merged 9 commits intomainfrom
fix/core-blockers-gpu-policy-lifecycle

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Mar 31, 2026

Summary

  • stop requiring NVIDIA_API_KEY for local-only nemoclaw start and only gate the Telegram bridge when that bridge actually needs the key
  • clean up the dashboard forward, nemoclaw gateway, and openshell-cluster-nemoclaw Docker volumes when the last sandbox is destroyed
  • broaden unified-memory NVIDIA GPU detection beyond GB10 while keeping spark: true specific to GB10
  • harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean version: 1 scaffold instead of producing malformed YAML

Issue Mapping

Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991

Notes

  • #1188 was investigated but is not included in this PR.
  • The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix.
  • Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage.

Validation

npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit

References Reviewed

Signed-off-by: Kevin Jones kejones@nvidia.com

Summary by CodeRabbit

  • New Features

    • Core services can start without an NVIDIA API key.
    • Enhanced unified‑memory GPU detection with more accurate capability reporting.
  • Bug Fixes

    • Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain.
    • Telegram bridge now starts only when both required tokens are present; clearer startup warnings.
    • Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting.
  • Tests

    • Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a94f89ee-e115-4da9-abea-3dd6fb115dff

📥 Commits

Reviewing files that changed from the base of the PR and between ba10195 and 872636f.

📒 Files selected for processing (1)
  • test/cli.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/cli.test.js

📝 Walkthrough

Walkthrough

Adds unified-memory NVIDIA GPU detection with an exported memory-based NIM capability helper, tightens policy YAML extraction/normalization, performs OpenShell gateway cleanup when the last sandbox is destroyed, and relaxes startup failure for missing NVIDIA_API_KEY while requiring both TELEGRAM_BOT_TOKEN and NVIDIA_API_KEY to start the Telegram bridge.

Changes

Cohort / File(s) Summary
GPU detection & NIM capability
bin/lib/nim.js, test/nim.test.js
Exported canRunNimWithMemory(totalMemoryMB); refactored detection to parse all nvidia-smi GPU names, identify unified-memory tags, compute per-GPU MB, set unifiedMemory, spark, name, count, and derive nimCapable via the new helper. Added unit tests for unified-memory cases.
Policy parsing & merging
bin/lib/policies.js, test/policies.test.js
parseCurrentPolicy(raw) now extracts the post---- candidate, rejects status/error-like or non-YAML-shaped inputs, and returns "" for parse failures or non-object results. mergePresetIntoPolicy merges against the normalized policy and prefixes synthesized headers with version: 1\n\n. Tests added for metadata-only and formatting cases.
Gateway lifecycle / sandbox destroy
bin/nemoclaw.js, test/cli.test.js
Added constants/helpers to stop forwarded ports and destroy the named OpenShell gateway; sandboxDestroy() triggers gateway cleanup only when deletion succeeded, local registry became empty, and the gateway reports no live sandboxes. CLI tests validate teardown occurs only when destroying the last sandbox.
Service startup & env gating
scripts/start-services.sh, test/service-env.test.js
Removed fatal abort for unset NVIDIA_API_KEY; added conditional warnings and changed Telegram bridge startup so it requires both TELEGRAM_BOT_TOKEN and NVIDIA_API_KEY. Integration tests added to validate messaging and gating.
Tests & minor edits
test/service-env.test.js, test/cli.test.js, test/nim.test.js, test/policies.test.js
Added/updated unit and integration tests to cover the above behaviors; minor import and fixture adjustments across tests.

Sequence Diagram(s)

sequenceDiagram
  participant Detector as nim.detectGpu
  participant Runner as runner.runCapture
  participant NVSMI as "nvidia-smi"
  participant SysMem as "free -m"
  participant Logic as canRunNimWithMemory

  Detector->>Runner: runCapture("nvidia-smi --query-gpu=name --format=csv,noheader")
  Runner-->>Detector: GPU names list
  Detector->>Runner: runCapture("free -m")
  Runner-->>Detector: total memory MB
  Detector->>Logic: call canRunNimWithMemory(totalMemoryMB)
  Logic-->>Detector: nimCapable (true/false)
  Detector-->>Detector: set unifiedMemory, spark, name, count, perGpuMB
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the GPUs, counted each name,

Split memory slices, then called the game,
Policies cleaned, gateways softly closed,
Bridges wait for keys until exposed,
A hop, a patch — the system hums again.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'fix: address core blocker lifecycle regressions' is vague and generic, failing to communicate the specific changes (API key gating, gateway cleanup, GPU detection, policy handling). Consider a more descriptive title like 'fix: remove API key requirement, clean up gateway on destroy, improve GPU detection' that highlights the main technical changes.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed All primary coding requirements from issues #1191, #1160, #1182, and #1162 are comprehensively addressed with corresponding implementation changes across all modified files.
Out of Scope Changes check ✅ Passed All changes are directly aligned with linked issues; no extraneous modifications or unrelated features detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/core-blockers-gpu-policy-lifecycle

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/policies.js`:
- Around line 78-86: The candidate extraction currently uses regex-only checks
and can allow truncated/invalid YAML (e.g., "version: 1\nnetwork_policies") to
be treated as valid; instead, after computing candidate in the same block (the
variable named candidate), attempt to parse it as YAML (using the project's YAML
parser used elsewhere) and if parsing fails or the result is not an
object/expected structure, return "" so mergePresetIntoPolicy() won't reuse
malformed YAML; keep the existing regex guards but add this strict syntactic
validation step before returning candidate.

In `@bin/nemoclaw.js`:
- Around line 1100-1105: The current cleanup calls
cleanupGatewayAfterLastSandbox based only on registry.removeSandbox and
registry.listSandboxes; instead, ensure the openshell deletion actually
succeeded and confirm there are no live sandboxes reported by openshell before
tearing down the shared gateway: capture the result/exit status of
runOpenshell(["sandbox","delete", sandboxName]) and only proceed if it indicates
success, then run the equivalent of openshell sandbox list (e.g., via
runOpenshell(["sandbox","list"]) or the helper that lists live sandboxes) and
verify its output shows zero live sandboxes; only when both the delete succeeded
and the live openshell list is empty should you call
cleanupGatewayAfterLastSandbox (keep using
registry.removeSandbox/registry.listSandboxes as local updates but do not gate
teardown on them alone).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb24de21-f522-42b8-bafb-0e77ec27eba0

📥 Commits

Reviewing files that changed from the base of the PR and between 08cccd4 and c9485d4.

📒 Files selected for processing (8)
  • bin/lib/nim.js
  • bin/lib/policies.js
  • bin/nemoclaw.js
  • scripts/start-services.sh
  • test/cli.test.js
  • test/nim.test.js
  • test/policies.test.js
  • test/service-env.test.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
bin/nemoclaw.js (1)

1109-1120: ⚠️ Potential issue | 🔴 Critical

Only remove the local sandbox after a successful OpenShell delete.

Line 1111 still drops the registry entry when openshell sandbox delete fails, and Line 1120 still reports success. A transient delete failure will leave the sandbox running but make it disappear locally, which also blocks a clean retry.

💡 Suggested fix
   const deleteResult = runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
+  if (deleteResult.status !== 0) {
+    exitWithSpawnResult(deleteResult);
+  }
 
   const removed = registry.removeSandbox(sandboxName);
   if (
-    deleteResult.status === 0 &&
     removed &&
     registry.listSandboxes().sandboxes.length === 0 &&
     hasNoLiveSandboxes()
   ) {
     cleanupGatewayAfterLastSandbox();
   }
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/cli.test.js`:
- Around line 357-360: The test mock for the "sandbox list" command is missing
the table header so hasNoLiveSandboxes() (which calls parseLiveSandboxNames())
skips the only line and finds no sandboxes; update the mocked output to include
the same header parseLiveSandboxNames() expects (e.g., the "NAME STATUS" or
exact header string used by parseLiveSandboxNames()) as the first printed line
followed by the live sandbox row ("beta Ready") so the parser will recognize a
live sandbox in the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98d219de-6863-429b-8f68-0608dc862207

📥 Commits

Reviewing files that changed from the base of the PR and between c9485d4 and 587d5b1.

📒 Files selected for processing (4)
  • bin/lib/policies.js
  • bin/nemoclaw.js
  • test/cli.test.js
  • test/policies.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/policies.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/policies.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/cli.test.js (1)

357-360: ⚠️ Potential issue | 🟡 Minor

Emit the real sandbox list header in this mock.

hasNoLiveSandboxes() parses sandbox list output, and this fixture only prints beta Ready. If the parser still treats the first row as the table header, this case models zero live sandboxes and can drop into the cleanup path instead of the “keep runtime” branch.

💡 Suggested fix
         'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then',
-        '  printf "beta Ready\\n" >> "$log_file"',
-        '  printf "beta Ready\\n"',
+        '  printf "NAME STATUS\\nbeta Ready\\n" >> "$log_file"',
+        '  printf "NAME STATUS\\nbeta Ready\\n"',
         "  exit 0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 357 - 360, The test mock for the `sandbox
list` command only prints "beta Ready", which causes hasNoLiveSandboxes() to
misinterpret that row as the table header; update the mocked output in the
test/cli.test.js snippet (the shell block that prints to "$log_file") to emit
the real `sandbox list` header line expected by hasNoLiveSandboxes() (e.g., the
same column header string used by the parser) before printing the "beta Ready"
row so the parser treats it as data and the test follows the “keep runtime”
branch.
🧹 Nitpick comments (1)
test/cli.test.js (1)

321-325: Assert the local-registry short-circuit explicitly.

This case only proves cleanup was skipped. It would still pass if sandboxDestroy() queried sandbox list even when local sandboxes still remained, so it doesn’t actually lock in the branch split added around registry.listSandboxes().sandboxes.length === 0.

💡 Suggested assertion
     expect(r.code).toBe(0);
     expect(fs.readFileSync(openshellLog, "utf8")).toContain("sandbox delete alpha");
+    expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("NAME STATUS");
     expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("forward stop 18789");
     expect(fs.readFileSync(openshellLog, "utf8")).not.toContain("gateway destroy -g nemoclaw");
     expect(fs.readFileSync(bashLog, "utf8")).not.toContain("docker volume ls -q --filter");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 321 - 325, The test currently only checks
cleanup was skipped but doesn't ensure the local-registry short-circuit branch
in sandboxDestroy was taken; add an explicit assertion that
registry.listSandboxes() was not invoked (or was not awaited) to prove the
short-circuit around registry.listSandboxes().sandboxes.length === 0 was used.
Instrument the test to spy/mock the registry object's listSandboxes method (the
same registry used by sandboxDestroy) before calling the CLI path, then after
execution assert the spy was not called (e.g.,
expect(listSandboxesSpy).not.toHaveBeenCalled()) to lock in the branch behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/cli.test.js`:
- Around line 357-360: The test mock for the `sandbox list` command only prints
"beta Ready", which causes hasNoLiveSandboxes() to misinterpret that row as the
table header; update the mocked output in the test/cli.test.js snippet (the
shell block that prints to "$log_file") to emit the real `sandbox list` header
line expected by hasNoLiveSandboxes() (e.g., the same column header string used
by the parser) before printing the "beta Ready" row so the parser treats it as
data and the test follows the “keep runtime” branch.

---

Nitpick comments:
In `@test/cli.test.js`:
- Around line 321-325: The test currently only checks cleanup was skipped but
doesn't ensure the local-registry short-circuit branch in sandboxDestroy was
taken; add an explicit assertion that registry.listSandboxes() was not invoked
(or was not awaited) to prove the short-circuit around
registry.listSandboxes().sandboxes.length === 0 was used. Instrument the test to
spy/mock the registry object's listSandboxes method (the same registry used by
sandboxDestroy) before calling the CLI path, then after execution assert the spy
was not called (e.g., expect(listSandboxesSpy).not.toHaveBeenCalled()) to lock
in the branch behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acbb22b6-c4c9-4023-9a5b-28a7dd0ad0b8

📥 Commits

Reviewing files that changed from the base of the PR and between 587d5b1 and 66a8ec2.

📒 Files selected for processing (1)
  • test/cli.test.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/cli.test.js (1)

295-298: Isolate the local-registry guard in this case.

This mock also makes sandbox list return beta Ready, so cleanup stays skipped even if the registry.listSandboxes().sandboxes.length === 0 check in sandboxDestroy() regresses. Returning an empty live list here would make this test actually pin the “other sandboxes still exist locally” branch.

♻️ Suggested tweak
         'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then',
-        '  printf "NAME STATUS\\nbeta Ready\\n" >> "$log_file"',
-        '  printf "NAME STATUS\\nbeta Ready\\n"',
+        '  printf "NAME STATUS\\n" >> "$log_file"',
+        '  printf "NAME STATUS\\n"',
         "  exit 0",
         "fi",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 295 - 298, The mock for the shell case 'if [
"$1" = "sandbox" ] && [ "$2" = "list" ]' currently returns "beta Ready" to both
the log file and stdout, which masks regressions in sandboxDestroy() because
registry.listSandboxes().sandboxes will appear non-empty; change the mock so it
still writes "beta Ready" to the "$log_file" but prints an empty list (no
sandbox lines) to stdout so registry.listSandboxes() yields an empty array and
the test exercises the "other sandboxes exist locally" branch; locate the case
block in test/cli.test.js and adjust the two printf calls accordingly (keep the
printf to "$log_file" but make the stdout printf produce an empty/blank output).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/cli.test.js`:
- Around line 321-325: The test currently unconditionally reads bashLog (and
similar files) to assert negative conditions, which causes ENOENT when the
helper never ran; change these assertions to first check file existence (e.g.,
fs.existsSync(bashLog)) and only read-and-assert when the file exists; for files
that should be absent you can instead assert that the file does not exist or
guard the negative contains check behind an existence check; apply the same
pattern to the other occurrences mentioned (the block around
cleanupGatewayAfterLastSandbox and the similar assertions at the 383-388 region)
and reference the variables openshellLog and bashLog and the helper
cleanupGatewayAfterLastSandbox when locating the code to update.

---

Nitpick comments:
In `@test/cli.test.js`:
- Around line 295-298: The mock for the shell case 'if [ "$1" = "sandbox" ] && [
"$2" = "list" ]' currently returns "beta Ready" to both the log file and stdout,
which masks regressions in sandboxDestroy() because
registry.listSandboxes().sandboxes will appear non-empty; change the mock so it
still writes "beta Ready" to the "$log_file" but prints an empty list (no
sandbox lines) to stdout so registry.listSandboxes() yields an empty array and
the test exercises the "other sandboxes exist locally" branch; locate the case
block in test/cli.test.js and adjust the two printf calls accordingly (keep the
printf to "$log_file" but make the stdout printf produce an empty/blank output).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1610c86f-ad47-4ae7-9db9-261c602590a5

📥 Commits

Reviewing files that changed from the base of the PR and between 66a8ec2 and ba10195.

📒 Files selected for processing (1)
  • test/cli.test.js

@kjw3 kjw3 merged commit a308420 into main Mar 31, 2026
10 checks passed
@kjw3 kjw3 deleted the fix/core-blockers-gpu-policy-lifecycle branch March 31, 2026 23:39
realkim93 added a commit to realkim93/NemoClaw that referenced this pull request Apr 1, 2026
Merge origin/main to resolve conflicts from recent changes:
- NVIDIA#1208 core blocker lifecycle regressions
- NVIDIA#1200 Prettier formatting
- NVIDIA#836 GPU VRAM checks

Jetson detection now leverages main's UNIFIED_MEMORY_GPU_TAGS
(Orin/Thor/Xavier) with added jetson flag and /proc/device-tree
fallback. All 118 tests pass.
realkim93 added a commit to realkim93/NemoClaw that referenced this pull request Apr 1, 2026
Merge origin/main into feat/jetson-orin-nano-support to resolve
conflicts from recent changes (NVIDIA#1208, NVIDIA#1200, NVIDIA#836, NVIDIA#1221, NVIDIA#1223).

Jetson detection now leverages main's UNIFIED_MEMORY_GPU_TAGS
with added jetson flag and /proc/device-tree fallback.

All 116 tests pass.
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
## Summary
- stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and
only gate the Telegram bridge when that bridge actually needs the key
- clean up the dashboard forward, `nemoclaw` gateway, and
`openshell-cluster-nemoclaw` Docker volumes when the last sandbox is
destroyed
- broaden unified-memory NVIDIA GPU detection beyond `GB10` while
keeping `spark: true` specific to GB10
- harden policy merge/retry behavior so truncated or error-like
current-policy reads rebuild from a clean `version: 1` scaffold instead
of producing malformed YAML

## Issue Mapping
Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991

## Notes
- `#1188` was investigated but is not included in this PR.
- The current evidence still points to a deeper runtime / proxy
reachability problem on macOS + Colima rather than a bounded
NemoClaw-only fix.
- Keeping it out of this branch avoids speculative networking changes
without strong reproduction and cross-platform coverage.

## Validation
```bash
npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit
```

## References Reviewed
- #1106
- #308
- #95
- #770

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  - Core services can start without an NVIDIA API key.
- Enhanced unified‑memory GPU detection with more accurate capability
reporting.

* **Bug Fixes**
- Gateway and forwarded‑port cleanup only runs when the last sandbox is
removed and no live sandboxes remain.
- Telegram bridge now starts only when both required tokens are present;
clearer startup warnings.
- Policy parsing/merge more robust for metadata‑only or malformed
inputs; consistent version header formatting.

* **Tests**
- Added tests covering GPU detection, policy parsing/merge, CLI
sandbox/gateway flows, and service startup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
## Summary
- stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and
only gate the Telegram bridge when that bridge actually needs the key
- clean up the dashboard forward, `nemoclaw` gateway, and
`openshell-cluster-nemoclaw` Docker volumes when the last sandbox is
destroyed
- broaden unified-memory NVIDIA GPU detection beyond `GB10` while
keeping `spark: true` specific to GB10
- harden policy merge/retry behavior so truncated or error-like
current-policy reads rebuild from a clean `version: 1` scaffold instead
of producing malformed YAML

## Issue Mapping
Fixes NVIDIA#1191
Fixes NVIDIA#1160
Fixes NVIDIA#1182
Fixes NVIDIA#1162
Related NVIDIA#991

## Notes
- `NVIDIA#1188` was investigated but is not included in this PR.
- The current evidence still points to a deeper runtime / proxy
reachability problem on macOS + Colima rather than a bounded
NemoClaw-only fix.
- Keeping it out of this branch avoids speculative networking changes
without strong reproduction and cross-platform coverage.

## Validation
```bash
npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit
```

## References Reviewed
- NVIDIA#1106
- NVIDIA#308
- NVIDIA#95
- NVIDIA#770

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  - Core services can start without an NVIDIA API key.
- Enhanced unified‑memory GPU detection with more accurate capability
reporting.

* **Bug Fixes**
- Gateway and forwarded‑port cleanup only runs when the last sandbox is
removed and no live sandboxes remain.
- Telegram bridge now starts only when both required tokens are present;
clearer startup warnings.
- Policy parsing/merge more robust for metadata‑only or malformed
inputs; consistent version header formatting.

* **Tests**
- Added tests covering GPU detection, policy parsing/merge, CLI
sandbox/gateway flows, and service startup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants