Summary
Migrate the NemoClaw CLI from ~4,700 lines of shell + ~4,200 lines of untyped CJS to strict ESM TypeScript on oclif, delivered across 6 sequential PRs. This subsumes #909 (oclif migration): it addresses the same 24 issues and supersedes the same 17 PRs listed there.
No user-facing behavioral changes. The CLI commands, flags, and output stay the same — the internals become typed, tested, and structurally sound.
Motivation
The current CLI has three categories of problem that compound each other:
Shell duplication. Three copies of the install logic (install.sh, scripts/install.sh, inline in onboard.js). A dead setup.sh that leaks NVIDIA_API_KEY on the command line. An uninstall path that exists entirely outside the JS test suite. ~4,700 lines of shell that mostly duplicate logic already in bin/lib/*.js.
No type safety. The 12 modules in bin/lib/ are untyped CJS. The CLI and plugin share domain concepts (sandbox entries, onboard config, credential shapes, endpoint types) but have no shared type definitions — the same shapes are implicit in JS and explicit in TS, with no compile-time guarantee they agree. Silent bugs like registerSandbox({ naem: "foo" }) store name: undefined with no complaint.
No CLI framework. The 515-line switch/case dispatcher in bin/nemoclaw.js has no shell completions, no per-command --help, no flag validation, no --json output, no structured error handling. Fixing these individually is less efficient than fixing the architecture.
Architecture
oclif commands live at the repo root. The OpenClaw plugin (nemoclaw/src/) is a separate concern (Commander-based, different runtime) and is unchanged.
src/
├── commands/                    oclif command classes (one file per command)
│   ├── onboard.ts
│   ├── deploy.ts
│   ├── list.ts
│   ├── status.ts
│   ├── start.ts / stop.ts
│   ├── debug.ts / uninstall.ts
│   └── sandbox/
│       ├── connect.ts
│       ├── status.ts
│       ├── logs.ts
│       ├── destroy.ts
│       └── policy/
│           ├── add.ts
│           └── list.ts
├── lib/                         CLI library modules (converted from bin/lib/)
│   ├── runner.ts                Subprocess execution via execa ($$ helper)
│   ├── registry.ts              Multi-sandbox state (~/.nemoclaw/sandboxes.json)
│   ├── credentials.ts           API key and token management
│   ├── platform.ts              OS/arch/Docker socket detection
│   ├── uninstall.ts             Uninstall steps (Docker, npm, state)
│   ├── services.ts              PID management for helper services
│   ├── debug.ts                 Diagnostic collection with secret redaction
│   ├── onboard.ts               Onboard wizard logic
│   └── ...
└── hooks/
    └── command-not-found.ts     Backward compat: nemoclaw <sandbox> <action>
shared/
└── types.ts                     Domain types shared between CLI and plugin
nemoclaw/
└── src/                         OpenClaw plugin (unchanged)
Two TypeScript projects (tsconfig.cli.json at root + nemoclaw/tsconfig.json) with shared/ bridging them.
Plan
6 sequential PRs. Each builds on the last.
flowchart TD
PR0["**PR 0** — Foundation ✅ #913\ntsconfig.cli.json · root execa · TS coverage ratchet"]
PR1["**PR 1** — Security fix + first TS batch\nKill setup.sh · shared/types.ts\nrunner · registry · credentials · platform → src/lib/"]
PR2["**PR 2** — Port shell scripts to TS\nuninstall.sh · start-services.sh · debug.sh → src/lib/\n+ nim · policies · inference-config · local-inference"]
PR3["**PR 3** — oclif scaffold + simple commands\nScaffold oclif · list · status · start · stop · debug · uninstall\nMerge installers · port fix-coredns · convert onboard + preflight"]
PR4["**PR 4** — oclif complex commands\nSandbox topic commands · deploy · onboard wizard\nCentralized error handling · secret redaction"]
PR5["**PR 5** — E2E migration + cleanup\nAll e2e → vitest TS · shell completions\nDelete old dispatcher · strict TS everywhere"]
PR0 --> PR1
PR1 --> PR2
PR2 --> PR3
PR3 --> PR4
PR4 --> PR5
style PR0 fill:#2d6a2d,color:#fff
PR 0: Foundation ✅ #913
Done. tsconfig.cli.json (strict, noEmit), execa ^9.6.1 in root devDependencies, scripts/check-coverage-ratchet.ts replacing the bash+python version, tsc-check-cli pre-push hook, CI updated.
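For readers unfamiliar with the term, a coverage ratchet fails the build when coverage drops below a stored threshold and raises the threshold when coverage improves. The sketch below illustrates that logic only. The function name, the tolerance parameter, and the result shape are hypothetical and are not taken from scripts/check-coverage-ratchet.ts.

```typescript
// Hypothetical sketch of ratchet logic; not the actual script from PR 0.
type RatchetResult =
  | { ok: true; newThreshold: number }
  | { ok: false; shortfall: number };

function checkRatchet(
  current: number, // measured coverage, in percent
  threshold: number, // stored minimum, in percent
  tolerance = 0.1, // assumed allowance for measurement noise
): RatchetResult {
  if (current + tolerance < threshold) {
    return { ok: false, shortfall: threshold - current };
  }
  // Ratchet up only: the stored threshold never decreases.
  return { ok: true, newThreshold: Math.max(threshold, Math.floor(current)) };
}
```

A CI step could exit non-zero when ok is false and write newThreshold back to the stored config otherwise.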
PR 1: Kill setup.sh + shared types + first CJS→ESM TS batch
Security fix. scripts/setup.sh passes NVIDIA_API_KEY on the openshell sandbox create command line, where it is visible to anything that can read process arguments. onboard.js avoids this by going through the gateway's stored credential. The shell path must die.
- shared/types.ts: shared domain types with at least one consumer from day one: SandboxEntry, SandboxRegistry, EndpointType, NemoClawOnboardConfig, PlatformInfo, CredentialKey
- Delete scripts/setup.sh (242 lines)
- Convert 4 core modules (CJS→ESM TS in one step, straight to src/lib/):
  - bin/lib/runner.js → src/lib/runner.ts (rewrite on execa: $$ project helper, delete shellQuote())
  - bin/lib/registry.js → src/lib/registry.ts
  - bin/lib/credentials.js → src/lib/credentials.ts
  - bin/lib/platform.js → src/lib/platform.ts
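To make the shared-types idea concrete, here is a rough sketch of what a slice of shared/types.ts and one consumer might look like. Only the type names SandboxEntry, SandboxRegistry, and EndpointType come from the plan. Every field, every union member, and the signature of registerSandbox are assumptions made for illustration.

```typescript
// Hypothetical sketch. Type names are from the plan; all fields are assumed.
type EndpointType = "cloud" | "local";

interface SandboxEntry {
  name: string;
  endpointType: EndpointType;
  createdAt: string; // ISO 8601 timestamp
}

interface SandboxRegistry {
  sandboxes: Record<string, SandboxEntry>;
}

// Returns a new registry with the entry added, keyed by its name.
function registerSandbox(
  registry: SandboxRegistry,
  entry: SandboxEntry,
): SandboxRegistry {
  return { sandboxes: { ...registry.sandboxes, [entry.name]: entry } };
}

const registry = registerSandbox(
  { sandboxes: {} },
  { name: "foo", endpointType: "local", createdAt: "2025-01-01T00:00:00Z" },
);

// With a typed parameter, the misspelled key from the Motivation section
// is rejected by the compiler instead of silently storing name: undefined:
// registerSandbox(registry, { naem: "foo", endpointType: "local", createdAt: "" });
```

Because both the CLI and the plugin would import the same interface, a field rename becomes a compile error in both projects rather than a runtime mismatch.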
~240 lines of shell deleted, 4 modules converted. Risk: Low.
PR 2: Port uninstall + start-services + debug to TS
All three are shell scripts that nemoclaw.js shells out to via spawnSync('bash', ...). Each becomes a TS module using execa, called directly.
- uninstall.sh (555 lines) → src/lib/uninstall.ts
- scripts/start-services.sh (204 lines) → src/lib/services.ts
- scripts/debug.sh (328 lines) → src/lib/debug.ts
- Convert remaining lib modules: nim, policies, inference-config, local-inference
~1,090 lines of shell → ~620 lines of TS + ~400 lines of tests. Risk: Medium.
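As one example of shell logic that ports cleanly, a PID liveness check (the kind of thing a services module needs before deciding whether a helper is still running) can be written with Node's built-in process.kill and signal 0. This is a sketch under assumptions: the function name is hypothetical and nothing here is taken from start-services.sh.

```typescript
// Hypothetical helper: reports whether a recorded PID still refers to a live process.
function isProcessAlive(pid: number): boolean {
  try {
    // Signal 0 runs the existence and permission checks without delivering a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    // Any other error (typically ESRCH) is treated as "not running".
    return (err as { code?: string }).code === "EPERM";
  }
}
```

A caller would typically read the PID from a pidfile, call this check, and remove the stale file when it returns false.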
PR 3: oclif scaffold + simple commands
The biggest structural change. Scaffold oclif at repo root and migrate the simple commands.
- Add @oclif/core, @oclif/plugin-help, @oclif/plugin-autocomplete, @oclif/plugin-update
- Update tsconfig.cli.json for compiled output (outDir, declaration, drop noEmit)
- Migrate simple commands: list, status, start, stop, debug, uninstall, setup-spark
- Typed flags/args with validation (MEDIUM: Unvalidated Instance Name in deploy() Shell Commands #575), auto-generated --help (feat(cli): add per-command --help for all top-level and sandbox-scoped commands #757), enableJsonFlag (feat(cli): add --json output for list and status commands #753), --verbose/--debug globals (feat: add --verbose / --debug flag for CLI observability #666)
- command_not_found hook for backward compat (nemoclaw <sandbox> <action> still works)
- Port fix-coredns.sh → src/lib/fix-coredns.ts
- Merge install.sh + scripts/install.sh → single installer (~200 lines)
- Convert remaining lib modules: onboard.js, preflight.js, resolve-openshell.js
Old dispatcher (bin/nemoclaw.js) stays as fallback during this PR. Risk: Medium-high.
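The flag and argument validation mentioned above could take the form of a small parse function that rejects unsafe names before they reach any subprocess. The sketch below is illustrative only: the function name, the allowed character set, and the 63-character limit are assumptions, not rules taken from the codebase or from #575.

```typescript
// Hypothetical validator. The character set and length limit are assumptions.
// Accepts 1 to 63 characters: lowercase letters, digits, and interior hyphens.
const INSTANCE_NAME = /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/;

function parseInstanceName(input: string): string {
  if (!INSTANCE_NAME.test(input)) {
    throw new Error(
      `Invalid instance name "${input}": use lowercase letters, digits, and hyphens (max 63 characters)`,
    );
  }
  return input;
}
```

oclif flag and arg definitions accept a custom parse function, which is one place a check like this could be attached so that every command sharing the flag gets the same validation.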
PR 4: oclif sandbox commands + onboard wizard + deploy
- Sandbox-scoped commands under sandbox/ topic: connect, status, logs, destroy, policy add/list
- Migrate deploy (SSH + cloud provider logic)
- Migrate onboard wizard (~900 lines, most complex command) with @inquirer/prompts
- Replace process.exit() calls with thrown errors (~15-20 call sites)
- Centralized error handling in Command base class: secret redaction (security: redact secret patterns from CLI log and error output #664, CRITICAL: NVIDIA API Key Exposed in Process Arguments and Terminal Output #579), structured error classes
Risk: High for onboard wizard. Test interactive flow end-to-end.
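The last two items (thrown errors instead of process.exit(), plus centralized redaction) fit together roughly as sketched below. The class name CliError, the handleError function, and the assumption that secrets look like tokens starting with "nvapi-" are all hypothetical. The real base class would also hook into oclif's own error path, which is not shown here.

```typescript
// Hypothetical error class: carries an exit code instead of calling process.exit().
class CliError extends Error {
  readonly exitCode: number;

  constructor(message: string, exitCode = 1) {
    super(message);
    this.name = "CliError";
    this.exitCode = exitCode;
  }
}

// Assumed secret shape: tokens that start with "nvapi-".
const SECRET_PATTERN = /nvapi-[A-Za-z0-9_-]+/g;

function redact(text: string): string {
  return text.replace(SECRET_PATTERN, "nvapi-[REDACTED]");
}

// Single place where any thrown value becomes a printable message and an exit code.
function handleError(err: unknown): { message: string; exitCode: number } {
  const message = redact(err instanceof Error ? err.message : String(err));
  const exitCode = err instanceof CliError ? err.exitCode : 1;
  return { message, exitCode };
}
```

With this shape, library code only throws, and the decision about what to print and which code to exit with lives in one function that is easy to unit test.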
PR 5: E2E test migration + completions + cleanup
- Delete old dispatcher (bin/nemoclaw.js, 515 lines) and scripts/lib/runtime.sh (229 lines)
- Shell completions via @oclif/plugin-autocomplete (bash/zsh/PowerShell; fish needs custom work) (feat: add shell completion for nemoclaw CLI #155)
- Auto-generated docs via oclif readme (ci: add CLI/docs drift test for documented commands and slash subcommands #756, docs(cli): update command reference to match implemented slash commands and host commands #758)
- All e2e tests → vitest TS (execa + waitFor helper replaces bash expect scripts): e2e-test.sh, test-full-e2e.sh, test-double-onboard.sh, e2e-gateway-isolation.sh, e2e-cloud-experimental/ tree (~2,800 lines), brev-e2e.test.js
- Delete legacy installer fallback, shell originals
- strict: true everywhere, allowJs: false, full coverage ratchet
Risk: Medium. Run old and new e2e in parallel for one PR cycle.
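The waitFor helper is only named in the plan, so the following is a guess at its general shape rather than its actual signature: poll a probe until it yields a value, and fail with a clear error once a timeout passes. The option names and defaults are assumptions.

```typescript
// Hypothetical waitFor helper: polls `probe` until it returns a value other
// than undefined, or throws once the timeout has elapsed.
async function waitFor<T>(
  probe: () => T | undefined | Promise<T | undefined>,
  { timeoutMs = 5_000, intervalMs = 100 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await probe();
    if (value !== undefined) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor timed out after ${timeoutMs} ms`);
    }
    // Sleep before the next attempt.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In an e2e test, the probe would typically inspect accumulated subprocess output and return the matching line once a prompt or status message appears, which covers the role that expect scripts play in the bash versions.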
End state
pie title Lines of code by language (after)
"TypeScript (CLI commands)" : 1500
"TypeScript (CLI lib)" : 3000
"TypeScript (plugin)" : 4800
"TypeScript (shared)" : 80
"TypeScript (e2e tests)" : 700
"Shell (stays)" : 1300
"Python (1 script)" : 100
| Before | After |
|---|---|
| ~9,600 lines of shell | ~1,300 (curl-pipe installers, cloud provisioning, CI utilities) |
| ~4,200 lines of untyped CJS | 0 |
| ~4,100 lines of shell e2e tests | vitest TS with waitFor helper |
| No CLI framework | oclif (auto-help, completions, --json, flag validation) |
| No shared types | shared/types.ts — one definition, two consumers |
| bash -c + shellQuote() | execa array args (no shell, no injection) |
Shell that stays (and why)
| Script | Lines | Why |
|---|---|---|
| install.sh | ~200 | Must be curl-pipeable; no Node.js available yet |
| uninstall.sh | ~30 | Stub for users who curl-pipe the uninstaller |
| brev-setup.sh | ~167 | Linux VM provisioning (apt-get, sudo, systemd) |
| nemoclaw-start.sh | ~220 | Sandbox entrypoint with inline Python heredocs |
| scripts/setup-spark.sh | ~142 | DGX Spark cgroup/Docker fix, runs under sudo -E |
| Other CI/dev utilities | ~540 | check-spdx-headers.sh, backup-workspace.sh, walkthrough.sh, etc. |
Subprocess execution: execa replaces child_process
All subprocess calls move from string-concatenated bash -c commands to execa array args. A project-wide $$ helper in src/lib/runner.ts provides defaults.
// Before: string concat + bash -c + shellQuote
run(`docker rm -f ${shellQuote(container)} 2>/dev/null || true`, { ignoreError: true });
const out = runCapture(`docker ps --format '{{.Names}}'`);
// After: execa — no shell, no quoting, typed
await $$({ reject: false })`docker rm -f ${container}`;
const { stdout } = await $$({ stdio: "pipe" })`docker ps --format ${"{{.Names}}"}`;
Conflict management
Must merge before PR 1 (large changes to files we're rewriting):
- fix: harden installer and onboard resiliency #770 (1,763 additions to onboard.js)
- fix: add binaries to policy presets + safe config I/O (#676, #692, #606) #782 + fix: add local-inference policy preset for Ollama/vLLM host access (#693) #781 (touch runner.js, policies.js, credentials.js, registry.js)
- feat: add Podman as supported container runtime for macOS and Linux #819 (Podman support in onboard.js + platform.js)
Should merge if ready (smaller onboard.js changes):
- fix: bake CHAT_UI_URL into Docker build for remote dashboard access #812, fix: add fork bomb protection via process limits #811, fix: remove gcc and netcat from sandbox image (#807, #808) #810, fix: onboard fails on GPUs with insufficient VRAM for local NIM #836, fix(onboard): pre-select suggested policy presets in visual display #777, fix(onboard): exclude .venv and caches from sandbox build context #776
Superseded by oclif migration (close before PR 3):
- feat(cli): add shell completion for bash, zsh, and fish #895, feat: add shell completion for nemoclaw CLI #160 (completions)
- docs(cli): sync command reference with implemented CLI surface #901, docs(policy): fix openshell policy set CLI examples #848 (docs sync)
- feat(cli): add --json flag to nemoclaw list #182 (--json)
- fix(cli): use openshell top-level logs command #350 (routing)
- fix(cli): validate deploy instance names before setup prompts #604 (validation)
- feat(cli): add self-update command #644 (self-update)
- fix(security): redact secret patterns from CLI log and error output #672, security: redact secret patterns from CLI log and error output #794 (secret redaction)
- fix(cli): respect defaultValue in promptOrDefault interactive mode (Fixes #360) #631, fix: use correct --type flag for brev create command #477, fix: improve activeModelEntries consistency and validation #853, fix(blueprint): add error handling for JSON.parse in state, config, and migration loaders #638, Add error handling for gateway startup #398 (error handling)
See #909 for the full list of 24 issues addressed and 17 PRs superseded.
UX compatibility
The nemoclaw <sandbox-name> <action> syntax is preserved via a command_not_found hook — no breaking changes to existing workflows or scripts. The new nemoclaw sandbox <action> <name> syntax is also available.
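The core of the backward-compat hook is an argv rewrite from the legacy word order to the new one. The sketch below shows that rewrite as a pure function. The function name and the idea of checking against a set of known sandbox names are assumptions, and the action list simply mirrors the files under src/commands/sandbox/. The nested policy commands are left out because their legacy form is not described here.

```typescript
// Hypothetical rewrite for a command_not_found hook:
//   legacy: nemoclaw <sandbox> <action> [...rest]
//   new:    nemoclaw sandbox <action> <sandbox> [...rest]
const SANDBOX_ACTIONS: ReadonlySet<string> = new Set([
  "connect",
  "status",
  "logs",
  "destroy",
]);

function rewriteLegacyArgv(
  argv: string[],
  knownSandboxes: ReadonlySet<string>,
): string[] | undefined {
  const [name, action, ...rest] = argv;
  if (name === undefined || action === undefined) return undefined;
  // Only rewrite when the first word is a registered sandbox and the second a
  // known action; anything else falls through to the normal "not found" path.
  if (!knownSandboxes.has(name) || !SANDBOX_ACTIONS.has(action)) return undefined;
  return ["sandbox", action, name, ...rest];
}
```

Returning undefined for unrecognized input keeps genuine typos on the standard error path instead of misrouting them to a sandbox command.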
/cc @cv @HagegeR @prekshivyas