Worker silently fails init when 'node' not in PATH, reports status:ok but stays initialized:false forever #1289

@Dee-0503

Problem

worker-cli restart has a hard-coded 10-second readiness timeout. When the worker cold-starts (loading the embedding model, etc.), initialization often takes 10–25 seconds. As a result, the CLI reports "Failed to restart: Readiness check timed out after 10000ms" even though the worker may eventually initialize successfully.

Impact

Combined with health-check or scheduled-restart scripts that rely on the CLI's exit code, this creates a cascading failure loop:

  1. Worker loses initialization state (e.g., after macOS sleep/wake)
  2. Health check detects initialized=false, calls worker-cli restart
  3. CLI times out at 10s → reports failure
  4. The worker process is replaced, but the new process also can't initialize within 10s
  5. Next hourly health check: same result → restart → timeout → repeat
  6. The loop persisted for ~20 hours until manual intervention

Evidence from logs

[2026-03-05 15:02:22] ALERT: Worker not initialized after 650min (initialized=false), restarting
Failed to restart: Readiness check timed out after 10000ms
[2026-03-05 16:02:13] ALERT: Worker not initialized after 59min (initialized=false), restarting
Failed to restart: Readiness check timed out after 10000ms
[2026-03-05 19:00:01] ALERT: Worker not initialized after 177min (initialized=false), restarting
Failed to restart: Readiness check timed out after 10000ms
[2026-03-05 20:00:00] ALERT: Worker not initialized after 59min (initialized=false), restarting
Failed to restart: Readiness check timed out after 10000ms

Manual restart at 00:01 succeeded; the worker initialized in ~23 seconds.

Suggestion

  1. Make the readiness timeout configurable (e.g., worker-cli restart --timeout 30000), defaulting to 30s
  2. Or, at minimum, increase the hard-coded default to 30s, since 10s is too aggressive for cold starts that load the embedding model
  3. Consider making worker-cli restart return success if the worker process was successfully spawned, and add a separate worker-cli wait-ready --timeout <ms> command for scripts that need to verify initialization
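
To make suggestions 1 and 3 concrete, here is a minimal sketch of a --timeout flag with a 30s default and a polling wait-ready loop. It is not taken from the worker-cli source: parseTimeoutMs, waitReady, ReadinessProbe, and the 500ms poll interval are all hypothetical names and values, and the probe stands in for whatever check the CLI actually uses to read initialized=true.

```typescript
// Hypothetical sketch: none of these names come from the worker-cli source.
type ReadinessProbe = () => Promise<boolean>;

// Proposed default from the suggestions above (assumption: milliseconds).
const DEFAULT_TIMEOUT_MS = 30_000;

// Parse an optional `--timeout <ms>` flag; fall back to the default
// when the flag is absent, has no value, or is not a positive number.
function parseTimeoutMs(argv: string[], fallback: number = DEFAULT_TIMEOUT_MS): number {
  const i = argv.indexOf("--timeout");
  if (i === -1 || i + 1 >= argv.length) return fallback;
  const value = Number(argv[i + 1]);
  return Number.isFinite(value) && value > 0 ? value : fallback;
}

// Poll the probe until it reports ready or the deadline would be exceeded.
// Returns true if the worker became ready in time, false on timeout.
async function waitReady(
  probe: ReadinessProbe,
  timeoutMs: number,
  intervalMs: number = 500,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    try {
      if (await probe()) return true;
    } catch {
      // Treat probe errors (e.g. worker not listening yet) as "not ready".
    }
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With this split, a restart command could exit 0 as soon as the process is spawned, while a separate wait-ready command exits non-zero only when waitReady returns false, so health-check scripts can choose which guarantee they need.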

Environment

  • claude-mem version: 10.0.6
  • Platform: macOS (darwin 24.6.0)
  • Runtime: bun

— cee

Labels: root:worker-lifecycle (Worker startup, shutdown, zombie processes)
