fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker#1056
Lightweight script for Claude Code statusLineCommand integration. Returns per-project observation and prompt counts via direct SQLite read (~15ms, no HTTP, no worker dependency). Counts are scoped with WHERE project = ? to prevent inflated totals from cross-project observations. Supports CLAUDE_MEM_DATA_DIR from settings.json for custom data directory configurations.
- Check CLAUDE_MEM_DATA_DIR env var before settings.json (Greptile)
- Derive project before DB check for consistent output (Greptile)
- Include project in error fallback output (Greptile)
- Set executable permission for shebang compatibility (Greptile)
thedotmack#923, thedotmack#984, thedotmack#987, thedotmack#1042)
Reduce timeouts to eliminate 10-30s startup delay when worker is dead (common on WSL2 after hibernate). Add stale PID detection, graceful error handling across all handlers, and error classification that distinguishes worker unavailability from handler bugs.
- HEALTH_CHECK 30s→3s, new POST_SPAWN_WAIT (5s), PORT_IN_USE_WAIT (3s)
- isProcessAlive() with EPERM handling, cleanStalePidFile()
- getPluginVersion() try-catch for shutdown race (thedotmack#1042)
- isWorkerUnavailableError: transport+5xx+429→exit 0, 4xx→exit 2
- No-op handler for unknown event types (thedotmack#984)
- Wrap all handler fetch calls in try-catch for graceful degradation
- CLAUDE_MEM_HEALTH_TIMEOUT_MS env var override with validation
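The exit-code rule in the commit message (transport errors, 5xx, and 429 map to exit 0; 4xx maps to exit 2) can be illustrated with a minimal sketch. The transport pattern list, the `status` field on the error object, and the `exitCodeFor` helper are assumptions made for illustration and may not match the real `isWorkerUnavailableError` in `src/cli/hook-command.ts`. Treating every other error as a handler bug (exit 2) is also an assumption.

```typescript
// Assumed transport-failure markers; the real pattern list may differ.
const TRANSPORT_PATTERNS = ["ECONNREFUSED", "ECONNRESET", "ETIMEDOUT", "fetch failed"];

// Hypothetical: pull an HTTP status off an error-like value if it carries one.
function statusOf(error: unknown): number | undefined {
  if (typeof error !== "object" || error === null) return undefined;
  const status = (error as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

function isWorkerUnavailableError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  // Transport patterns are checked first, so TypeError("fetch failed")
  // is classified here before any later type-based check.
  if (TRANSPORT_PATTERNS.some((pattern) => message.includes(pattern))) return true;
  const status = statusOf(error);
  if (status !== undefined) {
    // Server-side failures and rate limiting count as "worker unavailable".
    return status >= 500 || status === 429;
  }
  // Assumption: anything else (other 4xx, plain TypeError, ...) is a handler bug.
  return false;
}

// Hypothetical helper: exit 0 lets the hook degrade gracefully, exit 2 surfaces the bug.
function exitCodeFor(error: unknown): number {
  return isWorkerUnavailableError(error) ? 0 : 2;
}
```

Checking message patterns before the status or type of the error keeps a dead worker from being reported as a programming error, which is the ordering concern raised in the review.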
Greptile Overview
Greptile Summary
This PR delivers significant performance improvements and graceful degradation for worker lifecycle management. The changes reduce worst-case recovery time from ~60s to ~8s (87% faster) through three key optimizations: reduced timeout constants (HEALTH_CHECK 30s→3s, POST_SPAWN_WAIT 30s→5s, PORT_IN_USE_WAIT 15s→3s), proactive stale PID cleanup via
Key improvements:
Implementation quality:
Edge cases handled:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Hook as Hook Command
participant EWS as ensureWorkerStarted()
participant PID as PID File
participant Worker as Worker Process
participant Health as Health Check
Note over Hook,Health: Dead Worker Recovery (87% faster)
Hook->>EWS: Start worker
EWS->>PID: cleanStalePidFile()
PID->>PID: Read PID file
PID->>Worker: isProcessAlive(pid) [signal 0]
Worker-->>PID: ESRCH (dead)
PID->>PID: Remove stale PID file
Note right of PID: Instant cleanup<br/>(was: wait 30s)
EWS->>Health: waitForHealth(1000ms)
Health-->>EWS: Not healthy
EWS->>EWS: Check port in use
EWS->>Health: waitForHealth(3s)
Note right of Health: Reduced from 15s
Health-->>EWS: Port free
EWS->>Worker: spawnDaemon()
Worker->>Worker: Start process
Worker->>PID: Write PID file (after listen())
EWS->>Health: waitForHealth(5s)
Note right of Health: Reduced from 30s
Health->>Worker: GET /health
Worker-->>Health: 200 OK
Health-->>EWS: Healthy
EWS-->>Hook: Worker ready (8s total)
Note over Hook,Health: Before: ~60s<br/>After: ~8s<br/>87% improvement
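The stale-PID step at the top of the diagram (signal 0, ESRCH, remove the file) can be sketched roughly as below. This is a minimal illustration, not the code in `ProcessManager.ts`: the PID file is assumed to contain a bare integer, and the boolean return value and demo file name are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Signal 0 performs the existence/permission check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false; // reject NaN, 0, negatives
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user, so it is alive.
    // ESRCH (or anything else) is treated as dead.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns true when a stale PID file was found and removed.
// Assumption: the file holds a bare integer PID.
function cleanStalePidFile(pidFilePath: string): boolean {
  let raw: string;
  try {
    raw = fs.readFileSync(pidFilePath, "utf8");
  } catch {
    return false; // no readable PID file, nothing to clean
  }
  const pid = Number.parseInt(raw.trim(), 10);
  if (isProcessAlive(pid)) return false;
  fs.rmSync(pidFilePath, { force: true });
  return true;
}

// Usage: a PID file naming the current (live) process is left in place.
const demoPidFile = path.join(os.tmpdir(), `claude-mem-demo-${process.pid}.pid`);
fs.writeFileSync(demoPidFile, String(process.pid));
console.log(cleanStalePidFile(demoPidFile));
fs.rmSync(demoPidFile, { force: true });
```

Treating EPERM as "alive" matters because a permission failure still proves that a process with that PID exists, so deleting the file in that case could remove the record of a running worker.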
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Address Greptile review: add comment noting that TypeError('fetch failed')
is already handled by transport patterns before the instanceof check.
Re: Greptile's review comment on TypeError order dependency: Good catch, applied in 22683f6. The order dependency is subtle:
# Conflicts:
#   plugin/scripts/mcp-server.cjs
#   plugin/scripts/worker-service.cjs
Missing return statement and closing brace in the programming errors check caused a build failure after merging main.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: registerSignalHandlers() handled SIGTERM/SIGINT but not SIGHUP. When the parent hook process exits, the kernel sends SIGHUP to the daemon, causing immediate termination (default signal action).
Belt-and-suspenders fix:
1. SIGHUP handler: ignore in daemon mode, graceful shutdown otherwise
2. setsid: spawn daemon in new session on Linux (prevents SIGHUP delivery)
3. Global unhandledRejection/uncaughtException guards in daemon mode
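The first item of the fix (ignore SIGHUP in daemon mode, shut down gracefully otherwise) might look something like the sketch below. The function names, the `isDaemon` flag, and the return values are hypothetical; only the behavior and the "Ignoring SIGHUP" log text come from this PR.

```typescript
type SighupAction = "ignored" | "shutdown";

// Builds the listener separately so the decision can be exercised without
// sending a real signal. `shutdown` stands in for the worker's graceful-stop routine.
function makeSighupHandler(isDaemon: boolean, shutdown: () => void): () => SighupAction {
  return () => {
    if (isDaemon) {
      // A daemon must outlive the hook process that spawned it.
      console.log("Ignoring SIGHUP (daemon mode)");
      return "ignored";
    }
    shutdown();
    return "shutdown";
  };
}

// Installing any listener replaces Node's default SIGHUP action (terminate).
function registerSighupHandler(isDaemon: boolean, shutdown: () => void): void {
  process.on("SIGHUP", makeSighupHandler(isDaemon, shutdown));
}
```

Without a listener, Node exits on SIGHUP, which matches the root cause described above: the daemon died as soon as its parent hook process went away.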
Summary
Performance Metrics
Issues Closed
- ensureWorkerStarted() timeouts from 30s→3-5s; stale PID cleanup avoids waiting for dead processes
- getEventHandler() returns no-op handler for unknown events instead of throwing
- getPluginVersion() wrapped in try-catch, returns 'unknown' on ENOENT/EBUSY; version check skipped when unknown
Relationship to Open PRs
ensureWorkerStarted(): our timeout reductions make their auto-restart path faster too.
Changes (17 files)
Core infrastructure
- src/shared/hook-constants.ts
- src/services/infrastructure/ProcessManager.ts: isProcessAlive() (EPERM-aware), cleanStalePidFile(), spawnDaemon() uses setsid on Linux
- src/shared/worker-utils.ts: getPluginVersion() in try-catch (#1042), add CLAUDE_MEM_HEALTH_TIMEOUT_MS env override
- src/services/worker-service.ts: cleanStalePidFile() at top of ensureWorkerStarted(), use named constants, SIGHUP handler (ignore in daemon mode), unhandled error guards in daemon mode
- src/cli/hook-command.ts: isWorkerUnavailableError() classifier (transport/5xx→exit 0, 4xx→exit 2)
Handler graceful degradation
- src/cli/handlers/index.ts
- src/cli/handlers/context.ts
- src/cli/handlers/observation.ts: !response.ok instead of throw
- src/cli/handlers/file-edit.ts
- src/cli/handlers/user-message.ts
- src/cli/handlers/session-complete.ts: ensureWorkerRunning() return value
Tests
- tests/hook-constants.test.ts
- tests/infrastructure/process-manager.test.ts: isProcessAlive(), cleanStalePidFile(), EPERM handling, spawnDaemon setsid, SIGHUP listener
- tests/hook-command.test.ts (new)
Built bundles
- plugin/scripts/worker-service.cjs
- plugin/scripts/mcp-server.cjs
Design Decisions
- ensureWorkerStarted() kept in hook command: Peer review (Gemini + Codex) confirmed this is the only code path that starts the worker; removing it would break auto-recovery. The reduced timeouts (5s max vs 30s) eliminate the double-penalty concern.
- getTimeout() (1.5x) for hook-side fast path, getPlatformTimeout() (2.0x) for worker-side socket operations; clarifying comment added.
- CLAUDE_MEM_HEALTH_TIMEOUT_MS lets users on unusually slow systems increase the health check timeout without code changes (validated: 500ms–300000ms range).
- setsid creates a new session so SIGHUP is never delivered, (3) uncaughtException/unhandledRejection handlers prevent silent crashes from any source.
Test plan
- bun test: 935 pass, 3 skip (1 pre-existing timeout in openclaw SSE test, unrelated)
- npm run build: builds successfully
- npm run build-and-sync: synced to marketplace, worker restarted
- /exit session → stop hooks complete without BLOCKING_ERROR
- CLAUDE_MEM_HEALTH_TIMEOUT_MS=10000 claude → verify override works
- bun worker-service.cjs start → daemon survives parent exit (port stays bound)
- kill -HUP <worker-pid> → worker logs "Ignoring SIGHUP" and stays alive
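The CLAUDE_MEM_HEALTH_TIMEOUT_MS override exercised in the test plan could be validated along these lines. The 3s default and the 500ms–300000ms range come from this PR; the function name, the constant names, and the choice to fall back to the default on an invalid value are assumptions for illustration.

```typescript
// Values taken from the PR text: default health check 3s, accepted range 500ms to 300000ms.
const DEFAULT_HEALTH_CHECK_MS = 3_000;
const MIN_HEALTH_CHECK_MS = 500;
const MAX_HEALTH_CHECK_MS = 300_000;

// Hypothetical helper: resolve the effective timeout from the raw env value.
// Assumption: invalid or out-of-range input falls back to the default.
function resolveHealthTimeout(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_HEALTH_CHECK_MS;
  const parsed = Number(raw);
  const inRange = parsed >= MIN_HEALTH_CHECK_MS && parsed <= MAX_HEALTH_CHECK_MS;
  if (!Number.isInteger(parsed) || !inRange) return DEFAULT_HEALTH_CHECK_MS;
  return parsed;
}

// Usage: read the override once at startup.
const healthTimeoutMs = resolveHealthTimeout(process.env.CLAUDE_MEM_HEALTH_TIMEOUT_MS);
```

Bounding the value guards against typos such as a timeout entered in seconds instead of milliseconds, which would otherwise make every health check fail almost immediately.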