Skip to content

fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker#1056

Merged
thedotmack merged 8 commits intothedotmack:mainfrom
rodboev:fix/hook-resilience-worker-lifecycle
Feb 11, 2026
Merged

fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker#1056
thedotmack merged 8 commits intothedotmack:mainfrom
rodboev:fix/hook-resilience-worker-lifecycle

Conversation

@rodboev
Copy link
Contributor

@rodboev rodboev commented Feb 10, 2026

Summary

  • 87% faster startup when worker is dead — worst-case recovery drops from ~60s to ~8s by reducing health check timeout (30s→3s), post-spawn wait (30s→5s), and port-in-use wait (15s→3s), plus proactive stale PID cleanup
  • Graceful degradation across all hooks — worker unavailability now returns exit 0 instead of BLOCKING_ERROR, preventing error dialogs during sleep/hibernate recovery
  • Error classifier distinguishes transport/5xx (graceful) from 4xx/programming errors (blocking), ensuring real bugs still surface while transient failures degrade silently
  • Fix daemon silent death — SIGHUP handler + setsid + unhandled error guards prevent the worker daemon from dying silently after parent hook process exits

Performance Metrics

Scenario Before After Improvement
Worker dead, stale PID file ~60s (30s health timeout + 30s spawn wait) ~8s (instant PID cleanup + 3s check + 5s spawn) 87% faster
Worker dead, no PID file ~60s (30s timeout + 30s wait) ~8s (3s timeout + 5s wait) 87% faster
Port occupied, worker unresponsive ~15s wait ~3s wait 80% faster
Worker already healthy ~1s ~1s No change needed
Stop hook with dead worker BLOCKING_ERROR visible to user Silent graceful exit 100% fewer visible errors
Daemon dies from SIGHUP after spawn Always (default signal = terminate) Never (SIGHUP ignored in daemon mode) Root cause fix

Issues Closed

Issue Title Fix
#957 SessionStart hook chain unnecessarily slow Reduced ensureWorkerStarted() timeouts from 30s→3-5s; stale PID cleanup avoids waiting for dead processes
#923 Slow Plugin Loading Time on Claude Startup Root cause was the 30s HEALTH_CHECK and 30s POST_SPAWN_WAIT constants; now 3s and 5s respectively
#984 Stop hook fails with "Unknown event type: session-complete" getEventHandler() returns no-op handler for unknown events instead of throwing
#987 Stop hook causes infinite session loop on summarize failure Error classifier treats summarize HTTP failures as graceful (exit 0), preventing BLOCKING_ERROR from retriggering
#1042 ENOENT error in stop hooks: race condition reading package.json getPluginVersion() wrapped in try-catch, returns 'unknown' on ENOENT/EBUSY; version check skipped when unknown

Relationship to Open PRs

Changes (17 files)

Core infrastructure

File Change
src/shared/hook-constants.ts HEALTH_CHECK 30s→3s, add POST_SPAWN_WAIT (5s), PORT_IN_USE_WAIT (3s)
src/services/infrastructure/ProcessManager.ts Add isProcessAlive() (EPERM-aware), cleanStalePidFile(), spawnDaemon() uses setsid on Linux
src/shared/worker-utils.ts Wrap getPluginVersion() in try-catch (#1042), add CLAUDE_MEM_HEALTH_TIMEOUT_MS env override
src/services/worker-service.ts Call cleanStalePidFile() at top of ensureWorkerStarted(), use named constants, SIGHUP handler (ignore in daemon mode), unhandled error guards in daemon mode
src/cli/hook-command.ts Add isWorkerUnavailableError() classifier (transport/5xx→exit 0, 4xx→exit 2)

Handler graceful degradation

File Change
src/cli/handlers/index.ts Return no-op handler for unknown event types (#984)
src/cli/handlers/context.ts Wrap fetch in try-catch, return empty context on failure
src/cli/handlers/observation.ts Return graceful on !response.ok instead of throw
src/cli/handlers/file-edit.ts Same pattern as observation
src/cli/handlers/user-message.ts Add worker check + wrap fetch in try-catch
src/cli/handlers/session-complete.ts Handle ensureWorkerRunning() return value

Tests

File Change
tests/hook-constants.test.ts Updated assertions for new constant values
tests/infrastructure/process-manager.test.ts Tests for isProcessAlive(), cleanStalePidFile(), EPERM handling, spawnDaemon setsid, SIGHUP listener
tests/hook-command.test.ts (new) 20 tests for error classifier — transport vs 4xx vs 5xx vs programming errors

Built bundles

File Change
plugin/scripts/worker-service.cjs Rebuilt
plugin/scripts/mcp-server.cjs Rebuilt

Design Decisions

  • ensureWorkerStarted() kept in hook command: Peer review (Gemini + Codex) confirmed this is the only code path that starts the worker — removing it would break auto-recovery. The reduced timeouts (5s max vs 30s) eliminate the double-penalty concern.
  • Two platform multiplier functions preserved: getTimeout() (1.5x) for hook-side fast path, getPlatformTimeout() (2.0x) for worker-side socket operations — clarifying comment added.
  • Conservative error default: Unknown errors map to exit 2 (blocking) rather than exit 0, so new bug categories surface rather than silently failing.
  • Env var override: CLAUDE_MEM_HEALTH_TIMEOUT_MS lets users on unusually slow systems increase the health check timeout without code changes (validated: 500ms–300000ms range).
  • Belt-and-suspenders for daemon survival: Three layers prevent daemon death — (1) SIGHUP handler ignores the signal in daemon mode, (2) setsid creates a new session so SIGHUP is never delivered, (3) uncaughtException/unhandledRejection handlers prevent silent crashes from any source.

Test plan

  • bun test — 935 pass, 3 skip (1 pre-existing timeout in openclaw SSE test, unrelated)
  • npm run build — builds successfully
  • npm run build-and-sync — synced to marketplace, worker restarted
  • Manual: kill worker, start new session → recovers in <5s (observed 1.6s to full recovery)
  • Manual: start session with worker already running → no delay
  • Manual: /exit session → stop hooks complete without BLOCKING_ERROR
  • Manual: CLAUDE_MEM_HEALTH_TIMEOUT_MS=10000 claude → verify override works
  • Manual: bun worker-service.cjs start → daemon survives parent exit (port stays bound)
  • Manual: kill -HUP <worker-pid> → worker logs "Ignoring SIGHUP" and stays alive
  • Manual: kill worker, wait 5 min → watchdog cron restarts it

Lightweight script for Claude Code statusLineCommand integration.
Returns per-project observation and prompt counts via direct SQLite
read (~15ms, no HTTP, no worker dependency).

Counts are scoped with WHERE project = ? to prevent inflated totals
from cross-project observations. Supports CLAUDE_MEM_DATA_DIR from
settings.json for custom data directory configurations.
- Check CLAUDE_MEM_DATA_DIR env var before settings.json (Greptile)
- Derive project before DB check for consistent output (Greptile)
- Include project in error fallback output (Greptile)
- Set executable permission for shebang compatibility (Greptile)
thedotmack#923, thedotmack#984, thedotmack#987, thedotmack#1042)

Reduce timeouts to eliminate 10-30s startup delay when worker is dead
(common on WSL2 after hibernate). Add stale PID detection, graceful
error handling across all handlers, and error classification that
distinguishes worker unavailability from handler bugs.

- HEALTH_CHECK 30s→3s, new POST_SPAWN_WAIT (5s), PORT_IN_USE_WAIT (3s)
- isProcessAlive() with EPERM handling, cleanStalePidFile()
- getPluginVersion() try-catch for shutdown race (thedotmack#1042)
- isWorkerUnavailableError: transport+5xx+429→exit 0, 4xx→exit 2
- No-op handler for unknown event types (thedotmack#984)
- Wrap all handler fetch calls in try-catch for graceful degradation
- CLAUDE_MEM_HEALTH_TIMEOUT_MS env var override with validation
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

This PR delivers significant performance improvements and graceful degradation for worker lifecycle management. The changes reduce worst-case recovery time from ~60s to ~8s (87% faster) through three key optimizations: reduced timeout constants (HEALTH_CHECK 30s→3s, POST_SPAWN_WAIT 30s→5s, PORT_IN_USE_WAIT 15s→3s), proactive stale PID cleanup via cleanStalePidFile(), and graceful error handling across all hooks.

Key improvements:

Implementation quality:

  • Comprehensive test coverage with 20 new tests for error classification
  • Conservative defaults (unknown errors → blocking) ensure new bug categories surface
  • CLAUDE_MEM_HEALTH_TIMEOUT_MS env override provides escape hatch for slow systems
  • All handlers now gracefully degrade on worker unavailability instead of blocking

Edge cases handled:

  • EPERM handling in isProcessAlive() for multi-user environments
  • ENOENT/EBUSY handling in getPluginVersion() for shutdown race conditions
  • Unknown event types return no-op handler instead of throwing
  • Version check skipped when plugin version is 'unknown'

Confidence Score: 5/5

  • Safe to merge — well-tested performance optimization with comprehensive error handling
  • High confidence due to: (1) thorough test coverage with 20 new error classifier tests and process manager tests, (2) conservative error handling defaults (unknown errors → blocking), (3) targeted fixes for 5 specific issues with clear rationale, (4) reduced timeouts based on empirical data (healthy worker <100ms), (5) graceful degradation preserves user experience during transient failures
  • No files require special attention

Important Files Changed

Filename Overview
src/shared/hook-constants.ts Reduced health check timeout from 30s to 3s and added POST_SPAWN_WAIT (5s) and PORT_IN_USE_WAIT (3s) constants for faster worker recovery
src/services/infrastructure/ProcessManager.ts Added isProcessAlive() using signal-0 idiom with EPERM handling and cleanStalePidFile() for proactive cleanup of dead worker PIDs
src/shared/worker-utils.ts Added CLAUDE_MEM_HEALTH_TIMEOUT_MS env override (500-300000ms) and wrapped getPluginVersion() in try-catch to handle ENOENT/EBUSY during shutdown
src/services/worker-service.ts Called cleanStalePidFile() at start of ensureWorkerStarted() and replaced hardcoded timeouts with HOOK_TIMEOUTS constants
src/cli/hook-command.ts Added isWorkerUnavailableError() classifier to distinguish transport/5xx errors (exit 0) from 4xx/programming errors (exit 2)
src/cli/handlers/index.ts Changed getEventHandler() to return no-op handler for unknown event types instead of throwing (fixes #984)
src/cli/handlers/context.ts Wrapped fetch in try-catch to return empty context on worker failure instead of blocking session start
src/cli/handlers/observation.ts Wrapped fetch in try-catch to skip observation storage gracefully on worker failure instead of throwing

Sequence Diagram

sequenceDiagram
    participant Hook as Hook Command
    participant EWS as ensureWorkerStarted()
    participant PID as PID File
    participant Worker as Worker Process
    participant Health as Health Check

    Note over Hook,Health: Dead Worker Recovery (87% faster)

    Hook->>EWS: Start worker
    EWS->>PID: cleanStalePidFile()
    PID->>PID: Read PID file
    PID->>Worker: isProcessAlive(pid) [signal 0]
    Worker-->>PID: ESRCH (dead)
    PID->>PID: Remove stale PID file
    Note right of PID: Instant cleanup<br/>(was: wait 30s)

    EWS->>Health: waitForHealth(1000ms)
    Health-->>EWS: Not healthy
    
    EWS->>EWS: Check port in use
    EWS->>Health: waitForHealth(3s)
    Note right of Health: Reduced from 15s
    Health-->>EWS: Port free

    EWS->>Worker: spawnDaemon()
    Worker->>Worker: Start process
    Worker->>PID: Write PID file (after listen())
    
    EWS->>Health: waitForHealth(5s)
    Note right of Health: Reduced from 30s
    Health->>Worker: GET /health
    Worker-->>Health: 200 OK
    Health-->>EWS: Healthy
    
    EWS-->>Hook: Worker ready (8s total)
    Note over Hook,Health: Before: ~60s<br/>After: ~8s<br/>87% improvement
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

17 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

rodboev and others added 2 commits February 10, 2026 17:36
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Address Greptile review: add comment noting that TypeError('fetch failed')
is already handled by transport patterns before the instanceof check.
@rodboev
Copy link
Contributor Author

rodboev commented Feb 10, 2026

Re: Greptile's review comment on TypeError order dependency:

Good catch — applied in 22683f6. The order dependency is subtle: TypeError('fetch failed') hits the transport pattern check on line 43 before reaching the instanceof TypeError check on line 59, so generic TypeErrors correctly fall through to the blocking path.

thedotmack and others added 3 commits February 10, 2026 23:37
# Conflicts:
#	plugin/scripts/mcp-server.cjs
#	plugin/scripts/worker-service.cjs
Missing return statement and closing brace in the programming errors
check caused a build failure after merging main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: registerSignalHandlers() handled SIGTERM/SIGINT but not
SIGHUP. When the parent hook process exits, the kernel sends SIGHUP
to the daemon, causing immediate termination (default signal action).

Belt-and-suspenders fix:
1. SIGHUP handler: ignore in daemon mode, graceful shutdown otherwise
2. setsid: spawn daemon in new session on Linux (prevents SIGHUP delivery)
3. Global unhandledRejection/uncaughtException guards in daemon mode
@thedotmack thedotmack merged commit 7757966 into thedotmack:main Feb 11, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants