fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker by rodboev · Pull Request #1056 · thedotmack/claude-mem

rodboev · 2026-02-10T22:30:37Z

Summary

87% faster startup when worker is dead — worst-case recovery drops from ~60s to ~8s by reducing health check timeout (30s→3s), post-spawn wait (30s→5s), and port-in-use wait (15s→3s), plus proactive stale PID cleanup
Graceful degradation across all hooks — worker unavailability now returns exit 0 instead of BLOCKING_ERROR, preventing error dialogs during sleep/hibernate recovery
Error classifier distinguishes transport/5xx (graceful) from 4xx/programming errors (blocking), ensuring real bugs still surface while transient failures degrade silently
Fix daemon silent death — SIGHUP handler + setsid + unhandled error guards prevent the worker daemon from dying silently after parent hook process exits

Performance Metrics

Scenario	Before	After	Improvement
Worker dead, stale PID file	~60s (30s health timeout + 30s spawn wait)	~8s (instant PID cleanup + 3s check + 5s spawn)	87% faster
Worker dead, no PID file	~60s (30s timeout + 30s wait)	~8s (3s timeout + 5s wait)	87% faster
Port occupied, worker unresponsive	~15s wait	~3s wait	80% faster
Worker already healthy	~1s	~1s	No change needed
Stop hook with dead worker	BLOCKING_ERROR visible to user	Silent graceful exit	100% fewer visible errors
Daemon dies from SIGHUP after spawn	Always (default signal = terminate)	Never (SIGHUP ignored in daemon mode)	Root cause fix

Issues Closed

Issue	Title	Fix
#957	SessionStart hook chain unnecessarily slow	Reduced `ensureWorkerStarted()` timeouts from 30s→3-5s; stale PID cleanup avoids waiting for dead processes
#923	Slow Plugin Loading Time on Claude Startup	Root cause was the 30s HEALTH_CHECK and 30s POST_SPAWN_WAIT constants; now 3s and 5s respectively
#984	Stop hook fails with "Unknown event type: session-complete"	`getEventHandler()` returns no-op handler for unknown events instead of throwing
#987	Stop hook causes infinite session loop on summarize failure	Error classifier treats summarize HTTP failures as graceful (exit 0), preventing BLOCKING_ERROR from retriggering
#1042	ENOENT error in stop hooks: race condition reading package.json	`getPluginVersion()` wrapped in try-catch, returns `'unknown'` on ENOENT/EBUSY; version check skipped when unknown

Relationship to Open PRs

Add worker idle auto-shutdown with configurable timeout #1055 (worker idle auto-shutdown): Complementary, not overlapping. That PR adds idle timeout to shut down healthy workers after inactivity; this PR fixes recovery when workers are already dead. Both use ensureWorkerStarted() — our timeout reductions make their auto-restart path faster too.
fix: prevent unbounded Claude subprocess spawning in worker daemon #1008 (prevent unbounded subprocess spawning): Different problem scope. That PR addresses orphaned Claude CLI subprocesses; this PR addresses the worker daemon's own lifecycle and hook error handling. No file conflicts.

Changes (17 files)

Core infrastructure

File	Change
`src/shared/hook-constants.ts`	HEALTH_CHECK 30s→3s, add POST_SPAWN_WAIT (5s), PORT_IN_USE_WAIT (3s)
`src/services/infrastructure/ProcessManager.ts`	Add `isProcessAlive()` (EPERM-aware), `cleanStalePidFile()`, `spawnDaemon()` uses `setsid` on Linux
`src/shared/worker-utils.ts`	Wrap `getPluginVersion()` in try-catch (#1042), add `CLAUDE_MEM_HEALTH_TIMEOUT_MS` env override
`src/services/worker-service.ts`	Call `cleanStalePidFile()` at top of `ensureWorkerStarted()`, use named constants, SIGHUP handler (ignore in daemon mode), unhandled error guards in daemon mode
`src/cli/hook-command.ts`	Add `isWorkerUnavailableError()` classifier (transport/5xx→exit 0, 4xx→exit 2)

Handler graceful degradation

File	Change
`src/cli/handlers/index.ts`	Return no-op handler for unknown event types (#984)
`src/cli/handlers/context.ts`	Wrap fetch in try-catch, return empty context on failure
`src/cli/handlers/observation.ts`	Return graceful on `!response.ok` instead of throw
`src/cli/handlers/file-edit.ts`	Same pattern as observation
`src/cli/handlers/user-message.ts`	Add worker check + wrap fetch in try-catch
`src/cli/handlers/session-complete.ts`	Handle `ensureWorkerRunning()` return value

Tests

File	Change
`tests/hook-constants.test.ts`	Updated assertions for new constant values
`tests/infrastructure/process-manager.test.ts`	Tests for `isProcessAlive()`, `cleanStalePidFile()`, EPERM handling, `spawnDaemon` setsid, SIGHUP listener
`tests/hook-command.test.ts` (new)	20 tests for error classifier — transport vs 4xx vs 5xx vs programming errors

Built bundles

File	Change
`plugin/scripts/worker-service.cjs`	Rebuilt
`plugin/scripts/mcp-server.cjs`	Rebuilt

Design Decisions

ensureWorkerStarted() kept in hook command: Peer review (Gemini + Codex) confirmed this is the only code path that starts the worker — removing it would break auto-recovery. The reduced timeouts (5s max vs 30s) eliminate the double-penalty concern.
Two platform multiplier functions preserved: getTimeout() (1.5x) for hook-side fast path, getPlatformTimeout() (2.0x) for worker-side socket operations — clarifying comment added.
Conservative error default: Unknown errors map to exit 2 (blocking) rather than exit 0, so new bug categories surface rather than silently failing.
Env var override: CLAUDE_MEM_HEALTH_TIMEOUT_MS lets users on unusually slow systems increase the health check timeout without code changes (validated: 500ms–300000ms range).
Belt-and-suspenders for daemon survival: Three layers prevent daemon death — (1) SIGHUP handler ignores the signal in daemon mode, (2) setsid creates a new session so SIGHUP is never delivered, (3) uncaughtException/unhandledRejection handlers prevent silent crashes from any source.

Test plan

Lightweight script for Claude Code statusLineCommand integration. Returns per-project observation and prompt counts via direct SQLite read (~15ms, no HTTP, no worker dependency). Counts are scoped with WHERE project = ? to prevent inflated totals from cross-project observations. Supports CLAUDE_MEM_DATA_DIR from settings.json for custom data directory configurations.

- Check CLAUDE_MEM_DATA_DIR env var before settings.json (Greptile) - Derive project before DB check for consistent output (Greptile) - Include project in error fallback output (Greptile) - Set executable permission for shebang compatibility (Greptile)

thedotmack#923, thedotmack#984, thedotmack#987, thedotmack#1042) Reduce timeouts to eliminate 10-30s startup delay when worker is dead (common on WSL2 after hibernate). Add stale PID detection, graceful error handling across all handlers, and error classification that distinguishes worker unavailability from handler bugs. - HEALTH_CHECK 30s→3s, new POST_SPAWN_WAIT (5s), PORT_IN_USE_WAIT (3s) - isProcessAlive() with EPERM handling, cleanStalePidFile() - getPluginVersion() try-catch for shutdown race (thedotmack#1042) - isWorkerUnavailableError: transport+5xx+429→exit 0, 4xx→exit 2 - No-op handler for unknown event types (thedotmack#984) - Wrap all handler fetch calls in try-catch for graceful degradation - CLAUDE_MEM_HEALTH_TIMEOUT_MS env var override with validation

greptile-apps · 2026-02-10T22:34:16Z

Greptile Overview

Greptile Summary

This PR delivers significant performance improvements and graceful degradation for worker lifecycle management. The changes reduce worst-case recovery time from ~60s to ~8s (87% faster) through three key optimizations: reduced timeout constants (HEALTH_CHECK 30s→3s, POST_SPAWN_WAIT 30s→5s, PORT_IN_USE_WAIT 15s→3s), proactive stale PID cleanup via cleanStalePidFile(), and graceful error handling across all hooks.

Key improvements:

Performance: Stale PID cleanup eliminates waiting for dead processes; reduced timeouts reflect actual healthy worker response times (<100ms)
Resilience: Error classifier distinguishes transport/5xx failures (exit 0) from 4xx/programming errors (exit 2), preventing error dialogs during sleep/hibernate recovery
Fixes: Addresses 5 specific issues (SessionStart hook chain unnecessarily stops worker on every startup #957, Slow Plugin Loading Time on Claude Startup #923, Stop hook fails with "Unknown event type: session-complete" #984, Stop hook can cause infinite session loop when summarize output is interpreted as instructions #987, ENOENT error in stop hooks: race condition reading package.json during shutdown #1042) with targeted solutions

Implementation quality:

Comprehensive test coverage with 20 new tests for error classification
Conservative defaults (unknown errors → blocking) ensure new bug categories surface
CLAUDE_MEM_HEALTH_TIMEOUT_MS env override provides escape hatch for slow systems
All handlers now gracefully degrade on worker unavailability instead of blocking

Edge cases handled:

EPERM handling in isProcessAlive() for multi-user environments
ENOENT/EBUSY handling in getPluginVersion() for shutdown race conditions
Unknown event types return no-op handler instead of throwing
Version check skipped when plugin version is 'unknown'

Confidence Score: 5/5

Safe to merge — well-tested performance optimization with comprehensive error handling
High confidence due to: (1) thorough test coverage with 20 new error classifier tests and process manager tests, (2) conservative error handling defaults (unknown errors → blocking), (3) targeted fixes for 5 specific issues with clear rationale, (4) reduced timeouts based on empirical data (healthy worker <100ms), (5) graceful degradation preserves user experience during transient failures
No files require special attention

Important Files Changed

Filename	Overview
src/shared/hook-constants.ts	Reduced health check timeout from 30s to 3s and added POST_SPAWN_WAIT (5s) and PORT_IN_USE_WAIT (3s) constants for faster worker recovery
src/services/infrastructure/ProcessManager.ts	Added isProcessAlive() using signal-0 idiom with EPERM handling and cleanStalePidFile() for proactive cleanup of dead worker PIDs
src/shared/worker-utils.ts	Added CLAUDE_MEM_HEALTH_TIMEOUT_MS env override (500-300000ms) and wrapped getPluginVersion() in try-catch to handle ENOENT/EBUSY during shutdown
src/services/worker-service.ts	Called cleanStalePidFile() at start of ensureWorkerStarted() and replaced hardcoded timeouts with HOOK_TIMEOUTS constants
src/cli/hook-command.ts	Added isWorkerUnavailableError() classifier to distinguish transport/5xx errors (exit 0) from 4xx/programming errors (exit 2)
src/cli/handlers/index.ts	Changed getEventHandler() to return no-op handler for unknown event types instead of throwing (fixes #984)
src/cli/handlers/context.ts	Wrapped fetch in try-catch to return empty context on worker failure instead of blocking session start
src/cli/handlers/observation.ts	Wrapped fetch in try-catch to skip observation storage gracefully on worker failure instead of throwing

Sequence Diagram

sequenceDiagram
    participant Hook as Hook Command
    participant EWS as ensureWorkerStarted()
    participant PID as PID File
    participant Worker as Worker Process
    participant Health as Health Check

    Note over Hook,Health: Dead Worker Recovery (87% faster)

    Hook->>EWS: Start worker
    EWS->>PID: cleanStalePidFile()
    PID->>PID: Read PID file
    PID->>Worker: isProcessAlive(pid) [signal 0]
    Worker-->>PID: ESRCH (dead)
    PID->>PID: Remove stale PID file
    Note right of PID: Instant cleanup<br/>(was: wait 30s)

    EWS->>Health: waitForHealth(1000ms)
    Health-->>EWS: Not healthy
    
    EWS->>EWS: Check port in use
    EWS->>Health: waitForHealth(3s)
    Note right of Health: Reduced from 15s
    Health-->>EWS: Port free

    EWS->>Worker: spawnDaemon()
    Worker->>Worker: Start process
    Worker->>PID: Write PID file (after listen())
    
    EWS->>Health: waitForHealth(5s)
    Note right of Health: Reduced from 30s
    Health->>Worker: GET /health
    Worker-->>Health: 200 OK
    Health-->>EWS: Healthy
    
    EWS-->>Hook: Worker ready (8s total)
    Note over Hook,Health: Before: ~60s<br/>After: ~8s<br/>87% improvement

greptile-apps

_{17 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

src/cli/hook-command.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Address Greptile review: add comment noting that TypeError('fetch failed') is already handled by transport patterns before the instanceof check.

rodboev · 2026-02-10T22:55:08Z

Re: Greptile's review comment on TypeError order dependency:

Good catch — applied in 22683f6. The order dependency is subtle: TypeError('fetch failed') hits the transport pattern check on line 43 before reaching the instanceof TypeError check on line 59, so generic TypeErrors correctly fall through to the blocking path.

# Conflicts: # plugin/scripts/mcp-server.cjs # plugin/scripts/worker-service.cjs

Missing return statement and closing brace in the programming errors check caused a build failure after merging main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Root cause: registerSignalHandlers() handled SIGTERM/SIGINT but not SIGHUP. When the parent hook process exits, the kernel sends SIGHUP to the daemon, causing immediate termination (default signal action). Belt-and-suspenders fix: 1. SIGHUP handler: ignore in daemon mode, graceful shutdown otherwise 2. setsid: spawn daemon in new session on Linux (prevents SIGHUP delivery) 3. Global unhandledRejection/uncaughtException guards in daemon mode

rodboev added 3 commits February 10, 2026 04:15

greptile-apps bot reviewed Feb 10, 2026

View reviewed changes

src/cli/hook-command.ts Show resolved Hide resolved

rodboev and others added 2 commits February 10, 2026 17:36

Clarify order dependency

7ffa1b0

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

fix: clarify TypeError order dependency in error classifier

22683f6

Address Greptile review: add comment noting that TypeError('fetch failed') is already handled by transport patterns before the instanceof check.

thedotmack and others added 3 commits February 10, 2026 23:37

Merge branch 'main' into fix/hook-resilience-worker-lifecycle

af95461

# Conflicts: # plugin/scripts/mcp-server.cjs # plugin/scripts/worker-service.cjs

fix: resolve merge conflict in isWorkerUnavailableError

cb0933a

Missing return statement and closing brace in the programming errors check caused a build failure after merging main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

thedotmack merged commit 7757966 into thedotmack:main Feb 11, 2026
1 check failed

rodboev mentioned this pull request Feb 13, 2026

fix: eliminate unbounded process spawning with 4-layer defense #1085

Closed

9 tasks

claude bot mentioned this pull request Mar 9, 2026

feat: add Cowork compatibility, /mode command, and make-mode skill #1309

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker#1056

fix: hook resilience and worker lifecycle — 87% faster recovery from dead worker#1056
thedotmack merged 8 commits intothedotmack:mainfrom
rodboev:fix/hook-resilience-worker-lifecycle

rodboev commented Feb 10, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Feb 10, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

rodboev commented Feb 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rodboev commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Performance Metrics

Issues Closed

Relationship to Open PRs

Changes (17 files)

Core infrastructure

Handler graceful degradation

Tests

Built bundles

Design Decisions

Test plan

Uh oh!

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rodboev commented Feb 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rodboev commented Feb 10, 2026 •

edited

Loading