
fix: eliminate unbounded process spawning with 4-layer defense #1085

Closed

rodboev wants to merge 9 commits into thedotmack:main from rodboev:fix/process-spawn-stabilization

Conversation


@rodboev rodboev commented Feb 13, 2026

Summary

Eliminates the root cause of unbounded worker and chroma-mcp process spawning that repeatedly crashes WSL2:

  • Layer 0: Filesystem mutex (proper-lockfile) prevents TOCTOU race in worker spawn — only lock holder can spawn; concurrent hooks wait for port health instead
  • Layer 1: EADDRINUSE suicide pact — failed workers exit(0) immediately instead of entering initializeBackground() and spawning zombie subprocesses
  • Layer 2: Remove redundant start commands from hooks — halves spawn attempts per hook event (each hook's hook <event> command already calls ensureWorkerStarted())
  • Layer 3: ChromaSync connection mutex + circuit breaker + safe state reset + count-based orphan reaper

~400 lines across 9 files (including 169 lines of tests). Designed to survive merge conflicts (small, isolated changes; new singleton-manager.ts in its own file that no parallel development will touch).
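The Layer 0 acquire-or-yield pattern can be sketched in a few lines. This is a simplified stdlib illustration of the idea, not the actual singleton-manager.ts (which uses proper-lockfile); `acquireSpawnLock`, `lockPath`, and `LOCK_STALE_MS` are illustrative names. The key property is that the lock is created atomically, so there is no check-then-act window.

```typescript
// Simplified sketch of the Layer 0 lockfile mutex. The PR uses proper-lockfile;
// this shows the same acquire-or-yield pattern with Node's atomic O_EXCL create.
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

const LOCK_STALE_MS = 10_000; // mirrors the PR's 10s stale timeout

function lockPath(port: number): string {
  return path.join(os.tmpdir(), `worker-${port}.lock`);
}

// Returns a release function if we won the lock, or null if another process
// holds it (caller should fall back to waiting for port health).
function acquireSpawnLock(port: number): (() => void) | null {
  const file = lockPath(port);
  try {
    // 'wx' = O_CREAT | O_EXCL: creation fails atomically if the file exists,
    // so two processes can never both believe they hold the lock (no TOCTOU).
    const fd = fs.openSync(file, "wx");
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return () => { try { fs.unlinkSync(file); } catch {} };
  } catch (err: any) {
    if (err.code !== "EEXIST") throw err;
    // Lock exists: reclaim it only if it is stale (holder likely crashed).
    const ageMs = Date.now() - fs.statSync(file).mtimeMs;
    if (ageMs > LOCK_STALE_MS) {
      try { fs.unlinkSync(file); } catch {}
      return acquireSpawnLock(port); // retry once after reaping the stale lock
    }
    return null; // lock-loser: do not spawn
  }
}
```

proper-lockfile adds refinements this sketch omits (lock compromise detection, retry backoff, mtime refreshing), but the contention behavior is the same: exactly one caller gets a non-null release function.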

Empirical Evidence

| Metric | Before | After | Source |
|---|---|---|---|
| Max chroma-mcp processes (5 min) | 641 | ≤2 (count-based reaper) | CHANGELOG.md, commit a3f9e7f |
| Max concurrent workers | 19 | 1 (filesystem mutex) | Design doc, Feb 13 incident |
| Spawn attempts per hook event | 2 (start + hook) | 1 (hook only) | hooks.json diff |
| Orphan age threshold | 30 min | 5 min | ProcessManager.ts:34 |
| TOCTOU race window | 100-500ms | 0 (serialized by lock) | singleton-manager.ts |
| Worker spawn time (Linux) | <1s | <1s (unchanged) | hook-constants.ts:4 |
| Connection mutex (concurrent searches) | N searches → N chroma-mcp | N searches → 1 chroma-mcp | ChromaSync promise memoization |
| Circuit breaker | None | 3 failures → 60s cooldown | ChromaSync.ts:99-100 |
| Failed worker behavior on EADDRINUSE | Continue to initializeBackground() → spawn subprocesses | exit(0) immediately | worker-service.ts:340-346 |
| Lock stale timeout | N/A | 10s (2x POST_SPAWN_WAIT on Linux, auto-release for crashes) | singleton-manager.ts:42 |

Timing Evidence (from hook-constants.ts)

```
POST_SPAWN_WAIT:  5000ms  // Worker spawn completes in <1s on Linux
HEALTH_CHECK:     3000ms  // Healthy worker responds in <100ms
PORT_IN_USE_WAIT: 3000ms  // Wait for port release
WINDOWS_MULTIPLIER: 1.5x  // 7.5s spawn wait on Windows
```
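Applied naively, the platform multiplier works out like this (illustrative sketch; the actual hook-constants.ts may structure it differently):

```typescript
// Illustrative platform-adjusted timeout helper; names mirror the constants
// above but this is not the actual hook-constants.ts implementation.
const POST_SPAWN_WAIT = 5_000;
const HEALTH_CHECK = 3_000;
const PORT_IN_USE_WAIT = 3_000;
const WINDOWS_MULTIPLIER = 1.5;

function effectiveTimeout(baseMs: number, platform: string = process.platform): number {
  // Windows process spawn is slower, so all waits are stretched by 1.5x.
  return platform === "win32" ? baseMs * WINDOWS_MULTIPLIER : baseMs;
}
```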

Why Previous Fixes Didn't Stick

| PR | What It Fixed | Why It Wasn't Enough |
|---|---|---|
| #959 | Blocking startup (fail-open) | Orthogonal to spawn race |
| #960 | Redundant hook error | Orthogonal to spawn race |
| #1047 | Statusline (bypass worker) | Reduces dependency, doesn't fix race |
| #1056 | Timeout reduction (30s→3-5s) | Faster recovery, but paradoxically increases spawn rate during races |
| #1065 | 5-layer chroma defense | Merged Feb 11, overwritten 31 hours later by PR #1076 (4504-line merge conflict resolution) — all defense code lost |

PR #1065 Reversion Evidence

Relationship to Existing Issues & PRs

Direct Fixes

Rebuilds Work from PR #1065

PR #1065 (5-layer chroma defense) was merged 2026-02-11 but overwritten by PR #1076's merge conflict resolution 31 hours later — current main has zero spawn storm protection. This PR rebuilds Layers 3-4 from #1065 (connection mutex, circuit breaker, count-based reaper) and adds new Layers 0-2 (worker-level defenses that #1065 did not have).

Complementary PRs (Can Coexist)

Partial Overlaps / Gaps Remaining

Changes

| File | Change | Lines |
|---|---|---|
| NEW src/services/infrastructure/singleton-manager.ts | Filesystem mutex via proper-lockfile | +61 |
| src/services/worker-service.ts | Mutex integration in ensureWorkerStarted() + EADDRINUSE exit | +65/-18 |
| plugin/hooks/hooks.json | Remove 4 redundant start commands | -20 |
| src/services/sync/ChromaSync.ts | Connection mutex, circuit breaker, safe state reset, DRY close() | +124/-53 |
| src/services/infrastructure/ProcessManager.ts | Lower orphan age 30→5min, count-based chroma reaper | +54/-1 |
| package.json | Add proper-lockfile + @types/proper-lockfile | +2 |

Review Process

This PR was validated by 5 independent review sources:

| Reviewer | Key Findings | Status |
|---|---|---|
| Gemini (local) | DRY violation in close(), stale timeout tuning, Windows gap | All fixed |
| Codex (local) | Mutex+EADDRINUSE interaction analysis, hook cascade analysis | Analyzed: benign on Linux/WSL2 (SO_REUSEADDR handles TIME_WAIT) |
| Internal code review | Orphaned JSDoc, safeResetConnection ordering analysis | JSDoc fixed; ordering confirmed correct |
| Greptile (GitHub) | 5/5 confidence, "Safe to merge" | No issues found |
| claude-review CI | OIDC token misconfiguration | Repo infra issue, not code |

Test plan

  • bun test — 882 pass, 7 new tests all pass (23 pre-existing failures unchanged)
  • tsc --noEmit — no new type errors
  • Gemini peer review — findings addressed
  • Codex peer review — architectural concerns analyzed with codebase evidence
  • Internal code review — orphaned JSDoc fixed
  • Greptile — 5/5 confidence
  • Manual: spawn 10 concurrent hooks, verify only 1 worker
  • Manual: fire concurrent searches, verify ≤2 chroma-mcp processes
  • Manual: SIGHUP test — clean restart, no orphans

Uses proper-lockfile (per-port lock file) so only one process at a time
can attempt to spawn a worker daemon. Lock-losers get null and fall back
to waiting for port health instead.
Only the lock holder can spawn a worker. Lock-losers get null and fall
back to waiting for port health. Double-check pattern inside the lock
re-verifies health before spawning.
…cesses

Workers that lose the port race now exit(0) before entering
initializeBackground(), preventing each race-loser from spawning
its own tree of chroma-mcp subprocesses.
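A minimal sketch of the suicide pact, assuming a plain net server. The real worker-service.ts differs; the `onPortLost` parameter is added here purely for testability, where the actual code calls process.exit(0) directly in the error handler.

```typescript
// Hedged sketch of the Layer 1 "suicide pact" (worker-service.ts:340-346
// paraphrased, not the actual code). Only the process that successfully
// binds the port ever reaches initializeBackground().
import * as net from "node:net";

function startWorker(
  port: number,
  initializeBackground: () => void,
  onPortLost: () => void = () => process.exit(0), // the suicide pact
): net.Server {
  const server = net.createServer();
  server.once("error", (err: NodeJS.ErrnoException) => {
    if (err.code === "EADDRINUSE") {
      // Another worker already owns the port: exit cleanly instead of
      // proceeding to spawn chroma-mcp subprocesses as a zombie.
      onPortLost();
      return;
    }
    throw err;
  });
  server.listen(port, () => {
    // Bind succeeded; safe to initialize subprocesses.
    initializeBackground();
  });
  return server;
}
```

Exiting with code 0 (not 1) matters: hook wrappers treat a nonzero worker exit as a failure worth retrying, which would re-trigger the spawn storm.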
…ts per event

The hook command already calls ensureWorkerStarted() internally (line 1009
of worker-service.ts). The separate start command doubled the spawn
attempt for every hook event.
… reset

Prevents concurrent ensureConnection() calls from spawning multiple
chroma-mcp subprocesses. Circuit breaker stops retry storms after 3
consecutive failures. Safe reset closes transport before nulling
reference to prevent orphaned subprocesses. Fixed close() early-return
bug where error handlers could skip subprocess cleanup.
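The mutex and circuit breaker described above can be sketched together as one guard class. This is a hedged illustration of the pattern, not the actual ChromaSync.ts; `spawnChroma` stands in for the real transport setup, and the injectable clock exists only to make the cooldown testable. Thresholds mirror the PR: 3 failures, 60s cooldown.

```typescript
// Sketch of Layer 3's connection mutex (promise memoization) plus circuit
// breaker. Illustrative only; ChromaSync.ts differs in detail.
class ChromaConnectionGuard<T> {
  private connecting: Promise<T> | null = null; // mutex: memoized in-flight promise
  private failures = 0;
  private openedAt = 0; // timestamp when the breaker tripped
  private static readonly MAX_FAILURES = 3;
  private static readonly COOLDOWN_MS = 60_000;

  constructor(
    private spawnChroma: () => Promise<T>,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  ensureConnection(): Promise<T> {
    // Circuit breaker: after 3 consecutive failures, reject fast for 60s
    // instead of feeding a retry storm.
    if (this.failures >= ChromaConnectionGuard.MAX_FAILURES &&
        this.now() - this.openedAt < ChromaConnectionGuard.COOLDOWN_MS) {
      return Promise.reject(new Error("circuit open: cooling down"));
    }
    // Mutex: concurrent callers coalesce onto the single in-flight spawn,
    // so N searches produce 1 chroma-mcp, not N.
    if (!this.connecting) {
      this.connecting = this.spawnChroma().then(
        (conn) => { this.failures = 0; return conn; },
        (err) => {
          this.connecting = null; // allow a fresh attempt later
          this.failures += 1;
          if (this.failures >= ChromaConnectionGuard.MAX_FAILURES) {
            this.openedAt = this.now(); // trip the breaker
          }
          throw err;
        },
      );
    }
    return this.connecting;
  }
}
```

On success the memoized promise is kept, so later callers reuse the established connection; it is only cleared on failure so the next call can retry.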
Age-based cleanup (was 30min) completely missed spawn storms where all
processes are <5min old. Count-based reaper kills excess chroma-mcp
regardless of age, keeping only the 2 newest.
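The count-based selection is simple enough to show directly. A hedged sketch (the real ProcessManager.ts differs; `kill` is injectable here so the selection logic is testable without real PIDs):

```typescript
// Sketch of the count-based reaper: keep the 2 newest chroma-mcp processes,
// kill the rest regardless of age. Illustrative, not ProcessManager.ts.
interface ProcInfo {
  pid: number;
  startedAtMs: number;
}

const MAX_CHROMA_PROCS = 2;

function reapExcessChroma(
  procs: ProcInfo[],
  kill: (pid: number) => void = (pid) => process.kill(pid, "SIGTERM"),
): number[] {
  // Sort newest first; everything past the first MAX_CHROMA_PROCS dies,
  // even if it is only seconds old (which age-based cleanup missed).
  const doomed = [...procs]
    .sort((a, b) => b.startedAtMs - a.startedAtMs)
    .slice(MAX_CHROMA_PROCS);
  for (const p of doomed) kill(p.pid);
  return doomed.map((p) => p.pid);
}
```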
- DRY: close() now delegates to safeResetConnection() instead of
  duplicating the capture-null-close pattern
- Reduce stale lock timeout from 30s to 10s (spawn should complete
  well within 10s; shorter timeout = faster recovery from crashes)
- Add comment clarifying Windows gap in count-based cleanup

greptile-apps bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

Implements a comprehensive 4-layer defense system to eliminate unbounded worker and chroma-mcp process spawning that crashed WSL2 with 641 chroma-mcp processes in 5 minutes.

Layer 0 (Filesystem Mutex): New singleton-manager.ts uses proper-lockfile to prevent TOCTOU races in worker spawn — only the lock holder spawns; concurrent hooks wait for port health. 10s stale timeout handles crashed processes.

Layer 1 (EADDRINUSE Suicide Pact): Workers that fail to bind the port immediately exit(0) instead of entering initializeBackground() and spawning zombie subprocesses.

Layer 2 (Remove Redundant Commands): Removed 4 redundant start commands from hooks.json — halves spawn attempts since each hook's command already calls ensureWorkerStarted().

Layer 3 (ChromaSync Protection): Connection mutex coalesces concurrent calls onto a single spawn. Circuit breaker stops retry storms after 3 consecutive failures (60s cooldown). Safe state reset closes transport before nulling reference to prevent orphaned chroma-mcp processes.

Layer 4 (Improved Cleanup): Lowered orphan age from 30→5 minutes and added count-based chroma-mcp reaper that keeps 2 newest processes and kills the rest.

The implementation is well-tested with 7 new tests covering mutex contention, EADDRINUSE detection, connection coalescing, and circuit breaker logic.

Confidence Score: 5/5

  • Safe to merge — comprehensive solution with excellent test coverage and defensive error handling
  • Multi-layered defense-in-depth approach with proper error handling, well-tested implementation, and graceful degradation patterns. Each layer independently addresses the spawn storm issue, making the solution robust against edge cases.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| plugin/hooks/hooks.json | Removed 4 redundant start commands from hooks (Layer 2) — each hook's command already calls ensureWorkerStarted() |
| src/services/infrastructure/singleton-manager.ts | New filesystem mutex using proper-lockfile (Layer 0) — prevents TOCTOU race in worker spawn with 10s stale timeout |
| src/services/worker-service.ts | Integrated spawn lock in ensureWorkerStarted() and added EADDRINUSE suicide pact (Layer 1) — failed workers exit(0) immediately |
| src/services/infrastructure/ProcessManager.ts | Lowered orphan age from 30 to 5 minutes and added count-based chroma-mcp reaper (keeps 2 newest, kills rest) |
| src/services/sync/ChromaSync.ts | Added connection mutex, circuit breaker (3 failures → 60s cooldown), safe state reset, and DRY close() logic (Layer 3) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant H1 as Hook Process 1
    participant H2 as Hook Process 2
    participant FS as Filesystem Lock
    participant W as Worker Daemon
    participant P as Port 37777

    Note over H1,H2: Multiple hooks fire simultaneously

    H1->>FS: acquireSpawnLock()
    H2->>FS: acquireSpawnLock()
    FS-->>H1: Lock acquired ✓
    FS-->>H2: ELOCKED (null)

    Note over H2: Falls back to waitForHealth()

    H1->>P: Check if port in use
    P-->>H1: Port free

    H1->>W: spawnDaemon()
    activate W
    W->>P: server.listen(port)

    alt Port bind succeeds
        P-->>W: Listening ✓
        W->>W: writePidFile()
        W->>W: initializeBackground()
        H1->>W: waitForHealth()
        W-->>H1: 200 OK
        H1->>FS: Release lock

        H2->>W: waitForHealth()
        W-->>H2: 200 OK
        Note over H2: Worker ready
    else Port already bound (EADDRINUSE)
        P-->>W: EADDRINUSE error
        W->>W: process.exit(0)
        deactivate W
        Note over W: Suicide pact prevents zombie spawn
    end
```

Last reviewed commit: 0b77f8f

@thedotmack
Owner

Thanks for this thorough process spawning defense! Note: v10.0.7 refactored ChromaSync from MCP stdio transport to an HTTP client model (ChromaServerManager), which may make some of the ChromaSync process-spawning guards obsolete. Could you rebase onto main and evaluate which defense layers are still needed given the new architecture? The worker spawn defense layers (filesystem mutex, EADDRINUSE exit) are likely still valuable.


rodboev commented Feb 17, 2026

I'm actually running a monkey-patched version that uses 1mcp as the host to prevent multiple workers from spawning, with minimal overhead via either a local stdio proxy (like mcp-remote) or HTTP, and chroma under systemd. It turned down the noise a lot. But while questioning whether this was likely to get interest, I noticed unusual traffic and usage patterns and am rebuilding my environment from backups. Given that, it will be a bit longer until I work out how to monkey-patch WSL2 itself and finish, but I have that ready to commit and will definitely revisit this as well. Thanks for letting me know; I was trying to figure out where you had run into issues.

@thedotmack
Owner

Superseded by the embedded Process Supervisor (PR #1370, v10.5.6). Process spawning is now managed through a centralized ProcessRegistry with session-scoped tracking, health checks, and graceful shutdown cascades.

@thedotmack thedotmack closed this Mar 16, 2026
