
fix: eliminate unbounded process spawning with 4-layer defense #1085

Closed

rodboev wants to merge 9 commits into thedotmack:main from rodboev:fix/process-spawn-stabilization

Conversation


@rodboev rodboev commented Feb 13, 2026

Summary

Eliminates the root cause of unbounded worker and chroma-mcp process spawning that repeatedly crashes WSL2:

  • Layer 0: Filesystem mutex (proper-lockfile) prevents TOCTOU race in worker spawn — only lock holder can spawn; concurrent hooks wait for port health instead
  • Layer 1: EADDRINUSE suicide pact — failed workers exit(0) immediately instead of entering initializeBackground() and spawning zombie subprocesses
  • Layer 2: Remove redundant start commands from hooks — halves spawn attempts per hook event (each hook's hook <event> command already calls ensureWorkerStarted())
  • Layer 3: ChromaSync connection mutex + circuit breaker + safe state reset + count-based orphan reaper

~400 lines across 9 files (including 169 lines of tests). Designed to survive merge conflicts (small, isolated changes; new singleton-manager.ts in its own file that no parallel development will touch).
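The Layer 0 acquire-or-yield pattern can be sketched in a few lines. This is a simplified stdlib illustration of the idea, not the actual singleton-manager.ts (which uses proper-lockfile); `acquireSpawnLock`, `lockPath`, and `LOCK_STALE_MS` are illustrative names. The key property is that the lock is created atomically, so there is no check-then-act window.

```typescript
// Simplified sketch of the Layer 0 lockfile mutex. The PR uses proper-lockfile;
// this shows the same acquire-or-yield pattern with Node's atomic O_EXCL create.
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

const LOCK_STALE_MS = 10_000; // mirrors the PR's 10s stale timeout

function lockPath(port: number): string {
  return path.join(os.tmpdir(), `worker-${port}.lock`);
}

// Returns a release function if we won the lock, or null if another process
// holds it (caller should fall back to waiting for port health).
function acquireSpawnLock(port: number): (() => void) | null {
  const file = lockPath(port);
  try {
    // 'wx' = O_CREAT | O_EXCL: creation fails atomically if the file exists,
    // so two processes can never both believe they hold the lock (no TOCTOU).
    const fd = fs.openSync(file, "wx");
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return () => { try { fs.unlinkSync(file); } catch {} };
  } catch (err: any) {
    if (err.code !== "EEXIST") throw err;
    // Lock exists: reclaim it only if it is stale (holder likely crashed).
    const ageMs = Date.now() - fs.statSync(file).mtimeMs;
    if (ageMs > LOCK_STALE_MS) {
      try { fs.unlinkSync(file); } catch {}
      return acquireSpawnLock(port); // retry once after reaping the stale lock
    }
    return null; // lock-loser: do not spawn
  }
}
```

proper-lockfile adds refinements this sketch omits (lock compromise detection, retry backoff, mtime refreshing), but the contention behavior is the same: exactly one caller gets a non-null release function.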

Empirical Evidence

| Metric | Before | After | Source |
|---|---|---|---|
| Max chroma-mcp processes (5 min) | 641 | ≤2 (count-based reaper) | CHANGELOG.md, commit a3f9e7f |
| Max concurrent workers | 19 | 1 (filesystem mutex) | Design doc, Feb 13 incident |
| Spawn attempts per hook event | 2 (start + hook) | 1 (hook only) | hooks.json diff |
| Orphan age threshold | 30 min | 5 min | ProcessManager.ts:34 |
| TOCTOU race window | 100-500ms | 0 (serialized by lock) | singleton-manager.ts |
| Worker spawn time (Linux) | <1s | <1s (unchanged) | hook-constants.ts:4 |
| Connection mutex (concurrent searches) | N searches → N chroma-mcp | N searches → 1 chroma-mcp | ChromaSync promise memoization |
| Circuit breaker | None | 3 failures → 60s cooldown | ChromaSync.ts:99-100 |
| Failed worker behavior on EADDRINUSE | Continue to initializeBackground() → spawn subprocesses | exit(0) immediately | worker-service.ts:340-346 |
| Lock stale timeout | N/A | 10s (2x POST_SPAWN_WAIT on Linux, auto-release for crashes) | singleton-manager.ts:42 |

Timing Evidence (from hook-constants.ts)

```
POST_SPAWN_WAIT:  5000ms  // Worker spawn completes in <1s on Linux
HEALTH_CHECK:     3000ms  // Healthy worker responds in <100ms
PORT_IN_USE_WAIT: 3000ms  // Wait for port release
WINDOWS_MULTIPLIER: 1.5x  // 7.5s spawn wait on Windows
```
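Applied naively, the platform multiplier works out like this (illustrative sketch; the actual hook-constants.ts may structure it differently):

```typescript
// Illustrative platform-adjusted timeout helper; names mirror the constants
// above but this is not the actual hook-constants.ts implementation.
const POST_SPAWN_WAIT = 5_000;
const HEALTH_CHECK = 3_000;
const PORT_IN_USE_WAIT = 3_000;
const WINDOWS_MULTIPLIER = 1.5;

function effectiveTimeout(baseMs: number, platform: string = process.platform): number {
  // Windows process spawn is slower, so all waits are stretched by 1.5x.
  return platform === "win32" ? baseMs * WINDOWS_MULTIPLIER : baseMs;
}
```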

Why Previous Fixes Didn't Stick

| PR | What It Fixed | Why It Wasn't Enough |
|---|---|---|
| #959 | Blocking startup (fail-open) | Orthogonal to spawn race |
| #960 | Redundant hook error | Orthogonal to spawn race |
| #1047 | Statusline (bypass worker) | Reduces dependency, doesn't fix race |
| #1056 | Timeout reduction (30s→3-5s) | Faster recovery, but paradoxically increases spawn rate during races |
| #1065 | 5-layer chroma defense | Merged Feb 11, overwritten 31 hours later by PR #1076 (4504-line merge conflict resolution) — all defense code lost |

PR #1065 Reversion Evidence

Relationship to Existing Issues & PRs

Direct Fixes

Rebuilds Work from PR #1065

PR #1065 (5-layer chroma defense) was merged 2026-02-11 but overwritten by PR #1076's merge conflict resolution 31 hours later — current main has zero spawn storm protection. This PR rebuilds Layers 3-4 from #1065 (connection mutex, circuit breaker, count-based reaper) and adds new Layers 0-2 (worker-level defenses that #1065 did not have).

Complementary PRs (Can Coexist)

Partial Overlaps / Gaps Remaining

Changes

| File | Change | Lines |
|---|---|---|
| NEW src/services/infrastructure/singleton-manager.ts | Filesystem mutex via proper-lockfile | +61 |
| src/services/worker-service.ts | Mutex integration in ensureWorkerStarted() + EADDRINUSE exit | +65/-18 |
| plugin/hooks/hooks.json | Remove 4 redundant start commands | -20 |
| src/services/sync/ChromaSync.ts | Connection mutex, circuit breaker, safe state reset, DRY close() | +124/-53 |
| src/services/infrastructure/ProcessManager.ts | Lower orphan age 30→5min, count-based chroma reaper | +54/-1 |
| package.json | Add proper-lockfile + @types/proper-lockfile | +2 |

Review Process

This PR was validated by 5 independent review sources:

| Reviewer | Key Findings | Status |
|---|---|---|
| Gemini (local) | DRY violation in close(), stale timeout tuning, Windows gap | All fixed |
| Codex (local) | Mutex+EADDRINUSE interaction analysis, hook cascade analysis | Analyzed: benign on Linux/WSL2 (SO_REUSEADDR handles TIME_WAIT) |
| Internal code review | Orphaned JSDoc, safeResetConnection ordering analysis | JSDoc fixed; ordering confirmed correct |
| Greptile (GitHub) | 5/5 confidence, "Safe to merge" | No issues found |
| claude-review CI | OIDC token misconfiguration | Repo infra issue, not code |

Test plan

  • bun test — 882 pass, 7 new tests all pass (23 pre-existing failures unchanged)
  • tsc --noEmit — no new type errors
  • Gemini peer review — findings addressed
  • Codex peer review — architectural concerns analyzed with codebase evidence
  • Internal code review — orphaned JSDoc fixed
  • Greptile — 5/5 confidence
  • Manual: spawn 10 concurrent hooks, verify only 1 worker
  • Manual: fire concurrent searches, verify ≤2 chroma-mcp processes
  • Manual: SIGHUP test — clean restart, no orphans

Uses proper-lockfile (per-port lock file) so only one process at a time
can attempt to spawn a worker daemon. Lock-losers get null and fall back
to waiting for port health instead.
Only the lock holder can spawn a worker. Lock-losers get null and fall
back to waiting for port health. Double-check pattern inside the lock
re-verifies health before spawning.
…cesses

Workers that lose the port race now exit(0) before entering
initializeBackground(), preventing each race-loser from spawning
its own tree of chroma-mcp subprocesses.
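A minimal sketch of the suicide pact, assuming a plain net server. The real worker-service.ts differs; the `onPortLost` parameter is added here purely for testability, where the actual code calls process.exit(0) directly in the error handler.

```typescript
// Hedged sketch of the Layer 1 "suicide pact" (worker-service.ts:340-346
// paraphrased, not the actual code). Only the process that successfully
// binds the port ever reaches initializeBackground().
import * as net from "node:net";

function startWorker(
  port: number,
  initializeBackground: () => void,
  onPortLost: () => void = () => process.exit(0), // the suicide pact
): net.Server {
  const server = net.createServer();
  server.once("error", (err: NodeJS.ErrnoException) => {
    if (err.code === "EADDRINUSE") {
      // Another worker already owns the port: exit cleanly instead of
      // proceeding to spawn chroma-mcp subprocesses as a zombie.
      onPortLost();
      return;
    }
    throw err;
  });
  server.listen(port, () => {
    // Bind succeeded; safe to initialize subprocesses.
    initializeBackground();
  });
  return server;
}
```

Exiting with code 0 (not 1) matters: hook wrappers treat a nonzero worker exit as a failure worth retrying, which would re-trigger the spawn storm.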
…ts per event

The hook command already calls ensureWorkerStarted() internally (line 1009
of worker-service.ts). The separate start command doubled the spawn
attempt for every hook event.
… reset

Prevents concurrent ensureConnection() calls from spawning multiple
chroma-mcp subprocesses. Circuit breaker stops retry storms after 3
consecutive failures. Safe reset closes transport before nulling
reference to prevent orphaned subprocesses. Fixed close() early-return
bug where error handlers could skip subprocess cleanup.
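The mutex and circuit breaker described above can be sketched together as one guard class. This is a hedged illustration of the pattern, not the actual ChromaSync.ts; `spawnChroma` stands in for the real transport setup, and the injectable clock exists only to make the cooldown testable. Thresholds mirror the PR: 3 failures, 60s cooldown.

```typescript
// Sketch of Layer 3's connection mutex (promise memoization) plus circuit
// breaker. Illustrative only; ChromaSync.ts differs in detail.
class ChromaConnectionGuard<T> {
  private connecting: Promise<T> | null = null; // mutex: memoized in-flight promise
  private failures = 0;
  private openedAt = 0; // timestamp when the breaker tripped
  private static readonly MAX_FAILURES = 3;
  private static readonly COOLDOWN_MS = 60_000;

  constructor(
    private spawnChroma: () => Promise<T>,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  ensureConnection(): Promise<T> {
    // Circuit breaker: after 3 consecutive failures, reject fast for 60s
    // instead of feeding a retry storm.
    if (this.failures >= ChromaConnectionGuard.MAX_FAILURES &&
        this.now() - this.openedAt < ChromaConnectionGuard.COOLDOWN_MS) {
      return Promise.reject(new Error("circuit open: cooling down"));
    }
    // Mutex: concurrent callers coalesce onto the single in-flight spawn,
    // so N searches produce 1 chroma-mcp, not N.
    if (!this.connecting) {
      this.connecting = this.spawnChroma().then(
        (conn) => { this.failures = 0; return conn; },
        (err) => {
          this.connecting = null; // allow a fresh attempt later
          this.failures += 1;
          if (this.failures >= ChromaConnectionGuard.MAX_FAILURES) {
            this.openedAt = this.now(); // trip the breaker
          }
          throw err;
        },
      );
    }
    return this.connecting;
  }
}
```

On success the memoized promise is kept, so later callers reuse the established connection; it is only cleared on failure so the next call can retry.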
Age-based cleanup (was 30min) completely missed spawn storms where all
processes are <5min old. Count-based reaper kills excess chroma-mcp
regardless of age, keeping only the 2 newest.
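The count-based selection is simple enough to show directly. A hedged sketch (the real ProcessManager.ts differs; `kill` is injectable here so the selection logic is testable without real PIDs):

```typescript
// Sketch of the count-based reaper: keep the 2 newest chroma-mcp processes,
// kill the rest regardless of age. Illustrative, not ProcessManager.ts.
interface ProcInfo {
  pid: number;
  startedAtMs: number;
}

const MAX_CHROMA_PROCS = 2;

function reapExcessChroma(
  procs: ProcInfo[],
  kill: (pid: number) => void = (pid) => process.kill(pid, "SIGTERM"),
): number[] {
  // Sort newest first; everything past the first MAX_CHROMA_PROCS dies,
  // even if it is only seconds old (which age-based cleanup missed).
  const doomed = [...procs]
    .sort((a, b) => b.startedAtMs - a.startedAtMs)
    .slice(MAX_CHROMA_PROCS);
  for (const p of doomed) kill(p.pid);
  return doomed.map((p) => p.pid);
}
```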
- DRY: close() now delegates to safeResetConnection() instead of
  duplicating the capture-null-close pattern
- Reduce stale lock timeout from 30s to 10s (spawn should complete
  well within 10s; shorter timeout = faster recovery from crashes)
- Add comment clarifying Windows gap in count-based cleanup

greptile-apps bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

Implements a comprehensive 4-layer defense system to eliminate unbounded worker and chroma-mcp process spawning that crashed WSL2 with 641 chroma-mcp processes in 5 minutes.

Layer 0 (Filesystem Mutex): New singleton-manager.ts uses proper-lockfile to prevent TOCTOU races in worker spawn — only the lock holder spawns; concurrent hooks wait for port health. 10s stale timeout handles crashed processes.

Layer 1 (EADDRINUSE Suicide Pact): Workers that fail to bind the port immediately exit(0) instead of entering initializeBackground() and spawning zombie subprocesses.

Layer 2 (Remove Redundant Commands): Removed 4 redundant start commands from hooks.json — halves spawn attempts since each hook's command already calls ensureWorkerStarted().

Layer 3 (ChromaSync Protection): Connection mutex coalesces concurrent calls onto a single spawn. Circuit breaker stops retry storms after 3 consecutive failures (60s cooldown). Safe state reset closes transport before nulling reference to prevent orphaned chroma-mcp processes.

Layer 4 (Improved Cleanup): Lowered orphan age from 30→5 minutes and added count-based chroma-mcp reaper that keeps 2 newest processes and kills the rest.

The implementation is well-tested with 7 new tests covering mutex contention, EADDRINUSE detection, connection coalescing, and circuit breaker logic.

Confidence Score: 5/5

  • Safe to merge — comprehensive solution with excellent test coverage and defensive error handling
  • Multi-layered defense-in-depth approach with proper error handling, well-tested implementation, and graceful degradation patterns. Each layer independently addresses the spawn storm issue, making the solution robust against edge cases.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| plugin/hooks/hooks.json | Removed 4 redundant start commands from hooks (Layer 2) — each hook's command already calls ensureWorkerStarted() |
| src/services/infrastructure/singleton-manager.ts | New filesystem mutex using proper-lockfile (Layer 0) — prevents TOCTOU race in worker spawn with 10s stale timeout |
| src/services/worker-service.ts | Integrated spawn lock in ensureWorkerStarted() and added EADDRINUSE suicide pact (Layer 1) — failed workers exit(0) immediately |
| src/services/infrastructure/ProcessManager.ts | Lowered orphan age from 30 to 5 minutes and added count-based chroma-mcp reaper (keeps 2 newest, kills rest) |
| src/services/sync/ChromaSync.ts | Added connection mutex, circuit breaker (3 failures → 60s cooldown), safe state reset, and DRY close() logic (Layer 3) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant H1 as Hook Process 1
    participant H2 as Hook Process 2
    participant FS as Filesystem Lock
    participant W as Worker Daemon
    participant P as Port 37777

    Note over H1,H2: Multiple hooks fire simultaneously

    H1->>FS: acquireSpawnLock()
    H2->>FS: acquireSpawnLock()
    FS-->>H1: Lock acquired ✓
    FS-->>H2: ELOCKED (null)

    Note over H2: Falls back to waitForHealth()

    H1->>P: Check if port in use
    P-->>H1: Port free

    H1->>W: spawnDaemon()
    activate W
    W->>P: server.listen(port)

    alt Port bind succeeds
        P-->>W: Listening ✓
        W->>W: writePidFile()
        W->>W: initializeBackground()
        H1->>W: waitForHealth()
        W-->>H1: 200 OK
        H1->>FS: Release lock

        H2->>W: waitForHealth()
        W-->>H2: 200 OK
        Note over H2: Worker ready
    else Port already bound (EADDRINUSE)
        P-->>W: EADDRINUSE error
        W->>W: process.exit(0)
        deactivate W
        Note over W: Suicide pact prevents zombie spawn
    end
```

Last reviewed commit: 0b77f8f

@thedotmack
Owner

Thanks for this thorough process spawning defense! Note: v10.0.7 refactored ChromaSync from MCP stdio transport to an HTTP client model (ChromaServerManager), which may make some of the ChromaSync process-spawning guards obsolete. Could you rebase onto main and evaluate which defense layers are still needed given the new architecture? The worker spawn defense layers (filesystem mutex, EADDRINUSE exit) are likely still valuable.


rodboev commented Feb 17, 2026

I'm actually running a monkey-patched version that uses 1mcp as the host to prevent multiple workers from spawning, with minimal overhead via either a local stdio proxy (like mcp-remote) or HTTP, and chroma under systemd. It turned down the noise a lot. But while questioning whether this was likely to get interest, I noticed unusual traffic and usage patterns and am rebuilding my environment from backups. Given that, it will be a bit longer until I work out how to monkey-patch WSL2 itself and finish, but I have that ready to commit and will definitely revisit this as well. Thanks for letting me know; I was trying to figure out where you had run into issues.

@thedotmack
Owner

Superseded by the embedded Process Supervisor (PR #1370, v10.5.6). Process spawning is now managed through a centralized ProcessRegistry with session-scoped tracking, health checks, and graceful shutdown cascades.

@thedotmack thedotmack closed this Mar 16, 2026
