fix: eliminate unbounded process spawning with 4-layer defense #1085
rodboev wants to merge 9 commits into thedotmack:main
Conversation
Uses proper-lockfile (a per-port lock file) so only one process at a time can attempt to spawn a worker daemon. Only the lock holder can spawn a worker; lock losers get null and fall back to waiting for port health instead. A double-check pattern inside the lock re-verifies health before spawning.
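The acquire-or-null contract behind `acquireSpawnLock()` can be sketched as follows. This is a hypothetical stand-in, not the PR's code: it uses an atomic exclusive `mkdir` in place of proper-lockfile, since directory creation fails with `EEXIST` for every process but one.

```typescript
import { mkdirSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Stand-in for proper-lockfile: mkdir is atomic, so only one process can
// create the per-port lock directory; everyone else gets EEXIST -> null.
function acquireSpawnLock(port: number): (() => void) | null {
  const lockDir = join(tmpdir(), `worker-${port}.lock`);
  try {
    mkdirSync(lockDir);
    return () => rmSync(lockDir, { recursive: true }); // release function
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return null; // lock loser
    throw err;
  }
}

// Clear any stale lock left by a crashed earlier run of this demo.
rmSync(join(tmpdir(), "worker-37777.lock"), { recursive: true, force: true });

const release = acquireSpawnLock(37777); // lock holder: may spawn the daemon
const loser = acquireSpawnLock(37777);   // lock loser: waits for port health instead
console.log(release !== null, loser === null);
release?.();
```

proper-lockfile adds what this sketch omits: stale-lock detection with a timeout and compromise callbacks, which is why the PR reaches for the library instead of a raw mkdir.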
…cesses Workers that lose the port race now exit(0) before entering initializeBackground(), preventing each race-loser from spawning its own tree of chroma-mcp subprocesses.
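The exit-before-init behavior can be demonstrated with two servers contending for one port. This is an illustrative sketch: `startWorker` is a made-up name, and `process.exit(0)` is replaced by an early return so the example can run to completion and report its result.

```typescript
import { createServer } from "node:net";
import { once } from "node:events";

// Sketch of the EADDRINUSE "suicide pact": a worker that cannot bind the
// port stops before any background initialization, so it never spawns
// its own tree of subprocesses.
async function startWorker(port: number): Promise<boolean> {
  const server = createServer();
  server.listen(port);
  const [err] = await Promise.race([
    once(server, "error"),                     // resolves with the bind error
    once(server, "listening").then(() => [null]),
  ]);
  if ((err as NodeJS.ErrnoException | null)?.code === "EADDRINUSE") {
    // Real worker: process.exit(0) here -- initializeBackground() is never reached.
    return false;
  }
  server.close();
  return true; // bound the port; safe to initialize in the background
}

// First server wins an ephemeral port; the second worker must lose the race.
const first = createServer().listen(0);
await once(first, "listening");
const port = (first.address() as { port: number }).port;
const won = await startWorker(port);
console.log("second worker bound port:", won);
first.close();
```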
…ts per event The hook command already calls ensureWorkerStarted() internally (line 1009 of worker-service.ts). The separate start command doubled the spawn attempt for every hook event.
… reset Prevents concurrent ensureConnection() calls from spawning multiple chroma-mcp subprocesses. Circuit breaker stops retry storms after 3 consecutive failures. Safe reset closes transport before nulling reference to prevent orphaned subprocesses. Fixed close() early-return bug where error handlers could skip subprocess cleanup.
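The mutex-plus-breaker combination can be sketched like this. Names and structure are illustrative; only the 3-failure threshold comes from the PR (the 60s cooldown and safe-reset details are omitted for brevity), and the subprocess spawn is simulated by a counter.

```typescript
// Concurrent ensureConnection() calls coalesce onto one in-flight promise,
// and a circuit breaker refuses new attempts after 3 consecutive failures.
let connecting: Promise<string> | null = null;
let failures = 0;
let spawnCount = 0;

async function connect(): Promise<string> {
  spawnCount++; // stand-in for spawning one chroma-mcp subprocess
  return "connected";
}

async function ensureConnection(): Promise<string> {
  if (failures >= 3) throw new Error("circuit open"); // retry-storm cutoff
  if (!connecting) {
    connecting = connect()
      .then((c) => { failures = 0; return c; })   // success closes the breaker
      .catch((e) => { failures++; throw e; })     // failure counts toward it
      .finally(() => { connecting = null; });     // allow a fresh attempt later
  }
  return connecting; // every concurrent caller shares this promise
}

// Ten concurrent callers produce exactly one spawn.
const results = await Promise.all(Array.from({ length: 10 }, () => ensureConnection()));
console.log(results.length, spawnCount);
```

Because the ten calls start before the first one resolves, they all receive the same promise; without the mutex, each call would have spawned its own subprocess.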
Age-based cleanup (was 30min) completely missed spawn storms where all processes are <5min old. Count-based reaper kills excess chroma-mcp regardless of age, keeping only the 2 newest.
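The count-based idea reduces to "sort by start time, keep the N newest, kill the rest". A minimal sketch with a simulated process list (the real reaper parses live process data; `reapExcess` is a hypothetical name):

```typescript
interface Proc { pid: number; startedAt: number; }

// Count-based reaper: returns the PIDs to kill, keeping only the `keep`
// newest processes regardless of their age.
function reapExcess(procs: Proc[], keep = 2): number[] {
  const sorted = [...procs].sort((a, b) => b.startedAt - a.startedAt); // newest first
  return sorted.slice(keep).map((p) => p.pid);
}

const now = Date.now();
const storm: Proc[] = [
  { pid: 101, startedAt: now - 60_000 },
  { pid: 102, startedAt: now - 30_000 },
  { pid: 103, startedAt: now - 10_000 },
  { pid: 104, startedAt: now - 5_000 },
];
// All four are under 5 minutes old, so an age-based sweep would miss them;
// the count-based reaper still flags all but the 2 newest.
console.log(reapExcess(storm)); // [ 102, 101 ]
```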
- DRY: close() now delegates to safeResetConnection() instead of duplicating the capture-null-close pattern - Reduce stale lock timeout from 30s to 10s (spawn should complete well within 10s; shorter timeout = faster recovery from crashes) - Add comment clarifying Windows gap in count-based cleanup
Greptile Summary

Implements a comprehensive 4-layer defense system to eliminate unbounded worker and chroma-mcp process spawning that crashed WSL2 with 641 chroma-mcp processes in 5 minutes.

- Layer 0 (Filesystem Mutex): new `singleton-manager.ts` uses a per-port lock file so only the lock holder can spawn a worker.
- Layer 1 (EADDRINUSE Suicide Pact): workers that fail to bind the port immediately `exit(0)`.
- Layer 2 (Remove Redundant Commands): removed 4 redundant `start` commands from hooks.
- Layer 3 (ChromaSync Protection): connection mutex coalesces concurrent calls onto a single spawn. Circuit breaker stops retry storms after 3 consecutive failures (60s cooldown). Safe state reset closes the transport before nulling the reference to prevent orphaned chroma-mcp processes.
- Layer 4 (Improved Cleanup): lowered orphan age from 30 to 5 minutes and added a count-based chroma-mcp reaper that keeps the 2 newest processes and kills the rest.

The implementation is well-tested with 7 new tests covering mutex contention, EADDRINUSE detection, connection coalescing, and circuit breaker logic.

Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant H1 as Hook Process 1
    participant H2 as Hook Process 2
    participant FS as Filesystem Lock
    participant W as Worker Daemon
    participant P as Port 37777
    Note over H1,H2: Multiple hooks fire simultaneously
    H1->>FS: acquireSpawnLock()
    H2->>FS: acquireSpawnLock()
    FS-->>H1: Lock acquired ✓
    FS-->>H2: ELOCKED (null)
    Note over H2: Falls back to waitForHealth()
    H1->>P: Check if port in use
    P-->>H1: Port free
    H1->>W: spawnDaemon()
    activate W
    W->>P: server.listen(port)
    alt Port bind succeeds
        P-->>W: Listening ✓
        W->>W: writePidFile()
        W->>W: initializeBackground()
        H1->>W: waitForHealth()
        W-->>H1: 200 OK
        H1->>FS: Release lock
        H2->>W: waitForHealth()
        W-->>H2: 200 OK
        Note over H2: Worker ready
    else Port already bound (EADDRINUSE)
        P-->>W: EADDRINUSE error
        W->>W: process.exit(0)
        deactivate W
        Note over W: Suicide pact prevents zombie spawn
    end
```
Last reviewed commit: 0b77f8f
Thanks for this thorough process spawning defense! Note: v10.0.7 refactored ChromaSync from MCP stdio transport to an HTTP client model (…)
I am actually running a monkey-patched version which uses 1mcp as the host to prevent multiple workers spawning. Minimal overhead via either a local stdio proxy (like mcp-remote) or HTTP, with chroma in systemd. It turned down the noise a lot, but while questioning whether this was likely to get interest, I noticed unusual traffic and usage patterns, and am rebuilding my env from backups. Given that, it's just a bit more until I learn how to monkey-patch wsl2 itself and finish, but I have that ready to commit, and will definitely revisit this as well. Thanks for letting me know; I was trying to figure out where you had run into issues.
Superseded by the embedded Process Supervisor (PR #1370, v10.5.6). Process spawning is now managed through a centralized ProcessRegistry with session-scoped tracking, health checks, and graceful shutdown cascades.
Summary
Eliminates the root cause of unbounded worker and chroma-mcp process spawning that repeatedly crashes WSL2:
- Layer 0: filesystem mutex (`proper-lockfile`) prevents the TOCTOU race in worker spawn — only the lock holder can spawn; concurrent hooks wait for port health instead
- Layer 1: workers that fail to bind the port `exit(0)` immediately instead of entering `initializeBackground()` and spawning zombie subprocesses
- Layer 2: removed redundant `start` commands from hooks — halves spawn attempts per hook event (each hook's `hook <event>` command already calls `ensureWorkerStarted()`)

~400 lines across 9 files (including 169 lines of tests). Designed to survive merge conflicts (small, isolated changes; new `singleton-manager.ts` in its own file that no parallel development will touch).
Empirical Evidence

- Spawn attempts per hook event: before, two (`start` + `hook`); after, one (`hook` only)
- Port-race loser: before, entered `initializeBackground()` and spawned subprocesses; after, `exit(0)` immediately

Timing Evidence (from `hook-constants.ts`)
Why Previous Fixes Didn't Stick
PR #1065 Reversion Evidence
- Merged: 2026-02-11T20:43:38Z (commit `3f01baeb`)
- Overwritten: 2026-02-13T03:22:25Z (commit `52ea4520`, +4504/−304 lines)
- This PR: new `singleton-manager.ts` + small isolated changes across 5 existing files
Direct Fixes
Rebuilds Work from PR #1065
PR #1065 (5-layer chroma defense) was merged 2026-02-11 but overwritten by PR #1076's merge conflict resolution 31 hours later — current main has zero spawn storm protection. This PR rebuilds Layers 3-4 from #1065 (connection mutex, circuit breaker, count-based reaper) and adds new Layers 0-2 (worker-level defenses that #1065 did not have).
Complementary PRs (Can Coexist)
Partial Overlaps / Gaps Remaining
- No `taskkill /F` in wrapper; count-based reaper uses `ps` on Unix; Windows cleanup relies on existing age-based `Get-CimInstance` logic

Changes
- `src/services/infrastructure/singleton-manager.ts` — new file; `proper-lockfile` spawn mutex
- `src/services/worker-service.ts` — `ensureWorkerStarted()` + EADDRINUSE exit
- `plugin/hooks/hooks.json` — removed `start` commands
- `src/services/sync/ChromaSync.ts`
- `src/services/infrastructure/ProcessManager.ts`
- `package.json` — `proper-lockfile` + `@types/proper-lockfile`

Review Process
This PR was validated by 5 independent review sources:
Test plan
- `bun test` — 882 pass, 7 new tests all pass (23 pre-existing failures unchanged)
- `tsc --noEmit` — no new type errors