fix(chroma): close transport before nulling to prevent subprocess leak#1372
fix(chroma): close transport before nulling to prevent subprocess leak#1372manuelfedele wants to merge 1 commit intothedotmack:mainfrom
Conversation
In callTool(), both the initial catch and retry catch blocks set this.transport = null without calling transport.close() first. When connectInternal() runs next, it sees transport === null, skips the cleanup, and spawns a new chroma-mcp subprocess while the old one is still alive. Over time this leaks dozens of orphaned uvx/chroma-mcp processes, each downloading a 79MB ONNX model, exhausting RAM and causing MCP search timeouts. Fix: save a reference to the stale transport, null the instance field, then close the saved reference. This ensures the subprocess receives SIGTERM before a replacement is spawned. Fixes thedotmack#1369
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Verdict: Approve-worthy ✅
This is a textbook resource-leak fix. The ordering bug (null reference before close) is subtle and the PR explains it clearly.
What's good:
- Saves stale reference before nulling — correct pattern for async cleanup
- Applied consistently to both the initial catch and retry catch paths
- The
try/catcharoundclose()is appropriate since the subprocess may already be dead - PR description with production evidence (25+ leaked processes, several GB RAM) makes the impact concrete
Potential improvements (non-blocking):
-
Race window on retry path: In the retry catch block,
this.clientandthis.transportare nulled but were they already set to new values by the retry'sensureConnected()? IfensureConnected()succeeded (new transport created) but the actual tool call then threw, you'd be closing the new transport rather than a stale one. Worth tracing the flow to confirm — the original code had the same issue though, so not a regression. -
Structured logging: Consider logging a warning when
staleTransport.close()is called so leaked-process cleanup is observable in logs. Would help confirm the fix is working in production without manual process counting. -
Test coverage gap: The PR notes existing SSL tests are unaffected but there's no test that actually validates the close-before-null ordering. A unit test that mocks
transport.close()and asserts it's called before a new transport is spawned would harden against future regressions. Not blocking for merge given the production urgency.
No security concerns. The fix is correct and addresses a meaningful production reliability issue.
|
Note: The Process Supervisor (PR #1370, v10.5.6) adds a process registry that tracks and reaps subprocess PIDs on session end. However, this PR's fix for the transport null-before-close race in |
Summary
Fixes #1369
callTool()setsthis.transport = nullwithout callingtransport.close()first. WhenconnectInternal()runs next, it seestransport === null, skips cleanup, and spawns a newchroma-mcpsubprocess while the old one is still alive. Over time this leaks dozens of orphaneduvx/chroma-mcpprocesses, each loading a 79MB ONNX model, exhausting RAM and causing MCP search timeouts.Root Cause
Two code paths in
callTool()drop the transport reference without closing it:this.transport, then callsensureConnected()which spawns a new subprocessthis.connected = falsewithout cleaning up transport/client at allMeanwhile,
connectInternal()tries to close the old transport:But since
callTool()already nulled it, this guard is always false, and the old subprocess lives on.Fix
Save a reference to the stale transport before nulling the instance field, then close the saved reference:
Applied to both the initial catch and retry catch blocks.
Impact
Without fix: each transport error accumulates an orphaned
chroma-mcpprocess. In production we observed 25+ leaked processes consuming several GB of RAM.With fix:
transport.close()sends SIGTERM to the subprocess before a replacement is spawned.Test plan