fix(chroma): close transport before nulling to prevent subprocess leak#1372

Open
manuelfedele wants to merge 1 commit into thedotmack:main from manuelfedele:fix/chroma-mcp-subprocess-leak-1369

@manuelfedele

Summary

Fixes #1369

callTool() sets this.transport = null without calling transport.close() first. When connectInternal() runs next, it sees transport === null, skips cleanup, and spawns a new chroma-mcp subprocess while the old one is still alive. Over time this leaks dozens of orphaned uvx/chroma-mcp processes, each loading a 79MB ONNX model, exhausting RAM and causing MCP search timeouts.

Root Cause

Two code paths in callTool() drop the transport reference without closing it:

  1. Initial catch block (transport error) - nulls this.transport, then calls ensureConnected() which spawns a new subprocess
  2. Retry catch block - only sets this.connected = false without cleaning up transport/client at all

Meanwhile, connectInternal() tries to close the old transport:

if (this.transport) try { await this.transport.close() } catch {}

But since callTool() already nulled it, this guard is always false, and the old subprocess lives on.
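
The leak described above can be reproduced in isolation. The following is a minimal sketch, not the real ChromaMcpManager: `FakeTransport`, `LeakyManager`, `onTransportError`, and `demoLeak` are hypothetical names, and the "subprocess" is only a counter. It assumes the error path drops the reference and then reconnects, as the description states.

```typescript
// Hypothetical stand-in for a stdio transport that owns one subprocess.
class FakeTransport {
  static live = 0; // number of simulated subprocesses still alive
  closed = false;

  constructor() {
    FakeTransport.live++;
  }

  async close(): Promise<void> {
    if (!this.closed) {
      this.closed = true;
      FakeTransport.live--;
    }
  }
}

// Simplified model of the buggy flow (assumption, not the project's code).
class LeakyManager {
  transport: FakeTransport | null = null;

  async connectInternal(): Promise<void> {
    // Same guard as quoted in the PR: closes only if the field is still set.
    if (this.transport) try { await this.transport.close(); } catch {}
    this.transport = new FakeTransport();
  }

  async onTransportError(): Promise<void> {
    this.transport = null;        // bug: reference dropped without close()
    await this.connectInternal(); // the guard above is now always false
  }
}

async function demoLeak(): Promise<number> {
  FakeTransport.live = 0;
  const mgr = new LeakyManager();
  await mgr.connectInternal();
  for (let i = 0; i < 3; i++) {
    await mgr.onTransportError();
  }
  return FakeTransport.live; // 4: the current transport plus three orphans
}
```

Each simulated error adds one transport that nothing references anymore, which mirrors the orphaned-process growth reported in the summary.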

Fix

Save a reference to the stale transport before nulling the instance field, then close the saved reference:

const staleTransport = this.transport;
this.client = null;
this.transport = null;
if (staleTransport) {
  try { await staleTransport.close(); } catch { /* subprocess may already be dead */ }
}

Applied to both the initial catch and retry catch blocks.
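
Since the same cleanup is needed in two catch blocks, one way to keep them in sync is a shared helper. This is a sketch under assumptions: `CountingTransport`, `FixedManager`, `resetConnection`, and `demoFixed` are hypothetical names that do not appear in the PR, and the real class may structure this differently.

```typescript
// Hypothetical transport stand-in that counts live simulated subprocesses.
class CountingTransport {
  static live = 0;
  private closed = false;

  constructor() {
    CountingTransport.live++;
  }

  async close(): Promise<void> {
    if (this.closed) return; // closing twice is harmless
    this.closed = true;
    CountingTransport.live--;
  }
}

class FixedManager {
  client: object | null = null;
  transport: CountingTransport | null = null;
  connected = false;

  async connect(): Promise<void> {
    await this.resetConnection(); // no-op when nothing is open
    this.transport = new CountingTransport();
    this.client = {};
    this.connected = true;
  }

  // Intended to be called from every error path, so none can skip close().
  async resetConnection(): Promise<void> {
    const staleTransport = this.transport; // save before nulling
    this.connected = false;
    this.client = null;
    this.transport = null;
    if (staleTransport) {
      try {
        await staleTransport.close();
      } catch {
        /* subprocess may already be dead */
      }
    }
  }
}

async function demoFixed(): Promise<number> {
  CountingTransport.live = 0;
  const mgr = new FixedManager();
  await mgr.connect();
  for (let i = 0; i < 3; i++) {
    await mgr.resetConnection(); // simulated transport error
    await mgr.connect();         // reconnect spawns exactly one replacement
  }
  return CountingTransport.live; // 1: only the current transport remains
}
```

Nulling the fields before awaiting `close()` means a concurrent caller never sees a half-closed transport, while the saved reference still guarantees the old one is shut down.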

Impact

Without fix: each transport error leaves behind one orphaned chroma-mcp process. In production we observed 25+ leaked processes consuming several GB of RAM.

With fix: transport.close() sends SIGTERM to the subprocess before a replacement is spawned.

Test plan

  • Verified fix on local machine: transport errors no longer accumulate orphaned processes
  • Existing SSL test suite unaffected (tests mock transport)

In callTool(), both the initial catch and retry catch blocks set
this.transport = null without calling transport.close() first.
When connectInternal() runs next, it sees transport === null,
skips the cleanup, and spawns a new chroma-mcp subprocess while
the old one is still alive. Over time this leaks dozens of orphaned
uvx/chroma-mcp processes, each loading a 79MB ONNX model,
exhausting RAM and causing MCP search timeouts.

Fix: save a reference to the stale transport, null the instance
field, then close the saved reference. This ensures the subprocess
receives SIGTERM before a replacement is spawned.

Fixes thedotmack#1369

@xkonjin xkonjin left a comment

Code Review

Verdict: Approve-worthy

This is a textbook resource-leak fix. The ordering bug (null reference before close) is subtle and the PR explains it clearly.

What's good:

  • Saves stale reference before nulling — correct pattern for async cleanup
  • Applied consistently to both the initial catch and retry catch paths
  • The try/catch around close() is appropriate since the subprocess may already be dead
  • PR description with production evidence (25+ leaked processes, several GB RAM) makes the impact concrete

Potential improvements (non-blocking):

  1. Race window on retry path: by the time the retry catch block nulls this.client and this.transport, the retry's ensureConnected() may already have replaced them with new values. If ensureConnected() succeeded (new transport created) but the tool call itself then threw, the block would close the new transport rather than a stale one. Worth tracing the flow to confirm. The original code had the same issue, so this is not a regression.

  2. Structured logging: Consider logging a warning when staleTransport.close() is called so leaked-process cleanup is observable in logs. Would help confirm the fix is working in production without manual process counting.

  3. Test coverage gap: The PR notes existing SSL tests are unaffected but there's no test that actually validates the close-before-null ordering. A unit test that mocks transport.close() and asserts it's called before a new transport is spawned would harden against future regressions. Not blocking for merge given the production urgency.
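
Points 2 and 3 could be covered together along these lines. This is only a sketch: `makeHarness`, `Manager`, `reconnectAfterError`, and `runOrderingCheck` are hypothetical names, the transport is a plain object rather than the project's mock, and the warning is collected through an injected callback rather than any particular logger.

```typescript
type Transport = { id: number; close(): Promise<void> };

// Hypothetical harness: records spawn and close events in the order they occur.
function makeHarness() {
  const events: string[] = [];
  let nextId = 1;
  const spawn = (): Transport => {
    const id = nextId++;
    events.push(`spawn:${id}`);
    return {
      id,
      close: async () => {
        events.push(`close:${id}`);
      },
    };
  };
  return { events, spawn };
}

// Simplified manager using the save, null, close ordering from the PR.
class Manager {
  transport: Transport | null = null;
  spawn: () => Transport;
  warn: (msg: string) => void;

  constructor(spawn: () => Transport, warn: (msg: string) => void) {
    this.spawn = spawn;
    this.warn = warn;
  }

  async reconnectAfterError(): Promise<void> {
    const stale = this.transport;
    this.transport = null;
    if (stale) {
      this.warn(`closing stale transport ${stale.id}`); // makes cleanup visible
      try {
        await stale.close();
      } catch {
        /* subprocess may already be dead */
      }
    }
    this.transport = this.spawn();
  }
}

async function runOrderingCheck(): Promise<string[]> {
  const { events, spawn } = makeHarness();
  const warnings: string[] = [];
  const mgr = new Manager(spawn, (m) => warnings.push(m));
  await mgr.reconnectAfterError(); // first connect: nothing to close
  await mgr.reconnectAfterError(); // simulated transport error
  return [...events, ...warnings.map((w) => `warn:${w}`)];
}
```

A regression test would then assert that `close:1` appears before `spawn:2` in the returned list, and that the warning was emitted once.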

No security concerns. The fix is correct and addresses a meaningful production reliability issue.

@thedotmack
Owner

Note: The Process Supervisor (PR #1370, v10.5.6) adds a process registry that tracks and reaps subprocess PIDs on session end. However, this PR's fix for the transport null-before-close race in ChromaMcpManager is still needed — the supervisor catches orphans after the fact, but this PR prevents the leak at the source. Should still land.

Development

Successfully merging this pull request may close these issues.

chroma-mcp subprocess leak: callTool and onclose null transport before connectInternal can close it

3 participants