fix(chroma): close transport before nulling to prevent subprocess leak by manuelfedele · Pull Request #1372 · thedotmack/claude-mem

manuelfedele · 2026-03-16T08:37:57Z

Summary

callTool() sets this.transport = null without calling transport.close() first. When connectInternal() runs next, it sees transport === null, skips cleanup, and spawns a new chroma-mcp subprocess while the old one is still alive. Over time this leaks dozens of orphaned uvx/chroma-mcp processes, each loading a 79MB ONNX model, exhausting RAM and causing MCP search timeouts.

Root Cause

Two code paths in callTool() drop the transport reference without closing it:

Initial catch block (transport error): nulls this.transport, then calls ensureConnected(), which spawns a new subprocess.
Retry catch block: only sets this.connected = false and does not clean up the transport or client at all.

Meanwhile, connectInternal() tries to close the old transport:

if (this.transport) try { await this.transport.close() } catch {}

But since callTool() already nulled it, this guard is always false, and the old subprocess lives on.
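
The mechanism can be reproduced in isolation. The sketch below is illustrative only: FakeTransport, LeakyManager, onTransportError, and demoLeak are hypothetical names, not code from the repository, and a static counter stands in for real chroma-mcp subprocesses.

```typescript
// Hypothetical stand-in for a stdio transport that owns one subprocess.
class FakeTransport {
  static alive = 0; // number of simulated subprocesses still running
  private closed = false;

  constructor() {
    FakeTransport.alive++;
  }

  async close(): Promise<void> {
    if (!this.closed) {
      this.closed = true;
      FakeTransport.alive--;
    }
  }
}

class LeakyManager {
  transport: FakeTransport | null = null;

  // Same guard as quoted above: cleanup only runs if the field is still set.
  async connectInternal(): Promise<void> {
    if (this.transport) try { await this.transport.close(); } catch {}
    this.transport = new FakeTransport();
  }

  // Simplified version of the buggy catch path: the reference is dropped
  // without close(), so the guard in connectInternal() never fires.
  async onTransportError(): Promise<void> {
    this.transport = null;
    await this.connectInternal();
  }
}

// Returns how many simulated subprocesses are alive after `errors` failures.
async function demoLeak(errors: number): Promise<number> {
  FakeTransport.alive = 0;
  const manager = new LeakyManager();
  await manager.connectInternal();
  for (let i = 0; i < errors; i++) {
    await manager.onTransportError();
  }
  return FakeTransport.alive;
}
```

In this model, demoLeak(3) resolves to 4: the live transport plus one orphan per simulated error.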

Fix

Save a reference to the stale transport before nulling the instance field, then close the saved reference:

const staleTransport = this.transport;
this.client = null;
this.transport = null;
if (staleTransport) {
  try { await staleTransport.close(); } catch { /* subprocess may already be dead */ }
}

Applied to both the initial catch and retry catch blocks.
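
Since the same sequence is needed in both catch blocks, it could be factored into one helper. This is a sketch under assumptions, not the PR's actual diff: ConnectionState, Closable, and reset() are hypothetical names, and setting connected = false inside the helper is an assumption based on the retry path described above.

```typescript
// Hypothetical minimal shape of a transport: anything that can be closed.
interface Closable {
  close(): Promise<void>;
}

class ConnectionState {
  client: object | null = null;
  transport: Closable | null = null;
  connected = false;

  // Detach the fields first so no caller can reuse the dead transport,
  // then close the saved reference so the subprocess is not orphaned.
  async reset(): Promise<void> {
    const staleTransport = this.transport;
    this.client = null;
    this.transport = null;
    this.connected = false; // assumption: both paths should mark the connection down
    if (staleTransport) {
      try {
        await staleTransport.close();
      } catch {
        /* subprocess may already be dead */
      }
    }
  }
}
```

Each catch block would then await reset() before reconnecting, which keeps the two paths from drifting apart again.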

Impact

Without fix: each transport error leaves behind one orphaned chroma-mcp process. In production we observed 25+ leaked processes consuming several GB of RAM.

With fix: transport.close() sends SIGTERM to the subprocess before a replacement is spawned.

Test plan

Verified fix on local machine: transport errors no longer accumulate orphaned processes
Existing SSL test suite unaffected (tests mock transport)

In callTool(), both the initial catch and retry catch blocks set this.transport = null without calling transport.close() first. When connectInternal() runs next, it sees transport === null, skips the cleanup, and spawns a new chroma-mcp subprocess while the old one is still alive. Over time this leaks dozens of orphaned uvx/chroma-mcp processes, each downloading a 79MB ONNX model, exhausting RAM and causing MCP search timeouts.

Fix: save a reference to the stale transport, null the instance field, then close the saved reference. This ensures the subprocess receives SIGTERM before a replacement is spawned.

Fixes thedotmack#1369

xkonjin

Code Review

Verdict: Approve-worthy ✅

This is a textbook resource-leak fix. The ordering bug (null reference before close) is subtle and the PR explains it clearly.

What's good:

Saves stale reference before nulling — correct pattern for async cleanup
Applied consistently to both the initial catch and retry catch paths
The try/catch around close() is appropriate since the subprocess may already be dead
PR description with production evidence (25+ leaked processes, several GB RAM) makes the impact concrete

Potential improvements (non-blocking):

Race window on retry path: In the retry catch block, this.client and this.transport are nulled, but had the retry's ensureConnected() already set them to new values? If ensureConnected() succeeded (new transport created) and the tool call itself then threw, you'd be closing the new transport rather than a stale one. Worth tracing the flow to confirm. The original code had the same issue, though, so this is not a regression.
Structured logging: Consider logging a warning when staleTransport.close() is called so leaked-process cleanup is observable in logs. Would help confirm the fix is working in production without manual process counting.
Test coverage gap: The PR notes existing SSL tests are unaffected but there's no test that actually validates the close-before-null ordering. A unit test that mocks transport.close() and asserts it's called before a new transport is spawned would harden against future regressions. Not blocking for merge given the production urgency.
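
To make the last two suggestions concrete, here is a rough sketch of a regression check that records events in order and takes an optional log callback. Manager, Spawn, recover(), and recordRecovery() are hypothetical stand-ins, not the real ChromaMcpManager API, and the spawn function is injected so no real subprocess is started.

```typescript
type Transport = { close(): Promise<void> };
type Spawn = () => Transport;

// Hypothetical stand-in for the manager, with spawn and logging injected.
class Manager {
  private transport: Transport | null = null;
  private spawn: Spawn;
  private log: (msg: string) => void;

  constructor(spawn: Spawn, log: (msg: string) => void = () => {}) {
    this.spawn = spawn;
    this.log = log;
  }

  async connect(): Promise<void> {
    if (this.transport) return;
    this.transport = this.spawn();
  }

  // Close-before-respawn recovery, with a log line for observability.
  async recover(): Promise<void> {
    const stale = this.transport;
    this.transport = null;
    if (stale) {
      this.log("closing stale transport");
      try {
        await stale.close();
      } catch {
        /* subprocess may already be dead */
      }
    }
    await this.connect();
  }
}

// Runs one connect + recover cycle and returns the ordered event log.
async function recordRecovery(): Promise<string[]> {
  const events: string[] = [];
  let nextId = 0;
  const spawn: Spawn = () => {
    const id = nextId++;
    events.push(`spawn:${id}`);
    return {
      close: async () => {
        events.push(`close:${id}`);
      },
    };
  };
  const manager = new Manager(spawn, (msg) => {
    events.push(`log:${msg}`);
  });
  await manager.connect();
  await manager.recover();
  return events;
}
```

A test can then assert that close:0 appears before spawn:1 in the returned list, which fails if the close call is ever dropped or reordered after the respawn.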

No security concerns. The fix is correct and addresses a meaningful production reliability issue.

thedotmack · 2026-03-16T22:05:53Z

Note: The Process Supervisor (PR #1370, v10.5.6) adds a process registry that tracks and reaps subprocess PIDs on session end. However, this PR's fix for the transport null-before-close race in ChromaMcpManager is still needed — the supervisor catches orphans after the fact, but this PR prevents the leak at the source. Should still land.

xkonjin reviewed Mar 16, 2026

manuelfedele wants to merge 1 commit into thedotmack:main from manuelfedele:fix/chroma-mcp-subprocess-leak-1369
