fix: critical bug fixes, snap sandbox support, and resource monitoring #1102

Open
chadru wants to merge 8 commits into thedotmack:main from chadru:fix/critical-bug-fixes

@chadru chadru commented Feb 14, 2026

Summary

8 commits rebased cleanly onto v10.0.7 (post PR #792 HTTP Chroma migration).

  • 7 critical bug fixes: stuck messages, race conditions, path handling, process leaks, hook timeouts, CORS, search wildcards, migration idempotency
  • Bun snap sandbox support: resolves node/claude/uvx paths when running under snap confinement (common on Ubuntu/WSL) — fixes MCP, SDK agent, and Chroma vector search
  • ResourceMonitor: lightweight memory/token leak detection agent (30s sampling, ring buffer, anomaly alerts)
  • MCP non-blocking init: core worker initializes immediately instead of waiting up to 5 minutes for MCP connection
  • Greptile review fixes: PATH dedup uses exact segment matching (split(':') instead of substring), MCP timeout promise cleanup on success/failure

Fixes addressed

  • Process leaks: ensureProcessExit() on idle timeout and natural completion
  • Hook timeouts: fetchWithTimeout across all 6 hook handlers (context, file-edit, observation, user-message, session-complete, session-init)
  • CORS: regex-based origin matching for localhost variants
  • Search: wildcard * queries routed to SQLite instead of Chroma
  • Migration: idempotent Migration004 with table existence check
  • isProjectRoot: git subdirectory detection via git rev-parse --show-toplevel
  • CLAUDECODE env var stripped from SDK subprocess to prevent nested session rejection
  • Snap PATH: derives nvm bin dir from CLAUDE_CODE_PATH, resolves real homedir for ~/.local/bin
  • MCP timeout: clearTimeout() in both success and failure paths to prevent timer leak
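
The PR names fetchWithTimeout but does not show its body, so the following is only a minimal sketch of how such a wrapper is commonly written with AbortController. The `fetchImpl` parameter is a hypothetical injection point added here so the helper can be exercised without a network; the 15s default mirrors the value quoted for session-init.

```typescript
// Sketch only: not the PR's actual implementation.
// `fetchImpl` is a hypothetical parameter used for testing without a network.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = 15_000,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const controller = new AbortController();
  // Abort the in-flight request once the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { ...init, signal: controller.signal });
  } finally {
    // Clear the timer on success and on failure so it cannot leak.
    clearTimeout(timer);
  }
}
```

A hook handler would then await fetchWithTimeout(...) in place of fetch(...) and treat a rejection as "worker unavailable" rather than hanging indefinitely.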

Test plan

  • Build passes (npm run build)
  • 907 tests pass, 32 fail (pre-existing — unrelated to this PR), 0 regressions
  • Worker starts healthy on port 37777 with status: ok
  • Rebased cleanly onto v10.0.7 (no conflicts with the HTTP Chroma migration from PR #792, "feat: Replace MCP subprocess with persistent Chroma HTTP server")
  • All 3 Greptile review comments addressed
  • Tested on Ubuntu 24.04 WSL2 with Bun snap (v1.3.9) and nvm node v22.22.0

🤖 Generated with Claude Code

@greptile-apps
Contributor

greptile-apps bot commented Feb 14, 2026

Greptile Overview

Greptile Summary

This PR delivers 7 critical bug fixes, Bun snap sandbox support, and resource monitoring infrastructure. The changes address production stability issues:

  • Process leaks: fixed by calling ensureProcessExit() after abort()
  • Race conditions: resolved with atomic transactions wrapping observation storage and message confirmation
  • Stuck messages: recovered via a 60s periodic cleanup
  • Hook timeouts: prevented with the fetchWithTimeout() wrapper
  • CORS: improved with regex-based origin matching
  • Wildcard search: routed to SQLite
  • Migration idempotency: achieved with table existence checks
  • Git subdirectory detection: fixed using git rev-parse

Snap sandbox support resolves PATH restrictions by deriving the nvm bin directory from the CLAUDE_CODE_PATH setting and augmenting PATH with ~/.local/bin. ResourceMonitor provides lightweight token/memory leak detection with 30s sampling, ring buffer history, and anomaly alerts. MCP initialization is now non-blocking (60s timeout), so core functionality initializes immediately instead of waiting up to 5 minutes. The changes are well-structured, defensive, and include proper error handling.

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations around testing edge cases
  • Score reflects comprehensive bug fixes addressing critical production issues (process leaks, race conditions, stuck messages). All changes are defensive and include proper error handling. ResourceMonitor is observability-only (no side effects). The snap sandbox support is well-isolated. Main consideration is the scope of changes (26 files) requiring thorough integration testing.
  • Pay close attention to src/services/worker-service.ts (MCP non-blocking init timing) and src/services/worker/agents/ResponseProcessor.ts (atomic transaction pattern)

Important Files Changed

Filename | Overview
--- | ---
src/services/infrastructure/ResourceMonitor.ts | New ResourceMonitor with memory/token leak detection. Well-structured with ring buffer, anomaly detection algorithms, and configurable thresholds.
src/shared/EnvManager.ts | Snap sandbox support added: resolves node/claude/uvx paths from CLAUDE_CODE_PATH setting, fixes Bun snap PATH restrictions, strips CLAUDECODE env var.
src/services/worker-service.ts | MCP non-blocking init (60s timeout), stuck message recovery interval, ResourceMonitor integration, PATH augmentation for snap support.
src/services/worker/SessionManager.ts | Process leak fix: onIdleTimeout now calls ensureProcessExit() after abort() to guarantee subprocess termination (Issues #1010, #1068, #1089, #1090).
src/services/worker/agents/ResponseProcessor.ts | Critical race condition fix: atomic transaction wraps storeObservations + confirmProcessed to prevent duplicate observations on crash (Issues #1036, #1091).
src/cli/handlers/session-init.ts | Hook timeout fix: replaced fetch with fetchWithTimeout (15s) to prevent indefinite hangs.
src/services/sqlite/migrations/runner.ts | Migration idempotency fix: checks table existence and uses INSERT OR IGNORE to handle interrupted migrations (Issue #979).
src/services/worker/http/middleware.ts | CORS regex fix: proper localhost/127.0.0.1/[::1] matching with optional port numbers (Issue #1029).
src/services/worker/search/SearchOrchestrator.ts | Wildcard search fix: treats '*' as filter-only query, routes to SQLite instead of Chroma (Issue #714).
src/utils/claude-md-utils.ts | Git subdirectory detection fix: uses 'git rev-parse --show-toplevel' instead of just checking for .git directory (Issue #793).
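
To make the CORS entry for src/services/worker/http/middleware.ts concrete, here is a rough sketch of localhost origin matching. The overview describes the behaviour but the actual regex is not shown, so the pattern and the isAllowedOrigin name are assumptions, as is the choice to allow requests that carry no Origin header.

```typescript
// Assumed pattern: matches localhost, 127.0.0.1 and [::1], with or without a port.
const LOCALHOST_ORIGIN = /^https?:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d{1,5})?$/;

// Hypothetical helper name; the PR does not show the real middleware code.
function isAllowedOrigin(origin: string | undefined): boolean {
  // Assumption of this sketch: requests without an Origin header are allowed.
  if (!origin) return true;
  // Anchoring with ^ and $ rejects look-alikes such as http://localhost.evil.com.
  return LOCALHOST_ORIGIN.test(origin);
}
```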

Flowchart

flowchart TD
    A[Hook Handlers] -->|fetchWithTimeout 10-15s| B[Worker Service API]
    B -->|Queue Message| C[PendingMessageStore]
    C -->|MAX_QUEUE_DEPTH=100| D{Queue Full?}
    D -->|Yes| E[Drop Oldest Pending]
    D -->|No| F[Enqueue Message]
    
    G[SessionManager] -->|Claim & Process| C
    G -->|Spawns| H[Claude SDK Agent]
    H -->|Generates| I[Observations]
    
    I -->|Atomic Transaction| J[ResponseProcessor]
    J -->|1. Store Observations| K[SQLite DB]
    J -->|2. Confirm Messages| C
    
    L[Idle Timeout] -->|abort + ensureProcessExit| H
    M[Stuck Message Recovery] -->|Every 60s| C
    M -->|Reset processing > 5min| N[Back to Pending]
    
    O[ResourceMonitor] -->|30s Sampling| P[Memory + Tokens]
    P -->|Detect Anomalies| Q[Alerts]
    
    R[Snap Sandbox] -->|resolveRuntimeBinDir| S[CLAUDE_CODE_PATH]
    S -->|Augment PATH| T[node/claude/uvx]
    
    U[MCP Init] -->|Non-blocking 60s| V{Success?}
    V -->|Yes| W[Vector Search Available]
    V -->|No| X[Core Still Works]

Last reviewed commit: e5eeb5b

@thedotmack
Owner

@chadru have you ever seen a greptile chart THIS clean??? This is a REALLY great job, running it to see how it goes! :)

@chadru
Author

chadru commented Feb 14, 2026

Thank you, working on merging it with the new release. It looks like it dropped while I was working on it.

@chadru chadru closed this Feb 14, 2026
@chadru chadru reopened this Feb 15, 2026
chadru and others added 6 commits February 14, 2026 17:02
…th handling

1. Periodic stuck message recovery (thedotmack#1036, thedotmack#1052): Add 60s interval to
   reset messages stuck in 'processing' for >5min back to 'pending'.

2. Atomic observation storage + confirmation (thedotmack#1036, thedotmack#1091): Wrap
   storeObservations() and confirmProcessed() in a single db.transaction()
   to prevent duplicate observations on worker crash.

3. Empty project race condition (thedotmack#1046): Pass cwd as project fallback
   when creating sessions from PostToolUse observations, leveraging
   existing backfill logic in createSDKSession().

4. Queue depth limit: Add MAX_QUEUE_DEPTH=100 to PendingMessageStore
   to prevent unbounded queue growth when SDK agent is stuck/failing.

5. HealthMonitor ENOENT crash (thedotmack#1042): Add ENOENT/EBUSY handling to
   getInstalledPluginVersion() matching existing worker-utils.ts pattern.

6. ChromaSync hardcoded path: Replace os.homedir() hardcoded paths with
   VECTOR_DB_DIR and DATA_DIR from centralized paths.ts module.

7. smart-install.js hardcoded paths (thedotmack#1030): Respect CLAUDE_CONFIG_DIR
   env var for XDG-compatible config directory resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
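
Items 1 and 4 of the commit message above can be illustrated with an in-memory stand-in for PendingMessageStore. This is a sketch under assumptions: the real store is SQLite-backed, and the class, method names, and the decision to count only pending messages against MAX_QUEUE_DEPTH are hypothetical. Only the 100-message limit, the 5-minute threshold, and the "reset to pending" behaviour come from the PR.

```typescript
type Status = 'pending' | 'processing';

interface PendingMessage {
  id: number;
  status: Status;
  startedAt: number | null; // epoch ms when processing began
}

const MAX_QUEUE_DEPTH = 100;            // from the commit message
const STUCK_THRESHOLD_MS = 5 * 60_000;  // "stuck in 'processing' for >5min"

// Hypothetical in-memory stand-in for the SQLite-backed store.
class InMemoryPendingStore {
  private messages: PendingMessage[] = [];
  private nextId = 1;

  enqueue(): number {
    const pending = this.messages.filter(m => m.status === 'pending');
    if (pending.length >= MAX_QUEUE_DEPTH) {
      // Queue is full: drop the oldest pending message to bound growth.
      const oldest = pending[0];
      this.messages = this.messages.filter(m => m !== oldest);
    }
    const id = this.nextId++;
    this.messages.push({ id, status: 'pending', startedAt: null });
    return id;
  }

  claim(now: number): PendingMessage | undefined {
    const next = this.messages.find(m => m.status === 'pending');
    if (next) {
      next.status = 'processing';
      next.startedAt = now;
    }
    return next;
  }

  // Reset messages that have been 'processing' for too long back to 'pending'.
  recoverStuck(now: number): number {
    let recovered = 0;
    for (const m of this.messages) {
      if (m.status === 'processing' && m.startedAt !== null && now - m.startedAt > STUCK_THRESHOLD_MS) {
        m.status = 'pending';
        m.startedAt = null;
        recovered++;
      }
    }
    return recovered;
  }

  pendingCount(): number {
    return this.messages.filter(m => m.status === 'pending').length;
  }
}
```

In a worker, recoverStuck(Date.now()) would be called from a 60-second setInterval, matching the interval described in item 1.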
…y, and CLAUDE.md scope

- Fix zombie subprocess leaks: verify process exit on idle timeout and natural completion (thedotmack#1010, thedotmack#1068, thedotmack#1089, thedotmack#1090)
- Add fetchWithTimeout to all hook handlers: prevent indefinite hangs when worker is slow (thedotmack#1079, thedotmack#730)
- Fix CORS origin matching: handle localhost without port and IPv6 [::1] (thedotmack#1029)
- Fix wildcard search: route query="*" to SQLite instead of Chroma which can't handle it (thedotmack#714)
- Fix migration004 idempotency: check both version tracking AND table existence for partial migrations (thedotmack#979)
- Fix isProjectRoot: detect subdirectories within git repos using git rev-parse (thedotmack#793)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
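
The wildcard fix in this commit amounts to a routing decision before the query reaches the vector store. A minimal sketch, assuming a hypothetical selectBackend function and request shape (the actual SearchOrchestrator code is not shown in this PR page); treating an empty query the same way as "*" is also an assumption.

```typescript
type SearchBackend = 'sqlite' | 'chroma';

// Hypothetical request shape for illustration only.
interface SearchRequest {
  query?: string;
}

// Decide which backend should serve a search request.
function selectBackend(req: SearchRequest): SearchBackend {
  const q = (req.query ?? '').trim();
  // A bare "*" (or, by assumption, an empty query) has no semantic content,
  // so it is treated as filter-only and served from SQLite.
  if (q === '' || q === '*') return 'sqlite';
  return 'chroma';
}
```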
New infrastructure service that samples process.memoryUsage() and per-session
token rates every 30s, maintains a 1-hour ring buffer of snapshots, and
detects anomalies:

- Memory leak: alerts on monotonic RSS growth >20% over 10 samples
- High memory: alerts when RSS exceeds 512MB
- Token runaway: alerts when any session exceeds 50k tokens/min
- Exposes diagnostics via GET /api/diagnostics/resources endpoint
- Alert deduplication within 5-minute windows
- Clean shutdown with interval cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
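
The memory checks described in this commit can be sketched as a ring buffer plus a detection pass. The thresholds (30s sampling, one hour of history, >20% growth over 10 samples, 512MB) come from the commit message; every identifier below is hypothetical, "monotonic" is read as strictly increasing, and token-rate tracking and alert deduplication are left out.

```typescript
const SAMPLE_INTERVAL_MS = 30_000;
const BUFFER_SIZE = (60 * 60_000) / SAMPLE_INTERVAL_MS; // 120 snapshots cover one hour
const LEAK_WINDOW = 10;    // samples inspected for monotonic growth
const LEAK_GROWTH = 0.2;   // >20% growth across the window
const HIGH_RSS_BYTES = 512 * 1024 * 1024;

// Fixed-capacity buffer that discards the oldest entry when full.
class RingBuffer<T> {
  private items: T[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) this.items.shift();
  }

  last(n: number): T[] {
    return this.items.slice(-n);
  }

  get size(): number {
    return this.items.length;
  }
}

// Returns alert names for the most recent RSS samples (bytes).
function detectMemoryAnomalies(rss: RingBuffer<number>): string[] {
  const alerts: string[] = [];
  const recent = rss.last(LEAK_WINDOW);
  if (recent.length === 0) return alerts;
  const first = recent[0];
  const latest = recent[recent.length - 1];
  if (latest > HIGH_RSS_BYTES) alerts.push('high-memory');
  if (recent.length === LEAK_WINDOW) {
    // Assumption: "monotonic" means every sample is larger than the previous one.
    const monotonic = recent.every((v, i) => i === 0 || v > recent[i - 1]);
    if (monotonic && (latest - first) / first > LEAK_GROWTH) alerts.push('memory-leak');
  }
  return alerts;
}
```

A monitor built on this would push process.memoryUsage().rss into the buffer every SAMPLE_INTERVAL_MS and clear that interval on shutdown.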
MCP connection (vector search) was blocking initializeBackground() with a
5-minute timeout. If MCP failed to connect, the entire worker was stuck:
no orphan reaper, no resource monitor, no stuck message recovery, and all
data API endpoints returned "Service initializing" indefinitely.

Now core init completes first (DB, search routes, orphan reaper, resource
monitor), then MCP connects in the background. Timeout reduced from 5min
to 60s. MCP failure only disables vector search, not the whole worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Bun snap sandbox strips user PATH, preventing the worker from
finding node and claude executables installed via nvm. Derives the
runtime bin directory from the CLAUDE_CODE_PATH setting and augments
PATH in both the MCP transport env and SDK agent isolated env.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
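
A rough sketch of the PATH augmentation this commit describes. resolveRuntimeBinDir appears in the review flowchart, but its body is not shown, so the implementation here (taking the directory that contains the claude executable) is an assumption, as are the augmentPath name and the choice to prepend rather than append missing directories. POSIX path handling is used because snap confinement is Linux-only.

```typescript
import { posix as path } from 'node:path';

// Assumed behaviour: the bin dir is the directory holding the claude executable,
// which under nvm also holds node.
function resolveRuntimeBinDir(claudeCodePath: string): string {
  return path.dirname(claudeCodePath);
}

// Hypothetical helper: add any missing directories to a PATH string.
function augmentPath(currentPath: string, extraDirs: string[], delimiter: string = ':'): string {
  const segments = currentPath.split(delimiter).filter(Boolean);
  // Compare whole segments, not substrings, so "/opt/usr/bin" cannot hide a missing "/usr/bin".
  const missing = extraDirs.filter(d => !segments.includes(d));
  return [...missing, ...segments].join(delimiter);
}
```

The result would then be assigned to the PATH entry of both the MCP transport env and the SDK agent's isolated env.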
Claude Code sets CLAUDECODE=1 to prevent nested sessions. The worker
daemon inherits this from hook invocations, causing the SDK agent's
spawned claude subprocess to refuse to start (exit code 1). Add
CLAUDECODE to the blocked env vars list alongside ANTHROPIC_API_KEY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
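
The blocked-variable approach described in this commit can be sketched as a filter over the parent environment. ANTHROPIC_API_KEY and CLAUDECODE are the two names given in the commit message; the constant and function names are hypothetical, and the real list in EnvManager.ts may contain more entries.

```typescript
// Only these two names are confirmed by the commit message.
const BLOCKED_ENV_VARS = ['ANTHROPIC_API_KEY', 'CLAUDECODE'];

// Hypothetical helper: copy the parent env, dropping blocked and unset variables.
function buildIsolatedEnv(parentEnv: Record<string, string | undefined>): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [key, value] of Object.entries(parentEnv)) {
    if (value === undefined) continue;
    // Without this, an inherited CLAUDECODE=1 makes the spawned claude refuse to start.
    if (BLOCKED_ENV_VARS.includes(key)) continue;
    env[key] = value;
  }
  return env;
}
```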
Contributor

@greptile-apps greptile-apps bot left a comment

26 files reviewed, 3 comments

Comment on lines +440 to +449
Promise.race([mcpConnectionPromise, timeoutPromise])
  .then(() => {
    this.mcpReady = true;
    logger.success('WORKER', 'Connected to MCP server');
  })
  .catch((error) => {
    logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', {
      error: error instanceof Error ? error.message : String(error)
    });
  });
Contributor

timeout promise never cleaned up if connection succeeds

Suggested change
Promise.race([mcpConnectionPromise, timeoutPromise])
  .then(() => {
    this.mcpReady = true;
    logger.success('WORKER', 'Connected to MCP server');
  })
  .catch((error) => {
    logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', {
      error: error instanceof Error ? error.message : String(error)
    });
  });
// Start the timer inside the executor so `reject` is in scope; keep the id for cleanup.
let timeoutId: ReturnType<typeof setTimeout> | undefined;
const timeoutPromise = new Promise<never>((_, reject) => {
  timeoutId = setTimeout(() => reject(new Error('MCP connection timeout after 60s')), MCP_INIT_TIMEOUT_MS);
});
Promise.race([mcpConnectionPromise, timeoutPromise])
  .then(() => {
    clearTimeout(timeoutId);
    this.mcpReady = true;
    logger.success('WORKER', 'Connected to MCP server');
  })
  .catch((error) => {
    clearTimeout(timeoutId);
    logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', {
      error: error instanceof Error ? error.message : String(error)
    });
  });

Author

Fixed in b9d2f89 — added clearTimeout(timeoutId) in both .then() and .catch() paths.

const extraDirs = [runtimeBinDir, existsSync(localBinDir) ? localBinDir : null].filter(Boolean) as string[];
if (extraDirs.length > 0) {
const currentPath = isolatedEnv.PATH || '';
const missing = extraDirs.filter(d => !currentPath.includes(d));
Contributor

substring match may cause false positives (e.g., a PATH entry /opt/usr/bin contains /usr/bin)

Suggested change
const missing = extraDirs.filter(d => !currentPath.includes(d));
const missing = extraDirs.filter(d => !currentPath.split(':').includes(d));

Author

Fixed in b9d2f89 — now using currentPath.split(':').includes(d) for exact segment matching.

const localBin = path.join(realHome, '.local', 'bin');
const extraDirs = [binDir, existsSync(localBin) ? localBin : null].filter(Boolean) as string[];
const currentPath = process.env.PATH || '';
const missingDirs = extraDirs.filter(d => !currentPath.includes(d));
Contributor

substring match may cause false positives (e.g., a PATH entry /opt/usr/bin contains /usr/bin)

Suggested change
const missingDirs = extraDirs.filter(d => !currentPath.includes(d));
const missingDirs = extraDirs.filter(d => !currentPath.split(':').includes(d));

Author

Fixed in b9d2f89 — same split(':') fix applied here.

os.homedir() returns /home/user/snap/bun-js/87 under the Bun snap
sandbox, but uvx is installed at /home/user/.local/bin. Derive the
real home directory by stripping the snap suffix from HOME env var.
Fixes Chroma vector search and ~/.local/bin tool discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
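
The home-directory fix in this commit can be sketched as a suffix strip on HOME. The example path /home/user/snap/bun-js/87 is from the commit message; the regex and the resolveRealHome name are assumptions about how the stripping might be done.

```typescript
// Matches a trailing "/snap/<snap-name>/<revision>", as in /home/user/snap/bun-js/87.
const SNAP_HOME_SUFFIX = /\/snap\/[^/]+\/[^/]+\/?$/;

// Hypothetical helper: recover the user's real home from a snap-confined HOME value.
function resolveRealHome(home: string): string {
  // Outside a snap sandbox the pattern does not match and HOME is returned unchanged.
  return home.replace(SNAP_HOME_SUFFIX, '');
}

// ~/.local/bin is then resolved against the real home, where uvx is installed.
function localBinDir(home: string): string {
  return `${resolveRealHome(home)}/.local/bin`;
}
```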
@chadru chadru force-pushed the fix/critical-bug-fixes branch from e5eeb5b to 8af451f Compare February 15, 2026 00:05
- Use split(':') for PATH dedup to prevent false positives (e.g. /usr/bin matching inside /opt/usr/bin)
- Clear MCP timeout promise on both success and failure to prevent timer leak

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@thedotmack
Owner

This omnibus PR contains excellent fixes (fetchWithTimeout, atomic store+confirm, subprocess exit verification, migration idempotency, and more). It now has conflicts due to recently merged PRs (#995, #1022, #1031, #1112) touching overlapping files. Could you rebase onto main? The code quality is high and this is next in line for merge after rebase.

@Chriscross475 Chriscross475 left a comment

This PR addresses critical bug fixes but has blockers:

  1. No CI checks running - Need GitHub Actions to verify build
  2. 32 test failures reported - Even if pre-existing, they need to be addressed or documented separately from this PR
  3. Large scope (1053 additions) - Multiple concerns mixed: bug fixes, snap support, resource monitoring, MCP changes

Recommendations:

  • Split into smaller, focused PRs (one for bug fixes, one for snap support, one for ResourceMonitor)
  • Fix the 32 test failures or confirm they're unrelated
  • Get CI passing before requesting review

Once CI is green and tests pass, I'll review the implementation.

@thedotmack
Owner

Note: The Process Supervisor (PR #1370, v10.5.6) now covers several of the concerns in this PR (sandbox support, resource monitoring foundations, PID management). Please rebase and check what's still needed vs what's now redundant.

3 participants