fix: critical bug fixes, snap sandbox support, and resource monitoring#1102
fix: critical bug fixes, snap sandbox support, and resource monitoring#1102chadru wants to merge 8 commits intothedotmack:mainfrom
Conversation
Greptile OverviewGreptile SummaryThis PR delivers 7 critical bug fixes, Bun snap sandbox support, and resource monitoring infrastructure. The changes address production stability issues: process leaks fixed by calling Confidence Score: 4/5
Important Files Changed
Flowchartflowchart TD
A[Hook Handlers] -->|fetchWithTimeout 10-15s| B[Worker Service API]
B -->|Queue Message| C[PendingMessageStore]
C -->|MAX_QUEUE_DEPTH=100| D{Queue Full?}
D -->|Yes| E[Drop Oldest Pending]
D -->|No| F[Enqueue Message]
G[SessionManager] -->|Claim & Process| C
G -->|Spawns| H[Claude SDK Agent]
H -->|Generates| I[Observations]
I -->|Atomic Transaction| J[ResponseProcessor]
J -->|1. Store Observations| K[SQLite DB]
J -->|2. Confirm Messages| C
L[Idle Timeout] -->|abort + ensureProcessExit| H
M[Stuck Message Recovery] -->|Every 60s| C
M -->|Reset processing > 5min| N[Back to Pending]
O[ResourceMonitor] -->|30s Sampling| P[Memory + Tokens]
P -->|Detect Anomalies| Q[Alerts]
R[Snap Sandbox] -->|resolveRuntimeBinDir| S[CLAUDE_CODE_PATH]
S -->|Augment PATH| T[node/claude/uvx]
U[MCP Init] -->|Non-blocking 60s| V{Success?}
V -->|Yes| W[Vector Search Available]
V -->|No| X[Core Still Works]
Last reviewed commit: e5eeb5b |
|
@chadru have you ever seen a greptile chart THIS clean??? This is a REALLY great job, running it to see how it goes! :) |
|
Thank you, working on merging it with the new release. It looks like it dropped while I was working on it. |
…th handling 1. Periodic stuck message recovery (thedotmack#1036, thedotmack#1052): Add 60s interval to reset messages stuck in 'processing' for >5min back to 'pending'. 2. Atomic observation storage + confirmation (thedotmack#1036, thedotmack#1091): Wrap storeObservations() and confirmProcessed() in a single db.transaction() to prevent duplicate observations on worker crash. 3. Empty project race condition (thedotmack#1046): Pass cwd as project fallback when creating sessions from PostToolUse observations, leveraging existing backfill logic in createSDKSession(). 4. Queue depth limit: Add MAX_QUEUE_DEPTH=100 to PendingMessageStore to prevent unbounded queue growth when SDK agent is stuck/failing. 5. HealthMonitor ENOENT crash (thedotmack#1042): Add ENOENT/EBUSY handling to getInstalledPluginVersion() matching existing worker-utils.ts pattern. 6. ChromaSync hardcoded path: Replace os.homedir() hardcoded paths with VECTOR_DB_DIR and DATA_DIR from centralized paths.ts module. 7. smart-install.js hardcoded paths (thedotmack#1030): Respect CLAUDE_CONFIG_DIR env var for XDG-compatible config directory resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…y, and CLAUDE.md scope - Fix zombie subprocess leaks: verify process exit on idle timeout and natural completion (thedotmack#1010, thedotmack#1068, thedotmack#1089, thedotmack#1090) - Add fetchWithTimeout to all hook handlers: prevent indefinite hangs when worker is slow (thedotmack#1079, thedotmack#730) - Fix CORS origin matching: handle localhost without port and IPv6 [::1] (thedotmack#1029) - Fix wildcard search: route query="*" to SQLite instead of Chroma which can't handle it (thedotmack#714) - Fix migration004 idempotency: check both version tracking AND table existence for partial migrations (thedotmack#979) - Fix isProjectRoot: detect subdirectories within git repos using git rev-parse (thedotmack#793) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New infrastructure service that samples process.memoryUsage() and per-session token rates every 30s, maintains a 1-hour ring buffer of snapshots, and detects anomalies: - Memory leak: alerts on monotonic RSS growth >20% over 10 samples - High memory: alerts when RSS exceeds 512MB - Token runaway: alerts when any session exceeds 50k tokens/min - Exposes diagnostics via GET /api/diagnostics/resources endpoint - Alert deduplication within 5-minute windows - Clean shutdown with interval cleanup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MCP connection (vector search) was blocking initializeBackground() with a 5-minute timeout. If MCP failed to connect, the entire worker was stuck: no orphan reaper, no resource monitor, no stuck message recovery, and all data API endpoints returned "Service initializing" indefinitely. Now core init completes first (DB, search routes, orphan reaper, resource monitor), then MCP connects in the background. Timeout reduced from 5min to 60s. MCP failure only disables vector search, not the whole worker. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Bun snap sandbox strips user PATH, preventing the worker from finding node and claude executables installed via nvm. Derives the runtime bin directory from the CLAUDE_CODE_PATH setting and augments PATH in both the MCP transport env and SDK agent isolated env. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code sets CLAUDECODE=1 to prevent nested sessions. The worker daemon inherits this from hook invocations, causing the SDK agent's spawned claude subprocess to refuse to start (exit code 1). Add CLAUDECODE to the blocked env vars list alongside ANTHROPIC_API_KEY. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| Promise.race([mcpConnectionPromise, timeoutPromise]) | ||
| .then(() => { | ||
| this.mcpReady = true; | ||
| logger.success('WORKER', 'Connected to MCP server'); | ||
| }) | ||
| .catch((error) => { | ||
| logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', { | ||
| error: error instanceof Error ? error.message : String(error) | ||
| }); | ||
| }); |
There was a problem hiding this comment.
timeout promise never cleaned up if connection succeeds
| Promise.race([mcpConnectionPromise, timeoutPromise]) | |
| .then(() => { | |
| this.mcpReady = true; | |
| logger.success('WORKER', 'Connected to MCP server'); | |
| }) | |
| .catch((error) => { | |
| logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', { | |
| error: error instanceof Error ? error.message : String(error) | |
| }); | |
| }); | |
| const timeoutId = setTimeout(() => reject(new Error('MCP connection timeout after 60s')), MCP_INIT_TIMEOUT_MS); | |
| const timeoutPromise = new Promise<never>((_, reject) => timeoutId); | |
| Promise.race([mcpConnectionPromise, timeoutPromise]) | |
| .then(() => { | |
| clearTimeout(timeoutId); | |
| this.mcpReady = true; | |
| logger.success('WORKER', 'Connected to MCP server'); | |
| }) | |
| .catch((error) => { | |
| clearTimeout(timeoutId); | |
| logger.warn('WORKER', 'MCP connection failed (vector search unavailable)', { | |
| error: error instanceof Error ? error.message : String(error) | |
| }); | |
| }); |
There was a problem hiding this comment.
Fixed in b9d2f89 — added clearTimeout(timeoutId) in both .then() and .catch() paths.
src/shared/EnvManager.ts
Outdated
| const extraDirs = [runtimeBinDir, existsSync(localBinDir) ? localBinDir : null].filter(Boolean) as string[]; | ||
| if (extraDirs.length > 0) { | ||
| const currentPath = isolatedEnv.PATH || ''; | ||
| const missing = extraDirs.filter(d => !currentPath.includes(d)); |
There was a problem hiding this comment.
substring match may cause false positives (e.g., /usr/local/bin contains /usr/bin)
| const missing = extraDirs.filter(d => !currentPath.includes(d)); | |
| const missing = extraDirs.filter(d => !currentPath.split(':').includes(d)); |
There was a problem hiding this comment.
Fixed in b9d2f89 — now using currentPath.split(':').includes(d) for exact segment matching.
src/services/worker-service.ts
Outdated
| const localBin = path.join(realHome, '.local', 'bin'); | ||
| const extraDirs = [binDir, existsSync(localBin) ? localBin : null].filter(Boolean) as string[]; | ||
| const currentPath = process.env.PATH || ''; | ||
| const missingDirs = extraDirs.filter(d => !currentPath.includes(d)); |
There was a problem hiding this comment.
substring match may cause false positives (e.g., /usr/local/bin contains /usr/bin)
| const missingDirs = extraDirs.filter(d => !currentPath.includes(d)); | |
| const missingDirs = extraDirs.filter(d => !currentPath.split(':').includes(d)); |
There was a problem hiding this comment.
Fixed in b9d2f89 — same split(':') fix applied here.
os.homedir() returns /home/user/snap/bun-js/87 under the Bun snap sandbox, but uvx is installed at /home/user/.local/bin. Derive the real home directory by stripping the snap suffix from HOME env var. Fixes Chroma vector search and ~/.local/bin tool discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e5eeb5b to
8af451f
Compare
- Use split(':') for PATH dedup to prevent false positives (e.g. /usr/bin matching /usr/local/bin)
- Clear MCP timeout promise on both success and failure to prevent timer leak
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
This omnibus PR contains excellent fixes (fetchWithTimeout, atomic store+confirm, subprocess exit verification, migration idempotency, and more). It now has conflicts due to recently merged PRs (#995, #1022, #1031, #1112) touching overlapping files. Could you rebase onto main? The code quality is high and this is next in line for merge after rebase. |
Chriscross475
left a comment
There was a problem hiding this comment.
This PR addresses critical bug fixes but has blockers:
- No CI checks running - Need GitHub Actions to verify build
- 32 test failures reported - Even if pre-existing, they need to be addressed or documented separately from this PR
- Large scope (1053 additions) - Multiple concerns mixed: bug fixes, snap support, resource monitoring, MCP changes
Recommendations:
- Split into smaller, focused PRs (one for bug fixes, one for snap support, one for ResourceMonitor)
- Fix the 32 test failures or confirm they're unrelated
- Get CI passing before requesting review
Once CI is green and tests pass, I'll review the implementation.
|
Note: The Process Supervisor (PR #1370, v10.5.6) now covers several of the concerns in this PR (sandbox support, resource monitoring foundations, PID management). Please rebase and check what's still needed vs what's now redundant. |
Summary
8 commits rebased cleanly onto v10.0.7 (post PR #792 HTTP Chroma migration).
split(':')instead of substring), MCP timeout promise cleanup on success/failureFixes addressed
ensureProcessExit()on idle timeout and natural completionfetchWithTimeoutacross all 6 hook handlers (context, file-edit, observation, user-message, session-complete, session-init)*queries routed to SQLite instead of ChromaisProjectRoot: git subdirectory detection viagit rev-parse --show-toplevelCLAUDECODEenv var stripped from SDK subprocess to prevent nested session rejectionCLAUDE_CODE_PATH, resolves real homedir for~/.local/binclearTimeout()in both success and failure paths to prevent timer leakTest plan
npm run build)status: ok🤖 Generated with Claude Code