Skip to content

fix: prevent unbounded Claude subprocess spawning in worker daemon#1008

Closed
jayvenn21 wants to merge 2 commits intothedotmack:mainfrom
jayvenn21:fix/unbounded-claude-subprocess-spawn
Closed

fix: prevent unbounded Claude subprocess spawning in worker daemon#1008
jayvenn21 wants to merge 2 commits intothedotmack:mainfrom
jayvenn21:fix/unbounded-claude-subprocess-spawn

Conversation

@jayvenn21
Copy link
Contributor

Summary

This PR introduces defensive process lifecycle controls to prevent the
worker-service.cjs --daemon from spawning unbounded Claude CLI subprocesses
that are never cleaned up, which can lead to catastrophic memory, swap, and disk
exhaustion over time.

This change does not alter core functionality or user-facing behavior under
normal workloads. It adds guardrails and cleanup to ensure long-running daemon
sessions remain stable.

Problem

On macOS (and potentially other platforms), the daemon currently:

  • spawns Claude CLI subprocesses without a hard upper bound
  • does not reliably terminate or reap completed / idle children
  • allows orphaned subprocesses to accumulate indefinitely

Over multi-day usage, this results in:

  • hundreds of orphaned claude processes
  • tens of GB of RAM + swap consumption
  • disk exhaustion and cascading service failures (e.g. Redis, Docker)

Multiple related issues report the same failure mode.

Changes

This PR introduces minimal but critical safety mechanisms:

  • Track spawned Claude subprocesses explicitly
  • Enforce a maximum concurrent subprocess limit
  • Ensure child processes are terminated and removed on task completion
  • Add defensive cleanup on daemon startup to reap stale children
  • Log warnings when limits are hit instead of silently spawning indefinitely

These changes are intentionally conservative and avoid architectural rewrites.

Why this approach

This is a production-safety fix:

  • prevents resource exhaustion
  • avoids breaking existing workflows
  • buys time for a more comprehensive worker-pool redesign if desired

The goal is to fail safely rather than catastrophically.

Testing

  • Verified subprocess cleanup under repeated task execution
  • Verified daemon remains stable over extended runtime
  • Confirmed no behavior change for normal interactive usage

Related issues

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 7, 2026

Greptile Overview

Greptile Summary

This PR adds defensive lifecycle controls around Claude CLI subprocesses in the worker daemon:

  • ProcessRegistry now tracks spawned subprocesses and enforces a hard cap (MAX_CONCURRENT_CLAUDE_SUBPROCESSES).
  • WorkerService invokes an extra startup reap to kill stale/orphaned Claude processes from prior runs, and continues running the periodic orphan reaper.

Overall this fits the existing approach of process management/cleanup in src/services/worker/ProcessRegistry.ts and the worker’s background initialization flow in src/services/worker-service.ts to keep long-running daemons stable and avoid runaway resource usage.

Confidence Score: 3/5

  • This PR is directionally safe but has a couple of correctness edge cases that can kill valid subprocesses or wedge spawning under load.
  • The added guardrails are valuable, but the startup reap currently calls reapOrphanedProcesses with an empty active-session set (which can SIGKILL any registry-tracked children if any are running at that moment). Separately, cap enforcement is based on registry size, so if the registry gets out of sync with live processes the worker can start refusing spawns until restart.
  • src/services/worker-service.ts, src/services/worker/ProcessRegistry.ts

Important Files Changed

Filename Overview
src/services/worker-service.ts Adds a startup call to reap orphaned Claude subprocesses after background initialization; current implementation passes an empty active session set which can SIGKILL valid in-flight subprocesses.
src/services/worker/ProcessRegistry.ts Adds a max concurrent subprocess cap and exposes active process count; cap enforcement throws when registry size reaches limit, which can wedge spawning if the registry gets out of sync with live processes.

Sequence Diagram

sequenceDiagram
  autonumber
  participant WS as WorkerService
  participant PR as ProcessRegistry
  participant OS as OS/ps
  participant CP as Claude subprocess

  WS->>WS: initializeBackground()
  WS->>WS: initializationCompleteFlag=true
  WS->>WS: resolveInitialization()
  WS->>PR: reapOrphanedProcesses(activeSessionIds=∅)
  PR->>PR: iterate processRegistry
  PR-->>CP: SIGKILL tracked children not in activeSessionIds
  PR->>OS: ps | grep claude... (killSystemOrphans)
  OS-->>PR: orphan list (ppid=1)
  PR-->>CP: SIGKILL system orphans
  WS->>PR: startOrphanReaper(getActiveSessionIds)
  loop every 5 minutes
    PR->>PR: reapOrphanedProcesses(activeSessionIds)
    PR->>OS: killSystemOrphans()
  end

  Note over PR,CP: createPidCapturingSpawn enforces MAX_CONCURRENT_CLAUDE_SUBPROCESSES
  WS->>PR: createPidCapturingSpawn(sessionDbId)
  PR-->>CP: spawn(command,args)
  PR->>PR: registerProcess(pid, sessionDbId)
  CP-->>PR: 'exit' event
  PR->>PR: unregisterProcess(pid)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@jayvenn21
Copy link
Contributor Author

It looks like the failing claude-review check is due to missing OIDC
permissions rather than a code issue.

The error indicates the GitHub Action cannot fetch an ID token
(ACTIONS_ID_TOKEN_REQUEST_URL is not set), which commonly happens on
forked PRs unless the workflow explicitly enables:

permissions:
id-token: write

Since this is a contributor PR from a fork, the action may need to be
disabled for forks, made non-blocking, or the permissions adjusted by a
maintainer.

Happy to rebase or rerun once that’s resolved.

@JGrubb
Copy link

JGrubb commented Feb 10, 2026

See here also - #1010

@jayvenn21
Copy link
Contributor Author

See here also - #1010

Sounds good. I'll take a look at it.

@thedotmack
Copy link
Owner

Closing in favor of #1085 (rodboev) which provides a more comprehensive solution to the same problem. Thank you for the contribution!

@thedotmack thedotmack closed this Feb 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants