Skip to content

fix: detect stale PID files via health endpoint cross-check (#1231)#1357

Closed
ousamabenyounes wants to merge 2 commits intothedotmack:mainfrom
ousamabenyounes:fix/stale-pid-1231
Closed

fix: detect stale PID files via health endpoint cross-check (#1231)#1357
ousamabenyounes wants to merge 2 commits intothedotmack:mainfrom
ousamabenyounes:fix/stale-pid-1231

Conversation

@ousamabenyounes
Copy link
Contributor

Summary

  • Adds getHealthPid() to fetch the worker's actual PID from /api/health
  • After cleanStalePidFile() (which only checks kill -0), cross-checks the health endpoint PID against the PID file
  • Removes stale PID files when:
    • Health endpoint reports a different PID (PID reuse by unrelated process)
    • Process appears alive but doesn't respond to health checks (zombie worker after OOM/sleep/wake)

Fixes #1231

Test plan

  • 4 new tests for getHealthPid() (PID returned, connection refused, non-ok response, missing pid field)
  • All 1119 tests pass (0 failures)
  • Backward compatible: no behavior change when PID file and health endpoint agree

🤖 Coded by Claude, vibe-coded by Ousama Ben Younes

🤖 Generated with Claude Code

…otmack#1231)

After cleanStalePidFile() determines a PID is alive (kill -0), cross-check
the health endpoint's reported PID against the PID file. Removes stale PID
files when the health endpoint reports a different PID (PID reuse) or when
the process is alive but not responding to health checks (zombie worker).

Adds getHealthPid() utility and 4 tests covering the new function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xkonjin
Copy link

xkonjin commented Mar 13, 2026

Code Review

Overview

This PR addresses issue #1231 by adding a health endpoint cross-check to detect stale PID files. The fix prevents PID reuse issues and zombie PID files after OOM/sleep/wake scenarios.

Strengths

  1. Robust Solution: The fix adds a double-check mechanism that verifies the PID file against the actual running worker's health endpoint

  2. Well-Tested: Added comprehensive unit tests for the new getHealthPid function, covering:

    • Successful PID retrieval
    • Worker not responding scenarios
    • Non-ok health responses
    • Missing pid field in response
  3. Good Logging: Added informative log messages for debugging stale PID file scenarios

  4. Graceful Degradation: The implementation handles failures gracefully by returning null and continuing execution

Issues & Concerns

None identified. The implementation is sound.

Security Concerns

None identified.

Potential Bugs

None identified. The error handling is comprehensive.

Test Coverage

Excellent test coverage for the new getHealthPid function:

  • Tests cover all error scenarios
  • Tests verify successful PID retrieval
  • Tests validate response parsing

Consider adding integration tests for the worker-service.ts changes to verify the end-to-end behavior of stale PID file detection.

Recommendations

  1. Consider adding an integration test that simulates a stale PID file scenario
  2. Document the health endpoint contract (response schema) in API documentation

Summary

This is a well-implemented fix that addresses a real-world issue with PID file management. The code is clean, well-tested, and includes good error handling and logging.

Overall Assessment: Approve

Simulates three scenarios: health reports different PID (PID reuse),
health unreachable (zombie worker), and PIDs match (healthy worker).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ousamabenyounes
Copy link
Contributor Author

Thanks for the review! Added the recommended integration tests in commit 2df3e77:

  • PID mismatch: simulates PID reuse where health endpoint reports a different PID than the PID file
  • Health unreachable: simulates zombie worker where PID file exists but worker doesn't respond
  • PIDs match: confirms healthy state when PID file and health endpoint agree

All 1113 tests pass, 0 failures.

@thedotmack
Copy link
Owner

Superseded by the embedded Process Supervisor (PR #1370, v10.5.6). Stale PID detection is now handled by validateWorkerPidFile() in src/supervisor/index.ts which checks liveness via process.kill(pid, 0) on every startup.

@thedotmack thedotmack closed this Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Worker start command reports success with stale PID file (worker actually dead)

3 participants