feat(celery): finalize in-flight task on worker eviction so the chord proceeds#1203

Open: ocervell wants to merge 7 commits into main from feat/worker-eviction-finalize

@ocervell ocervell commented Jun 21, 2026

Contributor

Problem

On a K8s pod eviction (SIGTERM → grace → SIGKILL) the worker running a task dies. With task_acks_late on the Redis broker, its message is redelivered only after the broker visibility timeout, which we derive from each pool's deadline and set above it (hours, on the long-task pool). The surrounding chord/workflow therefore stalls for that entire window.

This is the one OOM/eviction case the other layers don't cover:

| Scenario | Recovery |
|---|---|
| Task timeout / memory soft-limit | secator monitor self-stops → partial results (existing) |
| Fast-spike OOM (child killed, master survives) | task_reject_on_worker_lost requeues immediately (~18s), capped by #1199 |
| Whole-pod eviction (master dies too) | ❌ nothing requeues fast on Redis → visibility-timeout stall |

A whole-pod eviction has no surviving master, so reject_on_worker_lost can't help and Redis falls back to the visibility timeout.
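To make the stall concrete, the settings involved can be sketched as a plain Celery-style configuration mapping. The setting names (`task_acks_late`, `task_reject_on_worker_lost`, and `visibility_timeout` under `broker_transport_options`) are real Celery options; the pool deadline, the margin, and the helper `visibility_timeout_for` are hypothetical and chosen only for illustration, not the project's actual values.

```python
# Hypothetical numbers: the PR only says the timeout is "hours" on the long-task pool.
LONG_POOL_DEADLINE_S = 4 * 3600   # assumed task deadline for the long-task pool
SAFETY_MARGIN_S = 600             # assumed margin kept above the deadline


def visibility_timeout_for(pool_deadline_s: int, margin_s: int = SAFETY_MARGIN_S) -> int:
    """Return a visibility timeout strictly above the pool's task deadline."""
    return pool_deadline_s + margin_s


celery_config = {
    # Ack only after the task finishes, so a dead worker leaves the message unacked.
    "task_acks_late": True,
    # Lets a surviving master requeue a task whose child process was killed.
    "task_reject_on_worker_lost": True,
    # On Redis, an unacked message is redelivered only after this many seconds.
    "broker_transport_options": {
        "visibility_timeout": visibility_timeout_for(LONG_POOL_DEADLINE_S),
    },
}

stall_hours = celery_config["broker_transport_options"]["visibility_timeout"] / 3600
print(f"worst-case stall after a whole-pod eviction: ~{stall_hours:.1f} h")
```

With no surviving master to reject the message, the unacked task simply waits out that timeout, which is the window this PR is trying to avoid.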

Fix

Catch the worker_shutting_down signal and raise a flag the running task's monitor thread polls. On shutdown it stops the task early via the existing stop_process(exit_ok=True) path — the same self-stop already used for the timeout and memory-limit cases — so the task returns its partial results through the normal completion path and the chord proceeds in seconds instead of waiting for the visibility timeout.

  • celery_signals.py: SHUTDOWN_FLAG + worker_shutting_down_handler (wired unconditionally); stale flag cleared on worker boot. File-based so it crosses the prefork master→child boundary.
  • command.py: the monitor checks the flag, and its poll interval is capped (MONITOR_POLL_SECONDS=5) so an eviction is caught well within the pod's terminationGracePeriodSeconds (60s).
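A minimal sketch of the file-based flag described in the first bullet, for readers who have not opened the diff. The helper names follow the PR, but the flag path and the function bodies are assumptions; the real code in celery_signals.py may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical location; the PR does not show the real SHUTDOWN_FLAG path.
SHUTDOWN_FLAG = Path(tempfile.gettempdir()) / "example_worker_shutting_down"


def is_worker_shutting_down() -> bool:
    """True once the shutdown marker file exists."""
    return SHUTDOWN_FLAG.exists()


def clear_shutdown_flag() -> None:
    """Remove the marker; a missing file is not an error."""
    SHUTDOWN_FLAG.unlink(missing_ok=True)


def worker_shutting_down_handler(sig=None, how=None, exitcode=None, **kwargs) -> None:
    """Raise the flag. A file is visible to prefork children, unlike a
    variable set in the master process after the fork."""
    SHUTDOWN_FLAG.touch()


# Wiring (needs Celery installed, so it is left as a comment here):
#   from celery import signals
#   signals.worker_shutting_down.connect(worker_shutting_down_handler)

clear_shutdown_flag()                         # stale flag cleared on boot
worker_shutting_down_handler(sig="SIGTERM")   # what the signal would trigger
print(is_worker_shutting_down())
clear_shutdown_flag()
```

A file is the simplest shared state here because the signal fires in the master process while the monitor thread runs in a forked child.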

Goes through Celery's normal completion path, so the chord is satisfied the ordinary way — no pool change (stays prefork), no Celery upgrade, no revoke (which yields REVOKED/TaskRevokedError, the wrong tool here — that's for cancel, e.g. stop_runner).

Tests

  • test_shutdown_flag_lifecycle — flag set/clear.
  • test_monitor_stops_running_command_on_shutdown — a real sleep 30 command stops early (well under 30s) and emits the eviction Warning.

Validated end-to-end locally: evicting a worker mid-chord, the chord callback fired in ~5s (vs. the hours-long visibility timeout) with partial results from the evicted member.

Part of the OOM/eviction robustness set alongside #1199 (worker-loss retry cap) and #1202 (on_build runner identity).

🤖 Generated with Claude Code

https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

Summary by CodeRabbit

  • New Features

    • Implemented graceful worker shutdown that terminates running commands and automatically saves partial results, preventing indefinite task hangs during system shutdown
    • Enhanced command monitoring to responsively detect shutdown signals and exit long-running tasks quickly
  • Tests

    • Added comprehensive unit tests validating worker shutdown eviction and command termination behavior

… proceeds

On a K8s pod eviction (SIGTERM -> grace -> SIGKILL) the worker running a task
dies. With task_acks_late on the Redis broker, its message is only redelivered
after the broker visibility timeout (hours, on the long-task pool) — stalling the
surrounding chord/workflow that whole time. Unlike a child OOM (where the master
survives and task_reject_on_worker_lost requeues immediately), a whole-pod
eviction has no surviving master, so Redis can only fall back to the visibility
timeout.

Catch the worker_shutting_down signal and raise a flag the running task's monitor
thread polls. On shutdown it stops the task early via the existing stop_process
path, so the task returns its partial results through the normal completion path
and the chord proceeds in seconds instead of waiting for the visibility timeout.

- celery_signals.py: SHUTDOWN_FLAG + worker_shutting_down_handler (wired
  unconditionally); stale flag cleared on worker boot.
- command.py: the monitor thread checks the flag (same self-stop used for
  timeout/memory limits) and caps its poll interval (MONITOR_POLL_SECONDS=5) so
  an eviction is caught well within the pod's terminationGracePeriodSeconds.
- tests: flag lifecycle + a behavioral test (a running command stops early and
  emits the eviction warning).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

coderabbitai Bot commented Jun 21, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 769d31df-c3ea-4935-90d6-11086a1bb6c3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds a file-based worker shutdown flag (SHUTDOWN_FLAG) in celery_signals.py with helpers to set, check, and clear it. The flag is written when Celery fires worker_shutting_down. The command monitor loop in command.py polls this flag and, when set, sends SIGTERM to the running subprocess and exits early. A new unit test module validates the full eviction lifecycle.

Changes

Worker Eviction Self-Finalize

| Layer / File(s) | Summary |
|---|---|
| Shutdown flag API and Celery signal wiring (secator/celery_signals.py) | Defines SHUTDOWN_FLAG path, is_worker_shutting_down(), clear_shutdown_flag(), and worker_shutting_down_handler() that writes the marker on eviction. setup_handlers() is updated to clear stale flags on boot and register the handler unconditionally with signals.worker_shutting_down. |
| Command monitor shutdown detection (secator/runners/command.py) | Adds MONITOR_POLL_SECONDS = 5, imports is_worker_shutting_down inside _monitor_process, inserts a shutdown branch that enqueues a Warning and calls stop_process(exit_ok=True) with SIGTERM, and caps the monitor sleep to min(stat_update_frequency, MONITOR_POLL_SECONDS) for faster reaction. |
| Eviction unit tests (tests/unit/test_eviction.py) | TestEvictionSelfFinalize verifies flag set/clear lifecycle, confirms a sleep 30 command terminates promptly when the shutdown flag is raised mid-execution (with MONITOR_POLL_SECONDS patched to speed up polling), and checks that an eviction warning appears in results. |
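
The monitor change summarized above can be sketched as a small polling loop. This is an illustration under assumptions, not the code in command.py: the function name `monitor` and its injected callables are hypothetical, and only MONITOR_POLL_SECONDS, stop_process(exit_ok=True), and the min(...) cap come from the PR.

```python
import time

MONITOR_POLL_SECONDS = 5  # value from the PR


def monitor(is_running, is_shutting_down, stop_process, stat_update_frequency,
            sleep=time.sleep):
    """Poll until the process ends or a shutdown is flagged.

    Returns "evicted" if the shutdown branch fired, otherwise "finished".
    """
    # Cap the sleep so a slow stats interval cannot delay eviction handling.
    interval = min(stat_update_frequency, MONITOR_POLL_SECONDS)
    while is_running():
        if is_shutting_down():
            stop_process(exit_ok=True)  # same self-stop as timeout / memory limit
            return "evicted"
        sleep(interval)
    return "finished"


# Simulated run: the flag appears on the third poll; sleeps are recorded, not waited.
sleeps, stopped = [], []
polls = iter([False, False, True])
outcome = monitor(
    is_running=lambda: True,
    is_shutting_down=lambda: next(polls),
    stop_process=lambda exit_ok: stopped.append(exit_ok),
    stat_update_frequency=20,
    sleep=sleeps.append,
)
print(outcome, sleeps, stopped)  # prints: evicted [5, 5] [True]
```

With a 20s stats interval the loop still sleeps only 5s per iteration, so the worst-case reaction time stays far below the 60s grace period.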

Sequence Diagram(s)

sequenceDiagram
    participant Celery as Celery Worker
    participant SignalHandler as worker_shutting_down_handler
    participant FlagFile as SHUTDOWN_FLAG (filesystem)
    participant Monitor as _monitor_process loop
    participant Subprocess as Running subprocess

    Celery->>SignalHandler: signals.worker_shutting_down fires
    SignalHandler->>FlagFile: touch(SHUTDOWN_FLAG)
    loop every min(stat_update_frequency, 5s)
        Monitor->>FlagFile: is_worker_shutting_down()
        FlagFile-->>Monitor: True
        Monitor->>Monitor: enqueue Warning("worker shutting down")
        Monitor->>Subprocess: SIGTERM via stop_process(exit_ok=True)
        Monitor->>Monitor: break — save incomplete results
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • freelabz/secator#727: Introduces the _monitor_process subsystem in secator/runners/command.py that this PR extends with the is_worker_shutting_down() early-exit branch.

Poem

🐇 Hop hop, the worker must go,
A flag file appears, soft and low.
The monitor peeks every five,
"SIGTERM!" — no process alive.
Incomplete tasks saved with grace,
The chord moves on to its rightful place! 🏁

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(celery): finalize in-flight task on worker eviction so the chord proceeds' directly summarizes the main change: gracefully finalizing in-flight tasks during worker eviction to allow chord workflows to proceed quickly. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@secator/celery_signals.py`:
- Around line 36-41: The worker_shutting_down_handler function silently swallows
all exceptions when writing to SHUTDOWN_FLAG, which can mask real issues like
missing or unwritable state directories and cause tasks to stall. Modify the
function to first ensure the parent directory of SHUTDOWN_FLAG exists by
creating it if necessary, then replace the broad exception handler with one that
specifically catches and logs at least OSError exceptions with an appropriate
error message, while still allowing the function to complete gracefully without
raising.

In `@tests/unit/test_eviction.py`:
- Around line 19-23: The setUp and tearDown methods are operating on the real
global shutdown flag file, which causes test interference when running tests in
parallel or with shared STATE_DIR. Modify the setUp method to patch
secator.celery_signals.SHUTDOWN_FLAG to point to a temporary test-specific path
before calling clear_shutdown_flag(), and modify the tearDown method to clean up
that patch after calling clear_shutdown_flag() to ensure each test uses an
isolated shutdown flag path.
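
One way the first finding could be addressed, as a sketch only: create the flag's parent directory, then catch and log OSError rather than swallowing everything. The flag path, the logger, and the boolean return value are assumptions added for illustration; the handler in the PR may look different.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

# Hypothetical path whose parent directory may not exist yet.
SHUTDOWN_FLAG = Path(tempfile.gettempdir()) / "example_state_dir" / "worker_shutting_down"


def worker_shutting_down_handler(sig=None, how=None, exitcode=None, **kwargs) -> bool:
    """Write the marker; log instead of raising if the filesystem refuses."""
    try:
        SHUTDOWN_FLAG.parent.mkdir(parents=True, exist_ok=True)
        SHUTDOWN_FLAG.touch()
        return True
    except OSError as exc:
        # Missing or unwritable state dir: surface it, but let shutdown continue.
        logger.error("could not write shutdown flag %s: %s", SHUTDOWN_FLAG, exc)
        return False


print(worker_shutting_down_handler(sig="SIGTERM"))
SHUTDOWN_FLAG.unlink(missing_ok=True)  # clean up the example marker
```

Logging the failure matters because a flag that silently fails to appear puts the task back on the visibility-timeout path with no trace of why.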

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0e1675a-0d72-4b1b-a024-c404d0ceaabe

📥 Commits

Reviewing files that changed from the base of the PR and between d39d440 and 4fe6fb1.

📒 Files selected for processing (3)
  • secator/celery_signals.py
  • secator/runners/command.py
  • tests/unit/test_eviction.py

Comment thread secator/celery_signals.py Outdated
Comment thread tests/unit/test_eviction.py
ocervell and others added 6 commits June 21, 2026 22:13
…late test

The integration suite caught a real leak: SHUTDOWN_FLAG is a machine-global file,
so once any worker fired worker_shutting_down (e.g. between test files) the flag
persisted and every later task whose monitor polled once self-aborted — slow
tasks failed, fast ones that finished before the first poll passed. Prod never
hit it (1 task = 1 pod = fresh /tmp), but any shared/long-lived worker does.

- command.py: clear the flag at monitor start, so it only means "shutdown raised
  *during this run*", not a stale flag from a previous worker/task.
- celery_signals.py (CodeRabbit): ensure the flag's parent dir exists and catch +
  log OSError instead of swallowing all exceptions.
- test_eviction.py (CodeRabbit): isolate SHUTDOWN_FLAG to a per-test temp path;
  add a regression test that a flag set *before* a task starts does not stop it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
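
The stale-flag fix in this commit can be illustrated with a toy monitor that clears the flag before it starts polling, run against an isolated temporary path as the review suggested. `run_monitor` and its parameters are hypothetical names for this sketch and do not appear in the PR.

```python
import tempfile
from pathlib import Path


def run_monitor(flag: Path, polls: int, raise_flag_at=None) -> str:
    """Toy monitor: clears any stale flag first, then polls `polls` times.

    `raise_flag_at` simulates a shutdown arriving during the given poll.
    """
    flag.unlink(missing_ok=True)          # stale flag from an earlier worker/task
    for i in range(polls):
        if i == raise_flag_at:
            flag.touch()                  # shutdown raised during this run
        if flag.exists():
            return "stopped early"
    return "completed"


with tempfile.TemporaryDirectory() as tmp:   # isolated flag path for this example
    flag = Path(tmp) / "worker_shutting_down"
    flag.touch()                             # leaked by a previous worker
    stale = run_monitor(flag, polls=3)
    live = run_monitor(flag, polls=3, raise_flag_at=1)

print(stale, "|", live)  # prints: completed | stopped early
```

The first run ignores the leaked flag and completes; only a flag raised after the monitor has started stops the task, which is the semantics the commit message describes.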
…ace)

The behavioral test set the flag once after a fixed 2s wait, but the monitor now
clears any pre-existing flag at startup; on a slow CI runner the monitor started
*after* that single set and wiped it, so the task never stopped and the test
failed. Re-raise the flag in a loop until the task stops, so the monitor's poll
sees it once it is running regardless of start timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
The real-subprocess behavioral test was environment-dependent: in CI the monitor
thread didn't stop the live `sleep` reliably, so the test flaked (and the stale
test passed for the wrong reason). Replace both with deterministic tests that
exercise _monitor_process directly against a bare Command — one asserts it calls
stop_process(exit_ok=True) + emits the eviction Warning when the flag is set,
the other asserts it clears a stale flag at start (the integration-leak
regression). No subprocess, no timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
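
A deterministic test of this shape might look like the sketch below. `FakeCommand` is a hypothetical stand-in that only mimics the shutdown branch; the real tests call `_monitor_process` on a bare Command, which is not shown in this conversation.

```python
import unittest
from unittest import mock


class FakeCommand:
    """Hypothetical stand-in for the runner; only what the monitor touches."""

    def __init__(self, shutting_down):
        self.shutting_down = shutting_down   # injected flag check
        self.results = []
        self.stop_process = mock.Mock()

    def _monitor_process(self):
        if self.shutting_down():
            self.results.append("Warning: worker shutting down, stopping task early")
            self.stop_process(exit_ok=True)


class TestEvictionSketch(unittest.TestCase):
    def test_stops_when_flag_set(self):
        cmd = FakeCommand(shutting_down=lambda: True)
        cmd._monitor_process()
        cmd.stop_process.assert_called_once_with(exit_ok=True)
        self.assertTrue(any("shutting down" in r for r in cmd.results))

    def test_keeps_running_when_flag_unset(self):
        cmd = FakeCommand(shutting_down=lambda: False)
        cmd._monitor_process()
        cmd.stop_process.assert_not_called()
        self.assertEqual(cmd.results, [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEvictionSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the flag check and stop_process are both injected or mocked, nothing here depends on subprocess start-up or thread scheduling, which is what made the earlier version flaky in CI.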
The real-file version of the stale-flag test flaked on CI (patched temp path +
bare-__new__ Command), while the real behaviour is already proven green by the
integration suite. Assert that _monitor_process calls clear_shutdown_flag() at
start via a mock instead of inspecting the filesystem — deterministic and
environment-independent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
@ocervell ocervell added the feature:worker-reliability Celery worker reliability & redelivery label Jun 24, 2026