feat(celery): finalize in-flight task on worker eviction so the chord proceeds by ocervell · Pull Request #1203 · freelabz/secator

ocervell · 2026-06-21T19:56:23Z

Problem

On a K8s pod eviction (SIGTERM → grace → SIGKILL) the worker running a task dies. With task_acks_late on the Redis broker, its message is only redelivered after the broker visibility timeout — which we derive above each pool's deadline (hours, on the long-task pool). So the surrounding chord/workflow stalls for that entire window.

This is the one OOM/eviction case the other layers don't cover:

| Scenario | Recovery |
| --- | --- |
| Task timeout / memory soft-limit | secator monitor self-stops → partial results (existing) |
| Fast-spike OOM (child killed, master survives) | `task_reject_on_worker_lost` requeues immediately (~18s), capped by #1199 |
| Whole-pod eviction (master dies too) | ❌ nothing requeues fast on Redis → visibility-timeout stall |

A whole-pod eviction has no surviving master, so reject_on_worker_lost can't help and Redis falls back to the visibility timeout.
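
To make the size of the stall concrete, here is a minimal sketch of how a visibility timeout derived from a pool deadline ends up in the Celery settings. `task_acks_late`, `task_reject_on_worker_lost` and `broker_transport_options["visibility_timeout"]` are real Celery settings; the pool names, deadlines and margin are assumptions made up for illustration, not values from this PR.

```python
# Hypothetical per-pool deadlines in seconds; the real values are not given in the PR.
POOL_DEADLINES = {"default": 15 * 60, "long": 6 * 60 * 60}
VISIBILITY_MARGIN = 5 * 60  # assumed safety margin added on top of the deadline


def celery_settings(pool: str) -> dict:
    """Build Celery settings whose visibility timeout sits above the pool deadline."""
    deadline = POOL_DEADLINES[pool]
    return {
        "task_acks_late": True,              # ack only after the task finishes
        "task_reject_on_worker_lost": True,  # helps only while the master survives
        "broker_transport_options": {"visibility_timeout": deadline + VISIBILITY_MARGIN},
    }


settings = celery_settings("long")
stall_hours = settings["broker_transport_options"]["visibility_timeout"] / 3600
print(f"worst-case redelivery stall on the long pool: ~{stall_hours:.1f}h")
```

With late acks, an unacknowledged message from a dead pod becomes visible again only once this timeout elapses, which is why the stall scales with the pool deadline.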

Fix

Catch the worker_shutting_down signal and raise a flag the running task's monitor thread polls. On shutdown it stops the task early via the existing stop_process(exit_ok=True) path — the same self-stop already used for the timeout and memory-limit cases — so the task returns its partial results through the normal completion path and the chord proceeds in seconds instead of waiting for the visibility timeout.

- celery_signals.py: SHUTDOWN_FLAG + worker_shutting_down_handler (wired unconditionally); stale flag cleared on worker boot. File-based so it crosses the prefork master→child boundary.
- command.py: the monitor checks the flag, and its poll interval is capped (MONITOR_POLL_SECONDS=5) so an eviction is caught well within the pod's terminationGracePeriodSeconds (60s).
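
A file-based flag of the kind described above could look roughly like the sketch below. The names `is_worker_shutting_down` and `clear_shutdown_flag` mirror the ones mentioned in this PR, but the bodies, the `raise_shutdown_flag` helper and the flag path are assumptions for illustration, not the code in `celery_signals.py`.

```python
import tempfile
from pathlib import Path

# Assumed location; the PR does not state the exact path of the flag file.
SHUTDOWN_FLAG = Path(tempfile.gettempdir()) / "secator_worker_shutting_down.flag"


def raise_shutdown_flag(path: Path = SHUTDOWN_FLAG) -> None:
    """Create the marker file (hypothetical helper; the master would call it on shutdown)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def is_worker_shutting_down(path: Path = SHUTDOWN_FLAG) -> bool:
    """Any process on the same filesystem can check the marker, including pool children."""
    return path.exists()


def clear_shutdown_flag(path: Path = SHUTDOWN_FLAG) -> None:
    """Remove the marker; a missing file is not an error (Python 3.8+)."""
    path.unlink(missing_ok=True)


# Exercise the lifecycle against a throwaway path rather than the shared default.
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "shutdown.flag"
    states = [is_worker_shutting_down(flag)]
    raise_shutdown_flag(flag)
    states.append(is_worker_shutting_down(flag))
    clear_shutdown_flag(flag)
    states.append(is_worker_shutting_down(flag))
print(states)
```

A file is used here instead of an in-memory variable because the signal handler runs in the prefork master while the monitor thread runs in a child process, so they share no Python state.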

Goes through Celery's normal completion path, so the chord is satisfied the ordinary way — no pool change (stays prefork), no Celery upgrade, no revoke (which yields REVOKED/TaskRevokedError, the wrong tool here — that's for cancel, e.g. stop_runner).

Tests

- test_shutdown_flag_lifecycle: flag set/clear.
- test_monitor_stops_running_command_on_shutdown: a real sleep 30 command stops early (well under 30s) and emits the eviction Warning.

Validated end-to-end locally: evicting a worker mid-chord, the chord callback fired in ~5s (vs. the ~hours visibility timeout) with partial results from the evicted member.

Part of the OOM/eviction robustness set alongside #1199 (worker-loss retry cap) and #1202 (on_build runner identity).

🤖 Generated with Claude Code

https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

Summary by CodeRabbit

New Features
- Implemented graceful worker shutdown that terminates running commands and automatically saves partial results, preventing indefinite task hangs during system shutdown
- Enhanced command monitoring to responsively detect shutdown signals and exit long-running tasks quickly
Tests
- Added comprehensive unit tests validating worker shutdown eviction and command termination behavior

… proceeds

On a K8s pod eviction (SIGTERM -> grace -> SIGKILL) the worker running a task dies. With task_acks_late on the Redis broker, its message is only redelivered after the broker visibility timeout (hours, on the long-task pool), stalling the surrounding chord/workflow that whole time. Unlike a child OOM (where the master survives and task_reject_on_worker_lost requeues immediately), a whole-pod eviction has no surviving master, so Redis can only fall back to the visibility timeout.

Catch the worker_shutting_down signal and raise a flag the running task's monitor thread polls. On shutdown it stops the task early via the existing stop_process path, so the task returns its partial results through the normal completion path and the chord proceeds in seconds instead of waiting for the visibility timeout.

- celery_signals.py: SHUTDOWN_FLAG + worker_shutting_down_handler (wired unconditionally); stale flag cleared on worker boot.
- command.py: the monitor thread checks the flag (same self-stop used for timeout/memory limits) and caps its poll interval (MONITOR_POLL_SECONDS=5) so an eviction is caught well within the pod's terminationGracePeriodSeconds.
- tests: flag lifecycle + a behavioral test (a running command stops early and emits the eviction warning).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

coderabbitai · 2026-06-21T19:56:36Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 769d31df-c3ea-4935-90d6-11086a1bb6c3


Walkthrough

Adds a file-based worker shutdown flag (SHUTDOWN_FLAG) in celery_signals.py with helpers to set, check, and clear it. The flag is written when Celery fires worker_shutting_down. The command monitor loop in command.py polls this flag and, when set, sends SIGTERM to the running subprocess and exits early. A new unit test module validates the full eviction lifecycle.

Changes

Worker Eviction Self-Finalize

| Layer / File(s) | Summary |
| --- | --- |
| Shutdown flag API and Celery signal wiring `secator/celery_signals.py` | Defines `SHUTDOWN_FLAG` path, `is_worker_shutting_down()`, `clear_shutdown_flag()`, and `worker_shutting_down_handler()` that writes the marker on eviction. `setup_handlers()` is updated to clear stale flags on boot and register the handler unconditionally with `signals.worker_shutting_down`. |
| Command monitor shutdown detection `secator/runners/command.py` | Adds `MONITOR_POLL_SECONDS = 5`, imports `is_worker_shutting_down` inside `_monitor_process`, inserts a shutdown branch that enqueues a `Warning` and calls `stop_process(exit_ok=True)` with `SIGTERM`, and caps the monitor sleep to `min(stat_update_frequency, MONITOR_POLL_SECONDS)` for faster reaction. |
| Eviction unit tests `tests/unit/test_eviction.py` | `TestEvictionSelfFinalize` verifies flag set/clear lifecycle, confirms a `sleep 30` command terminates promptly when the shutdown flag is raised mid-execution (with `MONITOR_POLL_SECONDS` patched to speed up polling), and checks that an eviction warning appears in results. |

Sequence Diagram(s)

sequenceDiagram
    participant Celery as Celery Worker
    participant SignalHandler as worker_shutting_down_handler
    participant FlagFile as SHUTDOWN_FLAG (filesystem)
    participant Monitor as _monitor_process loop
    participant Subprocess as Running subprocess

    Celery->>SignalHandler: signals.worker_shutting_down fires
    SignalHandler->>FlagFile: touch(SHUTDOWN_FLAG)
    loop every min(stat_update_frequency, 5s)
        Monitor->>FlagFile: is_worker_shutting_down()
        FlagFile-->>Monitor: True
        Monitor->>Monitor: enqueue Warning("worker shutting down")
        Monitor->>Subprocess: SIGTERM via stop_process(exit_ok=True)
        Monitor->>Monitor: break — save incomplete results
    end
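
The polling loop in the sequence above can be sketched as follows. This is a simplified stand-in, not the body of `_monitor_process`: the callbacks (`is_running`, `is_shutting_down`, `stop`, `warn`) are hypothetical, with `stop` playing the role of `stop_process(exit_ok=True)`. Only `MONITOR_POLL_SECONDS` and the `min(...)` cap come from the PR.

```python
import threading
import time

MONITOR_POLL_SECONDS = 5  # cap named in the PR


def poll_interval(stat_update_frequency: float, cap: float = MONITOR_POLL_SECONDS) -> float:
    """Never sleep longer than the cap, so a shutdown is noticed within roughly `cap` seconds."""
    return min(stat_update_frequency, cap)


def monitor(is_running, is_shutting_down, stop, warn, interval: float) -> bool:
    """Poll until the process ends or shutdown is flagged; return True if it was evicted."""
    while is_running():
        if is_shutting_down():
            warn("worker shutting down, stopping task early")
            stop()  # stands in for stop_process(exit_ok=True)
            return True
        time.sleep(interval)
    return False


# Demo with in-memory events instead of a real subprocess and flag file.
running, shutdown, warnings = threading.Event(), threading.Event(), []
running.set()
threading.Timer(0.05, shutdown.set).start()  # "eviction" arrives shortly after start
evicted = monitor(running.is_set, shutdown.is_set, running.clear, warnings.append, interval=0.01)
print(evicted, warnings, poll_interval(30), poll_interval(2))
```

With a 5s cap and a 60s grace period, the monitor gets several polls before SIGKILL, which is the margin the PR description relies on.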

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

freelabz/secator#727: Introduces the _monitor_process subsystem in secator/runners/command.py that this PR extends with the is_worker_shutting_down() early-exit branch.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(celery): finalize in-flight task on worker eviction so the chord proceeds' directly summarizes the main change: gracefully finalizing in-flight tasks during worker eviction to allow chord workflows to proceed quickly. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@secator/celery_signals.py`:
- Around line 36-41: The worker_shutting_down_handler function silently swallows
all exceptions when writing to SHUTDOWN_FLAG, which can mask real issues like
missing or unwritable state directories and cause tasks to stall. Modify the
function to first ensure the parent directory of SHUTDOWN_FLAG exists by
creating it if necessary, then replace the broad exception handler with one that
specifically catches and logs at least OSError exceptions with an appropriate
error message, while still allowing the function to complete gracefully without
raising.
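
One way to apply this suggestion is sketched below, using only the standard library. The function name `write_shutdown_flag` and the logger name are hypothetical; the actual handler in `secator/celery_signals.py` is not shown in this thread, so treat this as an illustration of the pattern (create the parent directory, catch and log `OSError`, never raise) and not as the committed fix.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("eviction_sketch")  # hypothetical logger name


def write_shutdown_flag(flag: Path) -> bool:
    """Create the flag file; on OSError, log the failure and return False without raising."""
    try:
        flag.parent.mkdir(parents=True, exist_ok=True)  # state dir may not exist yet
        flag.touch()
        return True
    except OSError as exc:
        logger.error("could not write shutdown flag %s: %s", flag, exc)
        return False


with tempfile.TemporaryDirectory() as tmp:
    # Missing parent directories are created on demand.
    ok = write_shutdown_flag(Path(tmp) / "state" / "nested" / "shutdown.flag")

    # A regular file where a directory is expected makes mkdir raise FileExistsError,
    # which is an OSError subclass, so it is logged instead of propagating.
    blocker = Path(tmp) / "blocker"
    blocker.write_text("not a directory")
    failed = write_shutdown_flag(blocker / "shutdown.flag")
print(ok, failed)
```

Returning a boolean is a choice made for this sketch so the outcome is observable; a Celery signal handler would normally just log and return.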

In `@tests/unit/test_eviction.py`:
- Around line 19-23: The setUp and tearDown methods are operating on the real
global shutdown flag file, which causes test interference when running tests in
parallel or with shared STATE_DIR. Modify the setUp method to patch
secator.celery_signals.SHUTDOWN_FLAG to point to a temporary test-specific path
before calling clear_shutdown_flag(), and modify the tearDown method to clean up
that patch after calling clear_shutdown_flag() to ensure each test uses an
isolated shutdown flag path.
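
The isolation described here can be sketched with `unittest.mock.patch.object` and a temporary directory. To keep the example runnable on its own, a `types.SimpleNamespace` named `signals_mod` stands in for the `secator.celery_signals` module, and `clear_shutdown_flag` is a simplified stand-in; in the real test the patch target would presumably be `secator.celery_signals.SHUTDOWN_FLAG`.

```python
import tempfile
import types
import unittest
from pathlib import Path
from unittest import mock

# Stand-in for the secator.celery_signals module, so the sketch runs on its own.
signals_mod = types.SimpleNamespace(SHUTDOWN_FLAG=Path(tempfile.gettempdir()) / "global.flag")


def clear_shutdown_flag() -> None:
    """Simplified stand-in: reads the (possibly patched) path at call time."""
    signals_mod.SHUTDOWN_FLAG.unlink(missing_ok=True)


class TestIsolatedFlag(unittest.TestCase):
    def setUp(self):
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        patcher = mock.patch.object(
            signals_mod, "SHUTDOWN_FLAG", Path(self._tmp.name) / "shutdown.flag"
        )
        patcher.start()
        self.addCleanup(patcher.stop)  # cleanups run in reverse order: unpatch, then rmtree
        clear_shutdown_flag()

    def test_flag_lives_in_temp_dir(self):
        signals_mod.SHUTDOWN_FLAG.touch()
        self.assertTrue(signals_mod.SHUTDOWN_FLAG.exists())
        self.assertEqual(signals_mod.SHUTDOWN_FLAG.parent, Path(self._tmp.name))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIsolatedFlag)
result = unittest.TextTestRunner().run(suite)
```

`addCleanup` is used instead of an explicit `tearDown` so the patch is undone even if `setUp` fails partway; either approach satisfies the review comment.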

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0e1675a-0d72-4b1b-a024-c404d0ceaabe

📥 Commits

Reviewing files that changed from the base of the PR and between d39d440 and 4fe6fb1.

📒 Files selected for processing (3)

secator/celery_signals.py
secator/runners/command.py
tests/unit/test_eviction.py

…late test

The integration suite caught a real leak: SHUTDOWN_FLAG is a machine-global file, so once any worker fired worker_shutting_down (e.g. between test files) the flag persisted and every later task whose monitor polled once self-aborted; slow tasks failed, fast ones that finished before the first poll passed. Prod never hit it (1 task = 1 pod = fresh /tmp), but any shared/long-lived worker does.

- command.py: clear the flag at monitor start, so it only means "shutdown raised *during this run*", not a stale flag from a previous worker/task.
- celery_signals.py (CodeRabbit): ensure the flag's parent dir exists and catch + log OSError instead of swallowing all exceptions.
- test_eviction.py (CodeRabbit): isolate SHUTDOWN_FLAG to a per-test temp path; add a regression test that a flag set *before* a task starts does not stop it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
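
The leak and its fix can be reproduced in a few lines. This is a toy model, not secator code: `run_monitor` is a hypothetical function in which `polls` stands for how many times a task's monitor gets to check the flag before the task finishes.

```python
import tempfile
from pathlib import Path


def run_monitor(flag: Path, polls: int, clear_stale: bool) -> str:
    """Simulate a task whose monitor polls `polls` times; return how the task ended."""
    if clear_stale:
        flag.unlink(missing_ok=True)  # from here on, the flag means "raised during this run"
    for _ in range(polls):
        if flag.exists():
            return "aborted"
    return "completed"


with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "shutdown.flag"
    flag.touch()  # left behind by a previous worker
    fast_task = run_monitor(flag, polls=0, clear_stale=False)   # finishes before the first poll
    slow_task = run_monitor(flag, polls=3, clear_stale=False)   # sees the stale flag
    fixed_task = run_monitor(flag, polls=3, clear_stale=True)   # stale flag cleared at start
print(fast_task, slow_task, fixed_task)
```

The first two results match the symptom in the commit message (fast tasks pass, slow ones self-abort), and the third shows why clearing at monitor start removes the dependency on earlier workers.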

…ace)

The behavioral test set the flag once after a fixed 2s wait, but the monitor now clears any pre-existing flag at startup; on a slow CI runner the monitor started *after* that single set and wiped it, so the task never stopped and the test failed. Re-raise the flag in a loop until the task stops, so the monitor's poll sees it once it is running regardless of start timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
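
The re-raise pattern can be illustrated with threads and events in place of the flag file and the real command. Everything here is a hypothetical stand-in: `flag` models the flag file, `monitor` models a monitor thread that starts late and clears the flag on startup, and the timings are arbitrary.

```python
import threading
import time

flag = threading.Event()     # stands in for the flag file
stopped = threading.Event()  # set once the "task" has stopped


def monitor(start_delay: float) -> None:
    time.sleep(start_delay)  # slow runner: the monitor starts late
    flag.clear()             # clears any pre-existing flag at startup
    while not flag.is_set():
        time.sleep(0.01)
    stopped.set()


worker = threading.Thread(target=monitor, args=(0.1,), daemon=True)
worker.start()

deadline = time.monotonic() + 5  # overall guard so the loop cannot spin forever
attempts = 0
while not stopped.is_set() and time.monotonic() < deadline:
    flag.set()               # re-raise: a single set could be wiped by the startup clear
    attempts += 1
    stopped.wait(0.02)
worker.join(timeout=1)
print(stopped.is_set(), attempts)
```

Setting the flag once before the monitor's startup clear would leave the monitor waiting forever; repeating the set makes the outcome independent of when the monitor actually starts.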

The real-subprocess behavioral test was environment-dependent: in CI the monitor thread didn't stop the live `sleep` reliably, so the test flaked (and the stale test passed for the wrong reason). Replace both with deterministic tests that exercise _monitor_process directly against a bare Command: one asserts it calls stop_process(exit_ok=True) + emits the eviction Warning when the flag is set, the other asserts it clears a stale flag at start (the integration-leak regression). No subprocess, no timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

The real-file version of the stale-flag test flaked on CI (patched temp path + bare-__new__ Command), while the real behaviour is already proven green by the integration suite. Assert that _monitor_process calls clear_shutdown_flag() at start via a mock instead of inspecting the filesystem, which is deterministic and environment-independent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
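
A mock-based assertion of this kind might look like the sketch below. `FakeCommand` is a hypothetical stand-in for the real `Command` runner and injects its collaborators as constructor arguments; the actual test presumably patches `clear_shutdown_flag` and `is_worker_shutting_down` where `command.py` imports them.

```python
from unittest import mock


class FakeCommand:
    """Hypothetical stand-in for the Command runner; only what the sketch needs."""

    def __init__(self, clear_flag, is_shutting_down):
        self.clear_flag = clear_flag
        self.is_shutting_down = is_shutting_down
        self.stop_process = mock.Mock()
        self.warnings = []

    def _monitor_process(self, max_polls: int = 3) -> None:
        self.clear_flag()  # drop any stale flag before the first poll
        for _ in range(max_polls):
            if self.is_shutting_down():
                self.warnings.append("worker shutting down")
                self.stop_process(exit_ok=True)
                return


clear = mock.Mock()
cmd = FakeCommand(clear_flag=clear, is_shutting_down=mock.Mock(return_value=True))
cmd._monitor_process()

# No filesystem and no timing: the assertions only inspect recorded calls.
clear.assert_called_once_with()
cmd.stop_process.assert_called_once_with(exit_ok=True)
print(cmd.warnings)
```

Because nothing touches the filesystem or sleeps, the result does not depend on the CI runner's speed or on a shared state directory.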

coderabbitai Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread secator/celery_signals.py Outdated

Comment thread tests/unit/test_eviction.py

ocervell and others added 6 commits June 21, 2026 22:13

chore: re-trigger CI

10cce2e

chore: re-trigger CI

25e2dc8

ocervell added the feature:worker-reliability Celery worker reliability & redelivery label Jun 24, 2026

feat(celery): finalize in-flight task on worker eviction so the chord proceeds #1203

ocervell wants to merge 7 commits into main from feat/worker-eviction-finalize
