Dead worker pods hold task claims indefinitely — no stale task reclaim #624

@jr-neeve

Problem

When a Hindsight worker pod is terminated (restart, deploy, OOM, node eviction), it does not release its claimed tasks in the async_operations table. The replacement pod gets a new hostname and will never pick up the old pod's tasks. The stuck task remains in processing state forever.

Observed behavior

  • Worker 31acaa9db8aa had a consolidation task claimed since Feb 10, 2026 — stuck for 38 days
  • Worker stats logged "global: pending=1 | others: 31acaa9db8aa:1 | my_active: none" every 30 seconds for the entire duration
  • The current pod had all 10 slots available but would not pick up the task because it was claimed by another worker
  • 7 dead worker pods have historical task claims in the async_operations table from previous deployments

Impact

  • Any file_convert_retain or batch_retain task claimed by a dead worker is permanently lost
  • Uploaded documents never get processed — users see documents stuck in pending until the client-side polling times out
  • global: pending=N grows over time with each pod restart

Evidence

2026-03-19 21:50:17,200 - INFO - hindsight_api.worker.poller - [WORKER_STATS]
  worker=hindsight-robin-kb-service-hindsight-6f985468d6-h55lr
  slots=0/10 (consolidation=0/2) | available=10 (consolidation=2) |
  global: pending=1 (schemas: hindsight) | others: 31acaa9db8aa:1 | my_active: none

-- Stuck task in async_operations
SELECT operation_id, operation_type, status, worker_id, claimed_at
FROM hindsight.async_operations
WHERE status = 'processing' AND worker_id = '31acaa9db8aa';

-- Result: consolidation task claimed 2026-02-10, still "processing" 38 days later
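
The query above targets one known worker. To surface every dead worker's claims at once, the rows with status = 'processing' could be grouped by worker_id. The following is only a sketch: summarize_claims is a hypothetical helper, and it assumes the rows have already been fetched as (worker_id, claimed_at) pairs.

```python
from collections import defaultdict
from datetime import datetime


def summarize_claims(rows, now):
    """Group 'processing' claims by worker.

    rows: iterable of (worker_id, claimed_at) pairs (assumed shape).
    Returns {worker_id: (claim_count, age_in_days_of_oldest_claim)}.
    """
    grouped = defaultdict(list)
    for worker_id, claimed_at in rows:
        grouped[worker_id].append(claimed_at)
    return {
        worker_id: (len(claims), (now - min(claims)).days)
        for worker_id, claims in grouped.items()
    }
```

Any worker_id in the result that does not match a running pod is a candidate for manual cleanup.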

Current workaround

Manual cancellation via the API:

DELETE /v1/default/banks/{bank_id}/operations/{stuck_operation_id}

Suggested fix

Add a stale task reclaim mechanism. This is standard in production task queues (Celery's visibility_timeout, SQS's visibility timeout, Sidekiq Pro's super_fetch):

  • Option A (simplest): Periodic sweeper (cron job or background task) that runs every 5-10 minutes:

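    -- Return long-held claims to the queue. The 30-minute cutoff must exceed
    -- the longest legitimate task runtime, or live tasks will be re-queued.
    UPDATE async_operations
    SET status = 'pending', worker_id = NULL, claimed_at = NULL
    WHERE status = 'processing'
      AND claimed_at < NOW() - INTERVAL '30 minutes';

  • Option B (robust): Worker heartbeat table. Each worker updates its heartbeat every 30 seconds. A sweeper releases all tasks held by any worker_id that has not sent a heartbeat in the last 2 minutes.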

  • Option C (Kubernetes-aware): On startup, query for tasks claimed by workers that are no longer running and release them.
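
The release rule in Option B could look roughly like the sketch below. It models the decision in plain Python rather than against the real schema: Task, release_dead_worker_tasks, and the heartbeats mapping are hypothetical names, and treating a worker with no heartbeat record as dead is an assumption. In practice this would more likely be a single UPDATE joined against the heartbeat table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

HEARTBEAT_TIMEOUT = timedelta(minutes=2)  # threshold proposed in Option B


@dataclass
class Task:
    # Hypothetical in-memory stand-in for an async_operations row.
    operation_id: str
    status: str
    worker_id: Optional[str] = None
    claimed_at: Optional[datetime] = None


def release_dead_worker_tasks(
    tasks: List[Task],
    heartbeats: Dict[str, datetime],
    now: datetime,
    timeout: timedelta = HEARTBEAT_TIMEOUT,
) -> List[str]:
    """Reset tasks whose claiming worker has no recent heartbeat.

    heartbeats maps worker_id to its last heartbeat time. A worker with
    no entry is treated as dead (an assumption of this sketch).
    Returns the operation_ids that were released.
    """
    released = []
    for task in tasks:
        if task.status != "processing" or task.worker_id is None:
            continue
        last_seen = heartbeats.get(task.worker_id)
        if last_seen is None or now - last_seen > timeout:
            task.status = "pending"
            task.worker_id = None
            task.claimed_at = None
            released.append(task.operation_id)
    return released
```

Unlike the fixed cutoff in Option A, this rule leaves a long-running task alone as long as its worker keeps sending heartbeats.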

Environment

  • Hindsight deployed as a single-pod StatefulSet in Kubernetes
  • hindsight-client SDK v0.4.14
  • PostgreSQL-backed async_operations task queue
