Description
Problem
When a Hindsight worker pod is terminated (restart, deploy, OOM, node eviction), it does not release its claimed tasks in the async_operations table. The replacement pod gets a new hostname and will never pick up the old pod's tasks. The stuck task remains in processing state forever.
Observed behavior
- Worker `31acaa9db8aa` had a `consolidation` task claimed since Feb 10, 2026, stuck for 38 days
- Worker stats logged `global: pending=1 | others: 31acaa9db8aa:1 | my_active: none` every 30 seconds for the entire duration
- The current pod had all 10 slots available but would not pick up the task because it was claimed by another worker
- 7 dead worker pods have historical task claims in the `async_operations` table from previous deployments
Impact
- Any `file_convert_retain` or `batch_retain` task claimed by a dead worker is permanently lost
- Uploaded documents never get processed; users see documents stuck in `pending` until the client-side polling times out
- `global: pending=N` grows over time with each pod restart
Evidence
2026-03-19 21:50:17,200 - INFO - hindsight_api.worker.poller - [WORKER_STATS]
worker=hindsight-robin-kb-service-hindsight-6f985468d6-h55lr
slots=0/10 (consolidation=0/2) | available=10 (consolidation=2) |
global: pending=1 (schemas: hindsight) | others: 31acaa9db8aa:1 | my_active: none
-- Stuck task in async_operations
SELECT operation_id, operation_type, status, worker_id, claimed_at
FROM hindsight.async_operations
WHERE status = 'processing' AND worker_id = '31acaa9db8aa';
-- Result: consolidation task claimed 2026-02-10, still "processing" 38 days later
Current workaround
Manual cancellation via the API:
DELETE /v1/default/banks/{bank_id}/operations/{stuck_operation_id}
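For anyone scripting this workaround, a minimal Python sketch of building that request follows. Only the path comes from the endpoint above; the base URL, the port, and the `build_cancel_request` helper are assumptions for illustration, and any authentication your deployment requires is not shown.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL and port: replace with the address of your Hindsight API.
BASE_URL = "http://localhost:8888"


def build_cancel_request(bank_id: str, operation_id: str,
                         base_url: str = BASE_URL) -> Request:
    """Build (but do not send) the DELETE request that cancels one operation."""
    # Percent-encode both identifiers so unusual characters cannot alter the path.
    path = (
        f"/v1/default/banks/{quote(bank_id, safe='')}"
        f"/operations/{quote(operation_id, safe='')}"
    )
    return Request(base_url.rstrip("/") + path, method="DELETE")
```

Passing the returned object to `urllib.request.urlopen` would send it. The stuck operation IDs can be taken from the `operation_id` column of the query in the Evidence section.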
Suggested fix
Add a stale task reclaim mechanism — standard in production task queues (Celery's visibility_timeout, SQS's message visibility, Sidekiq's death handler):
- Option A (simplest): Periodic sweeper (cron job or background task) that runs every 5-10 minutes:

      UPDATE async_operations
      SET status = 'pending', worker_id = NULL, claimed_at = NULL
      WHERE status = 'processing'
        AND claimed_at < NOW() - INTERVAL '30 minutes';

- Option B (robust): Worker heartbeat table. Workers update their heartbeat every 30 seconds. Sweeper checks: if `worker_id` hasn't heartbeated in 2 minutes, release all its tasks.
- Option C (Kubernetes-aware): On startup, query for tasks claimed by workers that are no longer running and release them.
Environment
- Hindsight deployed as a single-pod StatefulSet in Kubernetes
- `hindsight-client` SDK v0.4.14
- PostgreSQL-backed `async_operations` task queue