Dead worker pods hold task claims indefinitely — no stale task reclaim

## Problem

When a Hindsight worker pod is terminated (restart, deploy, OOM, node eviction), it does not release its claimed tasks in the `async_operations` table. The replacement pod gets a new hostname and will **never** pick up the old pod's tasks. The stuck task remains in `processing` state forever.

## Observed behavior

- Worker `31acaa9db8aa` had a `consolidation` task claimed since **Feb 10, 2026** — stuck for 38 days
- Worker stats logged `global: pending=1 | others: 31acaa9db8aa:1 | my_active: none` every 30 seconds for the entire duration
- The current pod had all 10 slots available but would not pick up the task because it was claimed by another worker
- 7 dead worker pods have historical task claims in the `async_operations` table from previous deployments

## Impact

- Any `file_convert_retain` or `batch_retain` task claimed by a dead worker is permanently lost
- Uploaded documents never get processed — users see documents stuck in `pending` until the client-side polling times out
- `global: pending=N` grows over time with each pod restart

## Evidence

```
2026-03-19 21:50:17,200 - INFO - hindsight_api.worker.poller - [WORKER_STATS]
  worker=hindsight-robin-kb-service-hindsight-6f985468d6-h55lr
  slots=0/10 (consolidation=0/2) | available=10 (consolidation=2) |
  global: pending=1 (schemas: hindsight) | others: 31acaa9db8aa:1 | my_active: none
```

```sql
-- Stuck task in async_operations
SELECT operation_id, operation_type, status, worker_id, claimed_at
FROM hindsight.async_operations
WHERE status = 'processing' AND worker_id = '31acaa9db8aa';

-- Result: consolidation task claimed 2026-02-10, still "processing" 38 days later
```

## Current workaround

Manual cancellation via the API:
```
DELETE /v1/default/banks/{bank_id}/operations/{stuck_operation_id}
```

## Suggested fix

Add a **stale task reclaim mechanism**, which is standard in production task queues (Celery's `visibility_timeout`, SQS's visibility timeout, Sidekiq Pro's `super_fetch` orphan recovery):

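- **Option A (simplest):** Periodic sweeper (cron job or background task) that runs every 5-10 minutes. The 30-minute threshold must exceed the longest legitimate task runtime, otherwise a slow but healthy task gets handed to a second worker:
  ```sql
  -- Return claims older than the threshold to the queue.
  -- Schema-qualified to match the query in Evidence; repeat per schema if there are several.
  UPDATE hindsight.async_operations
  SET status = 'pending', worker_id = NULL, claimed_at = NULL
  WHERE status = 'processing'
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  ```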

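- **Option B (robust):** Worker heartbeat table. Each worker updates its heartbeat row every 30 seconds. A sweeper releases all tasks claimed by any `worker_id` that has not sent a heartbeat in the last 2 minutes. Unlike Option A, this does not depend on how long a task legitimately runs.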

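- **Option C (Kubernetes-aware):** On startup, each worker queries for tasks claimed by workers that are no longer running and releases them. This needs a source of truth for liveness, such as the pod list from the Kubernetes API. In a single-pod deployment, any claim held by a different `worker_id` at startup is stale by definition.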

## Environment

- Hindsight deployed as a single-pod StatefulSet in Kubernetes
- `hindsight-client` SDK v0.4.14
- PostgreSQL-backed `async_operations` task queue

Dead worker pods hold task claims indefinitely — no stale task reclaim #624
