Summary
The job engine collector (collector_loop in wbia/web/job_engine.py) is a single-threaded blocking loop that serializes ALL job status queries, result storage, and metadata operations. Combined with the client-side _collect_lock mutex, this creates a convoy effect under concurrent load.
Architecture
[Gunicorn Thread 1] ─┐
[Gunicorn Thread 2] ─┤── _collect_lock ──→ [ZMQ DEALER] ──→ [Collector ROUTER]
[Gunicorn Thread 3] ─┤                                       (single-threaded)
        ...          │                                        shelve I/O
[Gunicorn Thread 16] ┘                                        callback HTTP
Why the collector is inherently single-threaded
This is not just an implementation choice — three hard constraints force serialization:
- Python shelve (DBM backend): Not thread-safe, not multi-process-safe. All job metadata and results are persisted via shelve. Concurrent access causes corruption.
- collector_data dict: The in-memory job index is a plain Python dict in a single process with no synchronization.
- ZMQ ROUTER socket: ZMQ sockets are not thread-safe. The recv → process → send loop must run on one thread.
Making the collector multi-threaded would require replacing all three layers.
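For concreteness, the storage layer reduces to roughly the following sketch (simplified, not the actual implementation): every read and write reopens the shelve file, and nothing guards the in-memory dict, so correctness rests entirely on only one thread ever touching either of them.

```python
import shelve

# Simplified sketch of the collector's storage layer (not the actual code):
# a plain dict as the in-memory job index plus shelve files for persistence.
collector_data = {}  # no lock: safe only because a single thread ever touches it

def get_shelve_value(shelve_path, key):
    # shelve (DBM-backed) is neither thread-safe nor multi-process-safe;
    # concurrent access can corrupt the underlying DBM file.
    with shelve.open(shelve_path) as shelf:
        return shelf.get(key)

def set_shelve_value(shelve_path, key, value):
    with shelve.open(shelve_path) as shelf:
        shelf[key] = value
```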
Client side (web worker process)
- _collect_lock (threading.Lock) ensures only one thread talks to the collector at a time
- All 16 threads compete for this single lock
- Lock is held for the entire ZMQ send → recv round-trip (up to 120s timeout)
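A simplified sketch of that client-side pattern (the endpoint and names are illustrative; the 120 s value matches the RCVTIMEO timeout noted above):

```python
import threading
import zmq

_collect_lock = threading.Lock()

context = zmq.Context.instance()
collect_deal_sock = context.socket(zmq.DEALER)
collect_deal_sock.setsockopt(zmq.RCVTIMEO, 120 * 1000)  # 120 s receive timeout
collect_deal_sock.connect('tcp://127.0.0.1:5555')        # illustrative endpoint

def _collect_request(request):
    # The lock is held across the full send -> recv round-trip, so only one of
    # the 16 gunicorn threads can talk to the collector at any moment.
    with _collect_lock:
        collect_deal_sock.send_json(request)
        try:
            return collect_deal_sock.recv_json()
        except zmq.Again:
            # RCVTIMEO expired; the DEALER is now unusable and must be
            # destroyed and recreated (see Impact below).
            raise
```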
Server side (collector process)
- collector_loop (line 1882): while True: recv → process → reply
- Completely single-threaded — no async, no thread pool
- Shelve I/O (get_shelve_value, set_shelve_value) is blocking
- HTTP callbacks to external services are blocking
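The server side reduces to a loop of roughly this shape (a sketch, with a stub handler standing in for the real shelve and callback work):

```python
import json
import zmq

def handle_request(payload):
    # Stand-in for the real handler: blocking shelve reads/writes and HTTP
    # callbacks would happen here, all on this single thread.
    request = json.loads(payload)
    return json.dumps({'status': 'ok', 'action': request.get('action')}).encode()

def collector_loop(port=5555):
    # Simplified sketch of the collector's recv -> process -> reply loop.
    context = zmq.Context.instance()
    router = context.socket(zmq.ROUTER)
    router.bind('tcp://*:{}'.format(port))  # illustrative bind address
    while True:
        # ROUTER prepends the sender's identity frame to each incoming message.
        identity, payload = router.recv_multipart()
        reply = handle_request(payload)      # one request at a time, start to finish
        router.send_multipart([identity, reply])
```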
Impact
- Convoy effect: If request A takes 5 seconds in the collector, requests B through P queue behind it.
- Heartbeat starvation: prometheus_update() calls get_job_status() which acquires _collect_lock. While the collector processes this, all other threads are blocked.
- Timeout cascading: If the collector is slow, threads hit the 120s RCVTIMEO timeout. After a timeout, the DEALER socket must be destroyed and recreated (sketched below).
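The recovery path after such a timeout looks roughly like the following (hypothetical helper name); the point is that the DEALER cannot simply be retried in place:

```python
import zmq

def _recreate_dealer(context, old_sock, endpoint):
    # Hypothetical helper: after a RCVTIMEO timeout there may still be an
    # orphaned reply in flight, so the safe recovery is close-and-reconnect.
    old_sock.close(linger=0)
    new_sock = context.socket(zmq.DEALER)
    new_sock.setsockopt(zmq.RCVTIMEO, 120 * 1000)
    new_sock.connect(endpoint)
    return new_sock
```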
Mitigations Already Applied
These reduce the pressure but don't remove the bottleneck:
- Aggressive caching in JOB_STATUS_CACHE — fewer shelve reads per request
- Batch registration at startup — one ZMQ round-trip instead of one per historical job
- PROMETHEUS_LIMIT=30 — expensive prometheus refresh runs ~once/minute instead of every heartbeat
- _PROMETHEUS_BUSY guard — at most one thread runs prometheus refresh at a time (see the sketch after this list)
- limit parameter on job_status_dict — collector only processes/serializes N jobs instead of all
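As referenced above, the busy-guard mitigation amounts to a non-blocking lock acquire, roughly like this sketch (function names are illustrative):

```python
import threading

_PROMETHEUS_BUSY = threading.Lock()

def refresh_job_metrics():
    # Stand-in for the expensive path that walks job statuses via the collector.
    pass

def prometheus_update():
    # Non-blocking acquire: if another thread already holds the guard, skip this
    # heartbeat's refresh instead of queueing behind it.
    if not _PROMETHEUS_BUSY.acquire(blocking=False):
        return
    try:
        refresh_job_metrics()
    finally:
        _PROMETHEUS_BUSY.release()
```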
What Would Actually Fix This
The real fix is eliminating the collector as a bottleneck by replacing its storage layer:
- Replace shelve with SQLite/PostgreSQL for job metadata and results. Both handle concurrent reads natively.
- Query the job database directly from web threads instead of routing through ZMQ to the collector. This eliminates both _collect_lock contention and the single-threaded collector process.
- This is essentially what Celery + Redis/PostgreSQL provides out of the box — a standard task queue with a concurrent result backend, monitoring, and horizontal scaling.
The ZMQ + shelve architecture was appropriate when WBIA was a single-user research tool, but it doesn't scale to production multi-threaded serving.
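As a rough sketch of that direction, assuming a hypothetical jobs table in SQLite: each web thread queries job state directly, with no ZMQ hop and no shared lock (SQLite in WAL mode allows concurrent readers alongside a single writer).

```python
import sqlite3

def get_job_status(db_path, job_id):
    # Hypothetical schema: jobs(job_id TEXT PRIMARY KEY, status TEXT, result TEXT).
    # Each web thread opens its own connection; no _collect_lock, no collector hop.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('PRAGMA journal_mode=WAL')  # readers do not block the writer
        row = conn.execute(
            'SELECT status, result FROM jobs WHERE job_id = ?', (job_id,)
        ).fetchone()
        return {'jobid': job_id, 'status': row[0] if row else 'unknown'}
    finally:
        conn.close()
```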
Files
wbia/web/job_engine.py:1914 — collector_loop
wbia/web/job_engine.py:886-897 — _collect_request with lock
wbia/web/job_engine.py:873-884 — _engine_request with lock
wbia/web/job_engine.py:681-695 — get_shelve_value (shelve I/O)