Summary
The job engine collector (collector_loop in wbia/web/job_engine.py) is a single-threaded blocking loop that serializes ALL job status queries, result storage, and metadata operations. Combined with the client-side _collect_lock mutex, this creates a convoy effect under concurrent load.
Architecture
[Gunicorn Thread 1] ─┐
[Gunicorn Thread 2] ─┤── _collect_lock ──→ [ZMQ DEALER] ──→ [Collector ROUTER]
[Gunicorn Thread 3] ─┤                                       (single-threaded)
        ...          │                                        shelve I/O
[Gunicorn Thread 16] ┘                                        callback HTTP
Why the collector is inherently single-threaded
This is not just an implementation choice — three hard constraints force serialization:
- Python shelve (DBM backend): Not thread-safe, not multi-process-safe. All job metadata and results are persisted via shelve. Concurrent access causes corruption.
- collector_data dict: The in-memory job index is a plain Python dict in a single process with no synchronization.
- ZMQ ROUTER socket: ZMQ sockets are not thread-safe. The recv → process → send loop must run on one thread.
Making the collector multi-threaded would require replacing all three layers.
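For concreteness, the storage layer reduces to roughly the following sketch (simplified, not the actual implementation): every read and write reopens the shelve file, and nothing guards the in-memory dict, so correctness rests entirely on only one thread ever touching either of them.

```python
import shelve

# Simplified sketch of the collector's storage layer (not the actual code):
# a plain dict as the in-memory job index plus shelve files for persistence.
collector_data = {}  # no lock: safe only because a single thread ever touches it

def get_shelve_value(shelve_path, key):
    # shelve (DBM-backed) is neither thread-safe nor multi-process-safe;
    # concurrent access can corrupt the underlying DBM file.
    with shelve.open(shelve_path) as shelf:
        return shelf.get(key)

def set_shelve_value(shelve_path, key, value):
    with shelve.open(shelve_path) as shelf:
        shelf[key] = value
```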
Client side (web worker process)
- _collect_lock (threading.Lock) ensures only one thread talks to the collector at a time
- All 16 threads compete for this single lock
- Lock is held for the entire ZMQ send → recv round-trip (up to 120s timeout)
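A simplified sketch of that client-side pattern (the endpoint and names are illustrative; the 120 s value matches the RCVTIMEO timeout noted above):

```python
import threading
import zmq

_collect_lock = threading.Lock()

context = zmq.Context.instance()
collect_deal_sock = context.socket(zmq.DEALER)
collect_deal_sock.setsockopt(zmq.RCVTIMEO, 120 * 1000)  # 120 s receive timeout
collect_deal_sock.connect('tcp://127.0.0.1:5555')        # illustrative endpoint

def _collect_request(request):
    # The lock is held across the full send -> recv round-trip, so only one of
    # the 16 gunicorn threads can talk to the collector at any moment.
    with _collect_lock:
        collect_deal_sock.send_json(request)
        try:
            return collect_deal_sock.recv_json()
        except zmq.Again:
            # RCVTIMEO expired; the DEALER is now unusable and must be
            # destroyed and recreated (see Impact below).
            raise
```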
Server side (collector process)
- collector_loop (line 1882): while True: recv → process → reply
- Completely single-threaded — no async, no thread pool
- Shelve I/O (get_shelve_value, set_shelve_value) is blocking
- HTTP callbacks to external services are blocking
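The server side reduces to a loop of roughly this shape (a sketch, with a stub handler standing in for the real shelve and callback work):

```python
import json
import zmq

def handle_request(payload):
    # Stand-in for the real handler: blocking shelve reads/writes and HTTP
    # callbacks would happen here, all on this single thread.
    request = json.loads(payload)
    return json.dumps({'status': 'ok', 'action': request.get('action')}).encode()

def collector_loop(port=5555):
    # Simplified sketch of the collector's recv -> process -> reply loop.
    context = zmq.Context.instance()
    router = context.socket(zmq.ROUTER)
    router.bind('tcp://*:{}'.format(port))  # illustrative bind address
    while True:
        # ROUTER prepends the sender's identity frame to each incoming message.
        identity, payload = router.recv_multipart()
        reply = handle_request(payload)      # one request at a time, start to finish
        router.send_multipart([identity, reply])
```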
Impact
- Convoy effect: If request A takes 5 seconds in the collector, requests B through P queue behind it.
- Heartbeat starvation: prometheus_update() calls get_job_status() which acquires _collect_lock. While the collector processes this, all other threads are blocked.
- Timeout cascading: If the collector is slow, threads hit the 120s RCVTIMEO timeout. After a timeout, the DEALER socket must be destroyed and recreated (sketched below).
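The recovery path after such a timeout looks roughly like the following (hypothetical helper name); the point is that the DEALER cannot simply be retried in place:

```python
import zmq

def _recreate_dealer(context, old_sock, endpoint):
    # Hypothetical helper: after a RCVTIMEO timeout there may still be an
    # orphaned reply in flight, so the safe recovery is close-and-reconnect.
    old_sock.close(linger=0)
    new_sock = context.socket(zmq.DEALER)
    new_sock.setsockopt(zmq.RCVTIMEO, 120 * 1000)
    new_sock.connect(endpoint)
    return new_sock
```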
Mitigations Already Applied
These reduce the pressure but don't remove the bottleneck:
- Aggressive caching in JOB_STATUS_CACHE — fewer shelve reads per request
- Batch registration at startup — one ZMQ round-trip instead of one per historical job
- PROMETHEUS_LIMIT=30 — expensive prometheus refresh runs ~once/minute instead of every heartbeat
- _PROMETHEUS_BUSY guard — at most one thread runs prometheus refresh at a time (see the sketch after this list)
- limit parameter on job_status_dict — collector only processes/serializes N jobs instead of all
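As referenced above, the busy-guard mitigation amounts to a non-blocking lock acquire, roughly like this sketch (function names are illustrative):

```python
import threading

_PROMETHEUS_BUSY = threading.Lock()

def refresh_job_metrics():
    # Stand-in for the expensive path that walks job statuses via the collector.
    pass

def prometheus_update():
    # Non-blocking acquire: if another thread already holds the guard, skip this
    # heartbeat's refresh instead of queueing behind it.
    if not _PROMETHEUS_BUSY.acquire(blocking=False):
        return
    try:
        refresh_job_metrics()
    finally:
        _PROMETHEUS_BUSY.release()
```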
What Would Actually Fix This
The real fix is eliminating the collector as a bottleneck by replacing its storage layer:
- Replace shelve with SQLite/PostgreSQL for job metadata and results. Both handle concurrent reads natively.
- Query the job database directly from web threads instead of routing through ZMQ to the collector. This eliminates both _collect_lock contention and the single-threaded collector process.
- This is essentially what Celery + Redis/PostgreSQL provides out of the box — a standard task queue with a concurrent result backend, monitoring, and horizontal scaling.
The ZMQ + shelve architecture was appropriate when WBIA was a single-user research tool, but it doesn't scale to production multi-threaded serving.
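As a rough sketch of that direction, assuming a hypothetical jobs table in SQLite: each web thread queries job state directly, with no ZMQ hop and no shared lock (SQLite in WAL mode allows concurrent readers alongside a single writer).

```python
import sqlite3

def get_job_status(db_path, job_id):
    # Hypothetical schema: jobs(job_id TEXT PRIMARY KEY, status TEXT, result TEXT).
    # Each web thread opens its own connection; no _collect_lock, no collector hop.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('PRAGMA journal_mode=WAL')  # readers do not block the writer
        row = conn.execute(
            'SELECT status, result FROM jobs WHERE job_id = ?', (job_id,)
        ).fetchone()
        return {'jobid': job_id, 'status': row[0] if row else 'unknown'}
    finally:
        conn.close()
```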
Files
wbia/web/job_engine.py:1914 — collector_loop
wbia/web/job_engine.py:886-897 — _collect_request with lock
wbia/web/job_engine.py:873-884 — _engine_request with lock
wbia/web/job_engine.py:681-695 — get_shelve_value (shelve I/O)