
Collector is a single-threaded bottleneck for all job engine operations #315

@JasonWildMe

Summary

The job engine collector (collector_loop in wbia/web/job_engine.py) is a single-threaded blocking loop that serializes ALL job status queries, result storage, and metadata operations. Combined with the client-side _collect_lock mutex, this creates a convoy effect under concurrent load.

Architecture

[Gunicorn Thread 1] ─┐
[Gunicorn Thread 2] ─┤── _collect_lock ──→ [ZMQ DEALER] ──→ [Collector ROUTER]
[Gunicorn Thread 3] ─┤                                        (single-threaded)
    ...               │                                        shelve I/O
[Gunicorn Thread 16]──┘                                        callback HTTP

Why the collector is inherently single-threaded

This is not just an implementation choice — three hard constraints force serialization:

  1. Python shelve (DBM backend): Not thread-safe, not multi-process-safe. All job metadata and results are persisted via shelve. Concurrent access causes corruption.

  2. collector_data dict: The in-memory job index is a plain Python dict in a single process with no synchronization.

  3. ZMQ ROUTER socket: ZMQ sockets are not thread-safe. The recv → process → send loop must run on one thread.

Making the collector multi-threaded would require replacing all three layers.
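The forced serialization can be pictured with a minimal single-owner sketch (stdlib only, not WBIA code; all names here are illustrative): one thread owns the shelf, the index dict, and the request channel, and all other threads submit work through a queue, which is essentially what the ZMQ ROUTER loop does for the collector.

```python
import queue
import shelve
import tempfile
import threading

requests: "queue.Queue[tuple]" = queue.Queue()

def collector(shelf_path, results):
    # Only this thread ever touches the shelf or the index dict,
    # so neither needs to be thread-safe.
    index = {}  # stand-in for collector_data
    with shelve.open(shelf_path) as shelf:  # stand-in for the shelve backend
        while True:
            op, key, value, done = requests.get()  # "recv"
            if op == "stop":
                done.set()
                break
            if op == "put":
                index[key] = value
                shelf[key] = value
            elif op == "get":
                results[key] = shelf.get(key, index.get(key))
            done.set()  # "reply"

results = {}
path = tempfile.mktemp(suffix=".shelf")
t = threading.Thread(target=collector, args=(path, results))
t.start()

def submit(op, key=None, value=None):
    # Every caller blocks until the single owner thread has replied.
    done = threading.Event()
    requests.put((op, key, value, done))
    done.wait()

submit("put", "job-1", "completed")
submit("get", "job-1")
submit("stop")
t.join()
print(results["job-1"])  # → completed
```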

Client side (web worker process)

  • _collect_lock (threading.Lock) ensures only one thread talks to the collector at a time
  • All 16 threads compete for this single lock
  • Lock is held for the entire ZMQ send → recv round-trip (up to 120s timeout)
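A hedged sketch of the client-side pattern (illustrative names, not WBIA's actual code): because the lock wraps the full round-trip, one slow reply stalls every other caller, no matter how cheap their own requests are.

```python
import threading
import time

_collect_lock = threading.Lock()

def collect_request(send_recv, timeout_s=120):
    # The lock is held across the entire round-trip; if send_recv takes
    # 5 s, every other thread waits at least that long.
    with _collect_lock:
        return send_recv(timeout_s)

order = []

def slow(timeout_s):
    time.sleep(0.2)  # stand-in for a slow collector reply
    order.append("slow")
    return "slow-reply"

def fast(timeout_s):
    order.append("fast")
    return "fast-reply"

t = threading.Thread(target=collect_request, args=(slow,))
t.start()
time.sleep(0.05)       # let the slow request grab the lock first
collect_request(fast)  # convoys behind the slow one
t.join()
print(order)  # → ['slow', 'fast']
```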

Server side (collector process)

  • collector_loop (line 1882): while True: recv → process → reply
  • Completely single-threaded — no async, no thread pool
  • Shelve I/O (get_shelve_value, set_shelve_value) is blocking
  • HTTP callbacks to external services are blocking
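The effect of that loop shape can be sketched with a stdlib queue standing in for the ROUTER socket (illustrative names, not WBIA code): requests are handled strictly one at a time, so a blocking step in one request (shelve I/O, an HTTP callback) delays every request queued behind it.

```python
import queue
import threading
import time

inbox = queue.Queue()
replies = {}

def collector_loop():
    while True:
        name, work, reply_evt = inbox.get()   # recv
        work()                                # process (blocking)
        replies[name] = time.monotonic()      # reply
        reply_evt.set()
        if name == "last":
            break

t = threading.Thread(target=collector_loop)
t.start()

evts = {n: threading.Event() for n in ("slow", "fast", "last")}
start = time.monotonic()
inbox.put(("slow", lambda: time.sleep(0.2), evts["slow"]))  # e.g. an HTTP callback
inbox.put(("fast", lambda: None, evts["fast"]))
inbox.put(("last", lambda: None, evts["last"]))
evts["last"].wait()
t.join()

# The cheap request still paid for the slow one queued ahead of it:
print(replies["fast"] - start >= 0.2)  # → True
```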

Impact

  1. Convoy effect: If request A takes 5 seconds in the collector, requests B through P queue behind it.

  2. Heartbeat starvation: prometheus_update() calls get_job_status() which acquires _collect_lock. While the collector processes this, all other threads are blocked.

  3. Timeout cascading: If the collector is slow, threads hit the 120s RCVTIMEO timeout. After a timeout, the DEALER socket must be destroyed and recreated.
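The recovery cost after a timeout can be sketched as follows (a stdlib queue stands in for the DEALER socket; names are illustrative). After a timed-out request the channel is in an unknown state, since a late reply could still arrive on it, so it must be torn down and rebuilt, and that cost lands on subsequent callers too.

```python
import queue

class Channel:
    """Stand-in for a DEALER socket: request/reply with a timeout."""
    def __init__(self):
        self.inbox = queue.Queue()

    def request(self, reply, timeout):
        if reply is not None:
            self.inbox.put(reply)  # simulate the collector answering
        try:
            return self.inbox.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no reply within timeout")

channel = Channel()
recreated = 0

def safe_request(reply, timeout=0.05):
    global channel, recreated
    try:
        return channel.request(reply, timeout)
    except TimeoutError:
        # A late reply could still arrive on the old channel, so it is
        # discarded and a fresh one created before the next request.
        channel = Channel()
        recreated += 1
        return None

assert safe_request("ok") == "ok"   # normal round-trip
assert safe_request(None) is None   # times out, channel recreated
print(recreated)  # → 1
```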

Mitigations Already Applied

These reduce the pressure but don't remove the bottleneck:

  • Aggressive caching in JOB_STATUS_CACHE — fewer shelve reads per request
  • Batch registration at startup — one ZMQ round-trip instead of one per historical job
  • PROMETHEUS_LIMIT=30 — expensive prometheus refresh runs ~once/minute instead of every heartbeat
  • _PROMETHEUS_BUSY guard — at most one thread runs prometheus refresh at a time
  • limit parameter on job_status_dict — collector only processes/serializes N jobs instead of all
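Two of these mitigations are easy to sketch in isolation (a minimal sketch; the names, cache keys, and TTL are illustrative, not WBIA's actual values): a TTL cache so repeated status queries skip the collector round-trip, and a non-blocking busy guard so at most one thread runs the expensive refresh.

```python
import threading
import time

_cache = {}            # jobid -> (timestamp, status)
_CACHE_TTL_S = 5.0     # assumed TTL, for illustration only
_busy = threading.Lock()

def get_job_status(jobid, fetch):
    # Serve recent answers from the cache instead of asking the collector.
    now = time.monotonic()
    hit = _cache.get(jobid)
    if hit is not None and now - hit[0] < _CACHE_TTL_S:
        return hit[1]
    status = fetch(jobid)
    _cache[jobid] = (now, status)
    return status

def prometheus_refresh(refresh):
    # Non-blocking acquire: if a refresh is already running, skip this
    # one rather than queueing more threads behind the collector.
    if not _busy.acquire(blocking=False):
        return False
    try:
        refresh()
        return True
    finally:
        _busy.release()

calls = []
def fetch(jobid):
    calls.append(jobid)  # counts real collector round-trips
    return "completed"

get_job_status("job-1", fetch)
get_job_status("job-1", fetch)  # served from cache
print(len(calls))  # → 1
```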

What Would Actually Fix This

The real fix is eliminating the collector as a bottleneck by replacing its storage layer:

  1. Replace shelve with SQLite/PostgreSQL for job metadata and results. Both handle concurrent reads natively.
  2. Query the job database directly from web threads instead of routing through ZMQ to the collector. This eliminates both _collect_lock contention and the single-threaded collector process.
  3. This is essentially what Celery + Redis/PostgreSQL provides out of the box — a standard task queue with a concurrent result backend, monitoring, and horizontal scaling.
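Points 1 and 2 above can be sketched with SQLite in WAL mode (the schema and function names are illustrative, not a proposed WBIA API): each web thread opens its own connection and reads job status directly, with no client lock and no collector round-trip.

```python
import sqlite3
import tempfile

path = tempfile.mktemp(suffix=".sqlite")

# One-time setup; in WBIA this would replace the shelve writes.
db = sqlite3.connect(path)
db.execute("PRAGMA journal_mode=WAL")  # readers proceed while a writer commits
db.execute("CREATE TABLE jobs (jobid TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO jobs VALUES (?, ?)", ("job-1", "completed"))
db.commit()

def get_job_status(jobid):
    # Any web thread can do this directly; SQLite serializes writes
    # internally, and WAL mode allows concurrent readers.
    with sqlite3.connect(path) as conn:
        row = conn.execute(
            "SELECT status FROM jobs WHERE jobid = ?", (jobid,)
        ).fetchone()
    return row[0] if row else None

print(get_job_status("job-1"))  # → completed
```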

The ZMQ + shelve architecture was appropriate when WBIA was a single-user research tool, but it doesn't scale to production multi-threaded serving.

Files

  • wbia/web/job_engine.py:1914 (collector_loop)
  • wbia/web/job_engine.py:886-897 (_collect_request with lock)
  • wbia/web/job_engine.py:873-884 (_engine_request with lock)
  • wbia/web/job_engine.py:681-695 (get_shelve_value, shelve I/O)
