follow-up(cron): long cron runs hold the global serialization lock for the entire run_job() execution #1574

@nesquena-hermes

Long cron runs hold the global serialization lock for the entire run_job() execution

Summary

PR #1571 (which closes #1573) protects cron API endpoints from cross-profile state corruption by wrapping every call in a cron_profile_context() guarded by a process-wide lock (_cron_env_lock). That is the right call for correctness, but the lock is held across the entire run_job() execution inside _run_cron_tracked, which can take minutes for a real agent run.

While a long cron run is executing, every other cron API call (sidebar refresh, manual run on a different job, even a same-profile read) blocks on the lock. The Scheduled Jobs panel will appear frozen until the run completes.

Why this isn't blocking #1573

For correctness, the patch to the environment and the module constants must stay stable for the whole run, because cron.jobs keeps HERMES_DIR / CRON_DIR / JOBS_FILE / OUTPUT_DIR as module-level globals. Releasing the lock during run_job() would let a concurrent request from a different profile clobber those constants mid-run. A correct but frozen UI is better than a responsive but incorrect one.

But on a busy multi-profile instance, "frozen for minutes when any profile runs a cron job" is poor UX.
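The clobbering risk that justifies holding the lock can be shown with a sequential replay of the bad interleaving. This is only an illustration: `JOBS_FILE` is a local stand-in for the cron.jobs global, `patched_jobs_file` is a hypothetical simplification of what cron_profile_context() does, and the `cron/jobs.json` layout is assumed.

```python
from contextlib import contextmanager
from pathlib import Path

# Stand-in for a module-level constant such as cron.jobs.JOBS_FILE.
JOBS_FILE = Path("/default/cron/jobs.json")


@contextmanager
def patched_jobs_file(home):
    """Hypothetical, lock-free version of a profile context: patch, then restore."""
    global JOBS_FILE
    previous = JOBS_FILE
    JOBS_FILE = Path(home) / "cron" / "jobs.json"  # assumed layout
    try:
        yield
    finally:
        JOBS_FILE = previous


def unguarded_interleaving():
    """Replay what happens if profile B's request lands during profile A's run."""
    with patched_jobs_file("/profiles/a"):      # A's long run starts
        with patched_jobs_file("/profiles/b"):  # B's request patches the global
            seen_by_a = JOBS_FILE               # A's run reads the global now
    return seen_by_a
```

Here profile A's run ends up reading profile B's jobs file. Holding the lock for the whole run rules this out, at the cost of the serialization this issue tracks.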

Proposed designs (any one, follow-up)

Option A — Per-thread constant resolution

Replace the cron.jobs module globals with per-thread resolver functions. Each thread reads its own HERMES_HOME from a threading.local(). The lock would then be needed only for the brief env-var swap window, not for the full run_job().

Cost: larger upstream change to hermes-agent/cron/jobs.py. Roughly: replace every JOBS_FILE.exists() style reference with _get_jobs_file().exists() where the resolver reads the TLS home.

Option B — Pass paths into run_job(job, *, hermes_home) explicitly

Make the run-time paths a parameter rather than module state. WebUI's _run_cron_tracked would resolve the home, derive paths, pass into run_job(). No global mutation, no lock needed for the run itself.

Cost: also upstream — changes run_job() signature. Backward-compat shim possible.
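One possible shape for Option B, including the backward-compat shim, is sketched below. The `run_job` body is a placeholder, not the upstream implementation. `CronPaths`, its field names, the directory layout, and the fallback default are hypothetical.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CronPaths:
    """Hypothetical bundle of the paths that are module globals today."""
    hermes_dir: Path
    cron_dir: Path
    jobs_file: Path
    output_dir: Path

    @classmethod
    def from_home(cls, hermes_home):
        home = Path(hermes_home)
        cron = home / "cron"  # assumed layout
        return cls(home, cron, cron / "jobs.json", cron / "output")


def run_job(job, *, hermes_home=None):
    """Placeholder run_job(): paths come from the argument, not module state."""
    if hermes_home is None:
        # Backward-compat shim: callers that pass nothing keep env-based behaviour.
        hermes_home = os.environ.get("HERMES_HOME", str(Path.home() / ".hermes"))
    paths = CronPaths.from_home(hermes_home)
    # A real implementation would execute the job here; this only reports
    # which paths the run would use.
    return {"job_id": job["id"], "output_dir": paths.output_dir}
```

With this shape, _run_cron_tracked would resolve the profile's home once and pass it in, so concurrent runs for different profiles never touch shared state.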

Option C — Subprocess-isolated cron runs

Spawn run_job() in a multiprocessing.Process with HERMES_HOME set in the child env. Parent webui process never mutates its own env. Lock only protects the env-snapshot used to seed the child, not the child's lifetime.

Cost: more work but cleanest isolation. Process startup latency added per cron run (~50-200ms); negligible relative to a multi-minute agent run.
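The isolation idea in Option C can be sketched as follows. Note the substitution: this uses `subprocess.run` rather than `multiprocessing.Process`, because `subprocess.run` accepts an `env` mapping directly, which makes the "parent never mutates its own env" property easy to see. The child program is a stand-in that only echoes what it received, and `run_job_isolated` is a hypothetical name.

```python
import json
import os
import subprocess
import sys

# Stand-in child program: a real one would import the cron module and call
# run_job(). This one only reports the job id and the HERMES_HOME it sees.
_CHILD_CODE = (
    "import json, os, sys\n"
    "job = json.loads(sys.argv[1])\n"
    "print(json.dumps({'job_id': job['id'],"
    " 'hermes_home': os.environ['HERMES_HOME']}))\n"
)


def run_job_isolated(job, hermes_home, timeout=None):
    """Run the stand-in job in a child process with its own HERMES_HOME."""
    child_env = dict(os.environ)              # snapshot; the parent env is untouched
    child_env["HERMES_HOME"] = str(hermes_home)
    completed = subprocess.run(
        [sys.executable, "-c", _CHILD_CODE, json.dumps(job)],
        env=child_env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,                           # raise if the child exits non-zero
    )
    return json.loads(completed.stdout)
```

Only the snapshot of `os.environ` would need protecting from concurrent mutation. The child's lifetime, however long, runs outside any lock.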

Option D — Status quo + UX hint

Keep current design but add a "Cron job running — Scheduled Jobs panel will refresh when it completes" banner in the panel when the lock is held by a worker thread. Doesn't fix the underlying serialization, just makes the wait visible.

Cost: small frontend-only change. No backend correctness improvement.

Recommendation

Option B (explicit path parameter) is the cleanest if upstream hermes-agent is open to it. Option D is the cheapest stopgap if not. Option A is medium-cost but keeps the module-globals shape that other tools may depend on.

Filing this as a tracking issue so #1571's correctness fix can ship today without blocking on this design choice.

Labels: enhancement, tracking, ux
