Long cron runs hold the global serialization lock for the entire run_job() execution
Summary
PR #1571 (which closes #1573) protects cron API endpoints from cross-profile state corruption by wrapping every call in a cron_profile_context() guarded by a process-wide lock (_cron_env_lock). That is the right call for correctness, but the lock stays held across the entire run_job() execution inside _run_cron_tracked, which can take minutes for a real agent run.
While a long cron run is executing, every other cron API call (sidebar refresh, manual run on a different job, even a same-profile read) blocks on the lock. The Scheduled Jobs panel will appear frozen until the run completes.
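To make the blocking concrete, here is a minimal sketch of the pattern described above. The names cron_profile_context and _cron_env_lock come from this issue, but the bodies are assumptions rather than the code from #1571, and fake_run_job and measure_read_latency are hypothetical helpers that exist only for this illustration.

```python
import contextlib
import os
import threading
import time

# Stand-in for the process-wide lock named in this issue (assumed to be a plain Lock).
_cron_env_lock = threading.Lock()


@contextlib.contextmanager
def cron_profile_context(hermes_home):
    """Swap HERMES_HOME for the duration of the block, under the global lock."""
    with _cron_env_lock:
        previous = os.environ.get("HERMES_HOME")
        os.environ["HERMES_HOME"] = hermes_home
        try:
            yield
        finally:
            # Restore whatever was there before, including "not set".
            if previous is None:
                os.environ.pop("HERMES_HOME", None)
            else:
                os.environ["HERMES_HOME"] = previous


def fake_run_job(seconds):
    """Hypothetical stand-in for a long agent run."""
    time.sleep(seconds)


def measure_read_latency(run_seconds=0.3):
    """Return how long a trivial read waits while another profile's run holds the lock."""
    started = threading.Event()

    def long_run():
        with cron_profile_context("/tmp/profile-a"):
            started.set()  # the lock is held from here until the run finishes
            fake_run_job(run_seconds)

    worker = threading.Thread(target=long_run)
    worker.start()
    started.wait()

    t0 = time.monotonic()
    with cron_profile_context("/tmp/profile-b"):
        pass  # a cheap "sidebar refresh" style call
    waited = time.monotonic() - t0

    worker.join()
    return waited
```

Calling measure_read_latency() shows the cheap call waiting for roughly the remaining duration of the run, which is the frozen-panel symptom scaled down from minutes to a fraction of a second.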
Why this isn't blocking #1573
For correctness, the patched environment and module constants must stay stable for the whole run, because cron.jobs keeps HERMES_DIR / CRON_DIR / JOBS_FILE / OUTPUT_DIR as module-level globals. Releasing the lock during run_job() would let a concurrent request from a different profile clobber those constants mid-run. A correct but frozen UI is preferable to a responsive UI that can corrupt state.
But on a busy multi-profile instance, "frozen for minutes when any profile runs a cron job" is poor UX.
Proposed designs (any one, follow-up)
Option A — Per-thread constant resolution
Replace the cron.jobs module globals with per-thread resolver functions. Each thread reads its own HERMES_HOME from a threading.local(). The lock would then be needed only for the brief env-var swap window, not for the full run_job().
Cost: larger upstream change to hermes-agent/cron/jobs.py. Roughly: replace every JOBS_FILE.exists() style reference with _get_jobs_file().exists() where the resolver reads the TLS home.
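A rough sketch of what Option A could look like. _get_jobs_file is the resolver name suggested above; set_thread_home, _get_hermes_home, _get_cron_dir, resolve_in_threads, and the cron/jobs.json layout are all assumptions made for illustration, not the actual hermes-agent structure.

```python
import os
import threading
from pathlib import Path

# Per-thread storage for the active profile home (hypothetical name).
_tls = threading.local()


def set_thread_home(home):
    """Bind a profile home to the calling thread only."""
    _tls.hermes_home = Path(home)


def _get_hermes_home():
    """Resolve the home for this thread, falling back to the process env."""
    home = getattr(_tls, "hermes_home", None)
    if home is not None:
        return home
    env_home = os.environ.get("HERMES_HOME")
    if env_home is None:
        raise RuntimeError("no HERMES_HOME bound to this thread or the environment")
    return Path(env_home)


def _get_cron_dir():
    # Assumed layout: <home>/cron
    return _get_hermes_home() / "cron"


def _get_jobs_file():
    # Assumed layout: <home>/cron/jobs.json
    return _get_cron_dir() / "jobs.json"


def resolve_in_threads(homes):
    """Resolve the jobs file from several threads at once, one profile each."""
    results = {}
    barrier = threading.Barrier(len(homes))

    def worker(home):
        set_thread_home(home)
        barrier.wait()  # every thread has bound its home before any of them resolves
        results[home] = _get_jobs_file()

    threads = [threading.Thread(target=worker, args=(h,)) for h in homes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread sees only its own home even though all of them resolve at the same moment, so no lock is involved in path resolution.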
Option B — Pass paths into run_job(job, *, hermes_home) explicitly
Make the run-time paths a parameter rather than module state. WebUI's _run_cron_tracked would resolve the home, derive the paths, and pass them into run_job(). No global mutation, and no lock needed for the run itself.
Cost: also upstream — changes run_job() signature. Backward-compat shim possible.
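A sketch of Option B with a backward-compat shim. Only the run_job(job, *, hermes_home) shape comes from this issue; the CronPaths dataclass, the directory layout, the env fallback, and the returned log path are hypothetical placeholders for whatever the real run_job() does.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CronPaths:
    """Immutable bundle replacing the four module-level globals (hypothetical)."""
    hermes_dir: Path
    cron_dir: Path
    jobs_file: Path
    output_dir: Path

    @classmethod
    def from_home(cls, hermes_home):
        home = Path(hermes_home)
        cron = home / "cron"  # assumed layout
        return cls(home, cron, cron / "jobs.json", cron / "output")


def run_job(job, *, hermes_home=None):
    """Run a job against explicit paths; no module state is read or written."""
    if hermes_home is None:
        # Shim for legacy callers that still rely on the process environment.
        hermes_home = os.environ.get("HERMES_HOME")
        if hermes_home is None:
            raise RuntimeError("hermes_home not given and HERMES_HOME is unset")
    paths = CronPaths.from_home(hermes_home)
    # Stand-in for the real agent run: report where this job's output would go.
    return paths.output_dir / f"{job['id']}.log"
```

Two calls with different hermes_home values cannot interfere with each other, because every path is derived from the argument inside the call.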
Option C — Subprocess-isolated cron runs
Spawn run_job() in a multiprocessing.Process with HERMES_HOME set in the child env. The parent webui process never mutates its own env. The lock only protects the env snapshot used to seed the child, not the child's lifetime.
Cost: more work but cleanest isolation. Process startup latency added per cron run (~50-200ms); negligible relative to a multi-minute agent run.
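A sketch of the isolation idea behind Option C. Note the swap: it uses subprocess.run with an explicit env= mapping instead of multiprocessing.Process, because that makes the child environment explicit and keeps the example short. run_job_isolated and CHILD_CODE are hypothetical, and the child only echoes HERMES_HOME in place of a real run_job() call.

```python
import os
import subprocess
import sys

# Stand-in for the real job: the child just reports which home it was given.
CHILD_CODE = "import os; print(os.environ['HERMES_HOME'])"


def run_job_isolated(hermes_home, timeout=30.0):
    """Run the job in a child process whose env carries the profile home."""
    # Snapshot the parent env; this copy is the only thing a lock would need to guard.
    child_env = dict(os.environ)
    child_env["HERMES_HOME"] = hermes_home

    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=child_env,        # the parent's os.environ is never modified
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout.strip()
```

The parent's HERMES_HOME is untouched before and after the call, so concurrent API requests in the parent need no lock while the child runs.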
Option D — Status quo + UX hint
Keep current design but add a "Cron job running — Scheduled Jobs panel will refresh when it completes" banner in the panel when the lock is held by a worker thread. Doesn't fix the underlying serialization, just makes the wait visible.
Cost: small frontend-only change. No backend correctness improvement.
Recommendation
Option B (explicit path parameter) is the cleanest if upstream hermes-agent is open to it. Option D is the cheapest stopgap if not. Option A is medium-cost but keeps the module-globals shape that other tools may depend on.
Filing this as a tracking issue so #1571's correctness fix can ship today without blocking on this design choice.
Related