
fix: shorten cron profile lock for manual runs #1746

Closed
Michaelyklam wants to merge 2 commits into nesquena:master from Michaelyklam:fix/issue-1574-cron-profile-lock-v2

Conversation

@Michaelyklam (Contributor) commented May 6, 2026

Thinking Path

  • Issue #1574 ("follow-up(cron): long cron runs hold the global serialization lock for the entire run_job() execution") points at a real runtime bottleneck: manual cron runs with a selected profile hold the parent WebUI profile/env lock for the entire run_job() execution.
  • The previous profile-isolation work correctly serialized short HERMES_HOME mutations, but long cron execution should not keep unrelated cron/profile UI/API work waiting on that lock.
  • Since the cron scheduler still relies on process-global environment/module state, a child process is the safer boundary than trying to share one long-lived parent process environment across concurrent jobs.
  • This PR keeps the parent responsible for run tracking and persistence, but moves the actual cron job body into a profile-pinned subprocess.

What Changed

  • Added a subprocess entrypoint for manual cron execution that runs cron.scheduler.run_job() inside the selected profile context (see the sketch after this list).
  • Updated _run_cron_tracked() so it no longer imports/calls run_job() directly in the parent worker thread.
  • Preserved parent-side run tracking, output persistence, and profile-home metadata writes.
  • Updated/added regression coverage around:
    • the lock not being held while the cron job body runs;
    • subprocess execution using the selected profile home;
    • no silent fallback to an unpinned profile context;
    • existing per-job profile selector/import invariants.
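
For orientation, here is a minimal sketch of that shape. Only run_job, cron_profile_context_for_home, and the result-tuple layout come from this PR; the parameter names are hypothetical, and cron_profile_context_for_home is assumed to act as a context manager:

def _cron_job_subprocess_main(result_queue, profile_home, job_id):
    # Child process: pin the selected profile, then run the cron job body.
    from api.profiles import cron_profile_context_for_home
    from cron.scheduler import run_job

    with cron_profile_context_for_home(profile_home):
        success, output, final_response, error = run_job(job_id)
    # Hand the single result payload back to the parent.
    result_queue.put(("done", (success, output, final_response, error)))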

Why It Matters

  • Long manual cron runs should not block unrelated profile/cron UI operations for the duration of the job.
  • Profile isolation remains explicit: the job body runs in a separate process with its own globals instead of sharing mutable process-wide Hermes environment state with the WebUI server.
  • This narrows the global lock to the short parent-side metadata work it is actually needed for.

Verification

  • Worker targeted tests: /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py tests/test_issue617_cron_profile_selector.py tests/test_cron_run_job_import.py tests/test_scheduled_jobs_profile_isolation.py -q
  • Worker whitespace check: git diff --check
  • GitHub Actions: test (3.11), test (3.12), and test (3.13) are passing on commit e650de2c.
  • No browser/UI media: this is backend cron/runtime behavior with no user-interface change.

Risks / Follow-ups

  • Uses multiprocessing.get_context("fork"), matching the current Linux/self-hosted deployment target. If the WebUI later needs first-class Windows support for this path, this subprocess boundary may need a spawn-compatible entrypoint.
  • The child process returns one result payload through a queue; very large cron output should continue to be handled by the existing cron output persistence path after the parent receives the tuple.
  • The parent still serializes metadata writes; this PR intentionally does not redesign the broader cron storage model.

Model Used

  • OpenAI Codex / GPT-5.5 via Hermes Agent.
  • Tool use: Hermes Kanban worker orchestration, terminal commands, git/GitHub CLI, and file editing tools.

Closes #1574

@Michaelyklam force-pushed the fix/issue-1574-cron-profile-lock-v2 branch from d97c630 to e650de2 on May 6, 2026 16:34
@nesquena-hermes (Collaborator)

Thanks @Michaelyklam — the goal of #1574 (releasing the parent profile lock during long manual cron runs) is exactly right, and the subprocess boundary is the right shape for it. We're going to defer this PR from the v0.51.11 release pass because Opus advisor caught a real correctness blocker in the parent-side queue handling that would manifest on real-world job sizes.

Blocker — multiprocessing.Queue deadlock on output > ~64 KB

The current parent-side flow in api/routes.py (around the new _run_cron_job_in_profile_subprocess helper) is:

process.start()
process.join()                          # ← blocks until child exits
status, *payload = result_queue.get_nowait()

The child puts (status, (success, output, final_response, error)) where output/final_response come from cron.scheduler.run_job() — agent logs and full responses, routinely multi-KB to multi-MB. The Python multiprocessing docs are explicit about this hazard:

A process that has put items in a queue will wait before terminating until all the buffered items are fed by the "feeder" thread to the underlying pipe. You need to make sure that all items which have been put on the queue will eventually be removed before the process is joined.

What happens at runtime:

  1. Child finishes the job, calls result_queue.put((status, ...)).
  2. The Queue feeder thread starts pushing the pickle into the pipe buffer (~64 KB on Linux).
  3. Once the pipe fills, the feeder blocks on os.write() waiting for the parent to read.
  4. Child's atexit calls Queue._finalize_join() which joins the feeder thread — feeder is still blocked.
  5. Parent is in process.join(), hasn't called result_queue.get() yet, so the pipe never drains.
  6. Deadlock. Both sides wait forever.

The current tests don't catch this because fake_run_job returns tiny strings. Real cron jobs cross 64 KB easily — agent transcripts, tool outputs, even moderately verbose logs.
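
A standalone repro of the hazard, independent of any PR code (assumes Linux's ~64 KB pipe buffer):

import multiprocessing as mp

def child(q):
    # The pickle far exceeds the ~64 KB pipe buffer, so the Queue's
    # feeder thread blocks mid-write until the parent reads.
    q.put("x" * 200_000)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    p.join(timeout=5)                     # join-before-drain: times out
    print("child alive:", p.is_alive())   # True; this is the deadlock
    print("payload:", len(q.get()))       # draining releases the child
    p.join()                              # now returns cleanly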

Fix — drain the queue BEFORE joining

import queue  # queue.Empty lives in the stdlib queue module

process.start()
try:
    status, *payload = result_queue.get(timeout=<sensible>)
except queue.Empty:
    # Subprocess exited before producing a result, or hung. Recover.
    ...
process.join()
result_queue.close()
result_queue.join_thread()

The get(timeout=...) reads the pipe-buffered tuple out, the feeder thread can finish, the child's atexit unblocks, and process.join() then returns cleanly.

Add a regression test that exercises a >100 KB stdout payload (e.g. output = "x" * 200_000) so the pattern can't regress.
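
Something along these lines would do it. This is a sketch of the pattern, not the PR's actual test file, and it assumes the pytest-timeout plugin for the watchdog:

import multiprocessing as mp
import pytest

def _job(q):
    q.put(("ok", "x" * 200_000))  # well past the ~64 KB pipe buffer

@pytest.mark.timeout(30)  # fails fast instead of hanging CI on a regression
def test_large_result_drains_before_join():
    q = mp.Queue()
    p = mp.Process(target=_job, args=(q,))
    p.start()
    status, output = q.get(timeout=10)  # drain BEFORE join
    p.join()
    q.close()
    q.join_thread()
    assert status == "ok"
    assert len(output) == 200_000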

Recommended also — switch from fork to spawn

multiprocessing.get_context("fork") from a multi-threaded process (which the WebUI is — HTTP server threads + cron worker thread + others) is a well-known footgun:

  • Other threads' lock state (logging, importlib's import lock, internal stdlib locks) is inherited as "held" with no releaser thread in the child.
  • The child does new imports (from cron.scheduler import run_job, from api.profiles import cron_profile_context_for_home). If any other parent thread is mid-import at fork time, the child's import deadlocks.
  • Python 3.12+ already raises a DeprecationWarning for fork-from-thread.

Two ways to fix:

Option A — use spawn:

import multiprocessing as mp

ctx = mp.get_context("spawn")
process = ctx.Process(target=_cron_job_subprocess_main, args=(...))

Target and args are picklable (top-level function, dict, Path, mp.Queue), so spawn works with no other changes. Slightly slower startup (~100ms cold cache), but immune to the fork-thread issues.
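
A quick standalone sanity check of that claim (hedged: _demo_target stands in for the real entrypoint; the argument types mirror the ones listed above):

import multiprocessing as mp
from pathlib import Path

def _demo_target(q, home, spec):
    # spawn re-imports this module in a fresh child, so the target must
    # be a top-level, picklable function, which is the constraint above.
    q.put(("ok", str(home), spec))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_demo_target, args=(q, Path("/tmp"), {"job": "demo"}))
    p.start()
    print(q.get(timeout=10))  # drain before join, per the fix above
    p.join()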

Option B — pre-import in the parent:

Top of api/routes.py:

from cron.scheduler import run_job as _cron_run_job_preimport  # noqa
from api.profiles import cron_profile_context_for_home as _cron_ctx_preimport  # noqa

Now the child re-imports become attribute lookups against already-loaded modules; no import lock acquisition needed.

I'd lean toward Option A for a clean break; it's only an extra get_context call.

What stays clean

The rest of the PR is in good shape — the subprocess boundary is correctly drawn, the parent retains run tracking + persistence + profile-home metadata writes, the _cron_env_lock is correctly NOT held during subprocess execution (Q2 of Opus brief verified), and the per-job profile selector / import invariants still pass.

Once the queue-drain fix lands and there's a regression test for >100 KB output, this is ready to ship. We'll pull it into the v0.51.12 batch as soon as it's updated.

(For context: #1747, #1748, #1750 are shipping in v0.51.11 right now.)

@Michaelyklam (Contributor, Author)

Addressed the queue-drain blocker from #1746 (comment) in follow-up commit 65ff8a9.

What changed:

  • Parent now reads result_queue.get(timeout=...) before process.join(), so large output / final_response payloads drain before the child feeder thread can deadlock at process exit.
  • Added recovery paths for no-result / hung children; the parent then closes and joins the Queue's feeder thread (pattern sketched after this list).
  • Added a regression test using a fake cron package returning 200,000-char output and final response; the old join-before-drain path deadlocked and the new path returns cleanly.
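
In pattern form, the recovery flow looks roughly like this (hypothetical helper and names, not the commit's literal code):

import multiprocessing as mp
import queue

def run_with_recovery(target, args, timeout):
    # Drain-then-join, with recovery for a child that dies or hangs
    # before producing a result.
    q = mp.Queue()
    p = mp.Process(target=target, args=(q, *args))
    p.start()
    try:
        result = q.get(timeout=timeout)  # drain BEFORE join
    except queue.Empty:
        if p.is_alive():                 # hung child: reclaim it
            p.terminate()
        result = ("error", (False, "", "", "cron subprocess produced no result"))
    p.join()
    q.close()
    q.join_thread()
    return result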

Verification:

  • RED: tests/test_issue1574_cron_profile_lock.py::test_manual_cron_subprocess_drains_large_result_before_join failed on the old code with the expected deadlock assertion.
  • env -u HERMES_CONFIG_PATH /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py -q → 3 passed.
  • env -u HERMES_CONFIG_PATH /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py tests/test_scheduled_jobs_profile_isolation.py tests/test_cron_run_job_import.py tests/test_issue617_cron_profile_selector.py tests/test_cron_manual_run_persistence.py -q → 21 passed.
  • git diff --check → clean.
  • env -u HERMES_CONFIG_PATH -u HERMES_WEBUI_HOST /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/ -q → 4594 passed, 2 skipped, 3 xpassed, 1 warning, 8 subtests passed.
  • GitHub Actions for commit 65ff8a9: test (3.11), test (3.12), and test (3.13) all passed.

I kept the multiprocessing context unchanged in this narrow follow-up rather than combining the blocker fix with a spawn migration; the deadlock is fixed by the queue-order change, and a spawn switch would be a separate bootstrap/import-behavior change around the profile-pinned cron module setup.

@nesquena-hermes (Collaborator)

Thanks @Michaelyklam — this shipped in v0.51.12 (commit 34f2243) as part of a 3-PR full-sweep batch release. Stage rebased your branch onto current master, ran the full pre-release gate (4632 pytest, browser tests, Opus advisor verdict SHIP), and merged via release PR #1755.

GitHub didn't auto-close because the merge commit only references the squash-merged stage branch, not your fork's commit directly — closing manually for hygiene.

Live now on https://get-hermes.ai/ and on existing installs after git pull + restart.

Release notes: https://github.com/nesquena/hermes-webui/releases/tag/v0.51.12

pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 6, 2026
…-color meta, quote-strip) + test-isolation hardening (nesquena#1746 deferred)

Constituent PRs:
- nesquena#1747 (@Michaelyklam) — wait for model catalog before opening picker (closes nesquena#1743)
- nesquena#1748 (@nesquena-hermes) — theme-color meta tag for native chrome bridges (nesquena APPROVED)
- nesquena#1750 (@nesquena-hermes) — strip surrounding quotes from Add Space path (nesquena APPROVED)

Deferred to v0.51.12:
- nesquena#1746 — Opus caught multiprocessing.Queue deadlock pattern (parent
  process.join() before queue drain hangs on output >64KB pipe buffer).
  Deferral comment with two specific fix options posted on PR.

Plus 1 in-stage absorbed test-isolation fix:
- test_issue1426 + test_issue1680: skip on detected prefix pollution
  (prong 2 of test-isolation-flake-recipe). Failure rate ~25% in full
  suite from sys.modules pollution; standalone always passes.

Tests: 4596 → 4622 passing (+26). 0 regressions. Stably green.

Pre-release verification:
- 3 PRs CI-green individually + rebased onto master
- pytest 4622 passed, 0 failed
- node -c clean on static/ui.js + static/boot.js
- 11/11 browser API endpoints PASS
- Opus advisor: SHIP nesquena#1747/nesquena#1748/nesquena#1750, MUST-FIX block on nesquena#1746

Closes nesquena#1743.
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 6, 2026
… custom provider routing + session runtime invariants)

Constituent PRs:
- nesquena#1746 (@Michaelyklam) — shorten cron profile lock for manual runs (closes nesquena#1574, RETURNS from v0.51.11 deferral with queue-drain blocker fixed)
- nesquena#1752 (@Michaelyklam) — route custom provider models dict selections (slice of nesquena#1240 umbrella)
- nesquena#1753 (@Michaelyklam) — guard session-owned runtime invariants (refs nesquena#1694)

nesquena#1746 v2 fix: result_queue.get(timeout=...) BEFORE process.join()
(drain-then-join), with queue.Empty recovery + 200,000-char regression test.
Opus stage-306 verified the fix correct + complete; the prior fork→spawn
SHOULD-FIX filed as follow-up issue nesquena#1754 (separate architectural change).

Tests: 4622 → 4632 passing (+10). 0 regressions. Stably green on first try.

Pre-release verification:
- All 3 PRs CI-green individually + rebased onto master with NO conflicts
  (disjoint files: api/config.py + static/messages.js + api/routes.py)
- pytest 4632 passed, 0 failed
- node -c clean on static/messages.js
- 11/11 browser API endpoints PASS
- Opus advisor: SHIP all 3, 0 MUST-FIX, 1 SHOULD-FIX filed as nesquena#1754

Closes nesquena#1574.