
fix: shorten cron profile lock for manual runs #1746

Closed
Michaelyklam wants to merge 2 commits into nesquena:master from Michaelyklam:fix/issue-1574-cron-profile-lock-v2

Conversation

@Michaelyklam (Contributor) commented May 6, 2026

Thinking Path

  • Issue #1574 ("follow-up(cron): long cron runs hold the global serialization lock for the entire run_job() execution") points at a real runtime bottleneck: manual cron runs with a selected profile hold the parent WebUI profile/env lock for the entire run_job() execution.
  • The previous profile-isolation work correctly serialized short HERMES_HOME mutations, but long cron execution should not keep unrelated cron/profile UI/API work waiting on that lock.
  • Since the cron scheduler still relies on process-global environment/module state, a child process is the safer boundary than trying to share one long-lived parent process environment across concurrent jobs.
  • This PR keeps the parent responsible for run tracking and persistence, but moves the actual cron job body into a profile-pinned subprocess.

What Changed

  • Added a subprocess entrypoint for manual cron execution that runs cron.scheduler.run_job() inside the selected profile context (see the sketch after this list).
  • Updated _run_cron_tracked() so it no longer imports/calls run_job() directly in the parent worker thread.
  • Preserved parent-side run tracking, output persistence, and profile-home metadata writes.
  • Updated/added regression coverage around:
    • the lock not being held while the cron job body runs;
    • subprocess execution using the selected profile home;
    • no silent fallback to an unpinned profile context;
    • existing per-job profile selector/import invariants.
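
For orientation, here is a minimal sketch of that shape. Only run_job, cron_profile_context_for_home, and the result-tuple layout come from this PR; the parameter names are hypothetical, and cron_profile_context_for_home is assumed to act as a context manager:

def _cron_job_subprocess_main(result_queue, profile_home, job_id):
    # Child process: pin the selected profile, then run the cron job body.
    from api.profiles import cron_profile_context_for_home
    from cron.scheduler import run_job

    with cron_profile_context_for_home(profile_home):
        success, output, final_response, error = run_job(job_id)
    # Hand the single result payload back to the parent.
    result_queue.put(("done", (success, output, final_response, error)))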

Why It Matters

  • Long manual cron runs should not block unrelated profile/cron UI operations for the duration of the job.
  • Profile isolation remains explicit: the job body runs in a separate process with its own globals instead of sharing mutable process-wide Hermes environment state with the WebUI server.
  • This narrows the global lock to the short parent-side metadata work it is actually needed for.

Verification

  • Worker targeted tests: /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py tests/test_issue617_cron_profile_selector.py tests/test_cron_run_job_import.py tests/test_scheduled_jobs_profile_isolation.py -q
  • Worker whitespace check: git diff --check
  • GitHub Actions: test (3.11), test (3.12), and test (3.13) are passing on commit e650de2c.
  • No browser/UI media: this is backend cron/runtime behavior with no user-interface change.

Risks / Follow-ups

  • Uses multiprocessing.get_context("fork"), matching the current Linux/self-hosted deployment target. If the WebUI later needs first-class Windows support for this path, this subprocess boundary may need a spawn-compatible entrypoint.
  • The child process returns one result payload through a queue; very large cron output should continue to be handled by the existing cron output persistence path after the parent receives the tuple.
  • The parent still serializes metadata writes; this PR intentionally does not redesign the broader cron storage model.

Model Used

  • OpenAI Codex / GPT-5.5 via Hermes Agent.
  • Tool use: Hermes Kanban worker orchestration, terminal commands, git/GitHub CLI, and file editing tools.

Closes #1574

@Michaelyklam force-pushed the fix/issue-1574-cron-profile-lock-v2 branch from d97c630 to e650de2 on May 6, 2026 16:34
@nesquena-hermes (Collaborator)

Thanks @Michaelyklam — the goal of #1574 (releasing the parent profile lock during long manual cron runs) is exactly right, and the subprocess boundary is the right shape for it. We're going to defer this PR from the v0.51.11 release pass because Opus advisor caught a real correctness blocker in the parent-side queue handling that would manifest on real-world job sizes.

Blocker — multiprocessing.Queue deadlock on output > ~64 KB

The current parent-side flow in api/routes.py (around the new _run_cron_job_in_profile_subprocess helper) is:

process.start()
process.join()                          # ← blocks until child exits
status, *payload = result_queue.get_nowait()

The child puts (status, (success, output, final_response, error)) where output/final_response come from cron.scheduler.run_job() — agent logs and full responses, routinely multi-KB to multi-MB. The Python multiprocessing docs are explicit about this hazard:

A process that has put items in a queue will wait before terminating until all the buffered items are fed by the "feeder" thread to the underlying pipe. You need to make sure that all items which have been put on the queue will eventually be removed before the process is joined.

What happens at runtime:

  1. Child finishes the job, calls result_queue.put((status, ...)).
  2. The Queue feeder thread starts pushing the pickle into the pipe buffer (~64 KB on Linux).
  3. Once the pipe fills, the feeder blocks on os.write() waiting for the parent to read.
  4. Child's atexit calls Queue._finalize_join() which joins the feeder thread — feeder is still blocked.
  5. Parent is in process.join(), hasn't called result_queue.get() yet, so the pipe never drains.
  6. Deadlock. Both sides wait forever.

The current tests don't catch this because fake_run_job returns tiny strings. Real cron jobs cross 64 KB easily — agent transcripts, tool outputs, even moderately verbose logs.
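
A standalone repro of the hazard, independent of any PR code (assumes Linux's ~64 KB pipe buffer):

import multiprocessing as mp

def child(q):
    # The pickle far exceeds the ~64 KB pipe buffer, so the Queue's
    # feeder thread blocks mid-write until the parent reads.
    q.put("x" * 200_000)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    p.join(timeout=5)                     # join-before-drain: times out
    print("child alive:", p.is_alive())   # True; this is the deadlock
    print("payload:", len(q.get()))       # draining releases the child
    p.join()                              # now returns cleanly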

Fix — drain the queue BEFORE joining

import queue  # queue.Empty lives in the stdlib queue module

process.start()
try:
    status, *payload = result_queue.get(timeout=<sensible>)
except queue.Empty:
    # Subprocess exited before producing a result, or hung. Recover.
    ...
process.join()
result_queue.close()
result_queue.join_thread()

The get(timeout=...) reads the pipe-buffered tuple out, the feeder thread can finish, the child's atexit unblocks, and process.join() then returns cleanly.

Add a regression test that exercises a >100 KB stdout payload (e.g. output = "x" * 200_000) so the pattern can't regress.
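
Something along these lines would do it. This is a sketch of the pattern, not the PR's actual test file, and it assumes the pytest-timeout plugin for the watchdog:

import multiprocessing as mp
import pytest

def _job(q):
    q.put(("ok", "x" * 200_000))  # well past the ~64 KB pipe buffer

@pytest.mark.timeout(30)  # fails fast instead of hanging CI on a regression
def test_large_result_drains_before_join():
    q = mp.Queue()
    p = mp.Process(target=_job, args=(q,))
    p.start()
    status, output = q.get(timeout=10)  # drain BEFORE join
    p.join()
    q.close()
    q.join_thread()
    assert status == "ok"
    assert len(output) == 200_000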

Recommended also — switch from fork to spawn

multiprocessing.get_context("fork") from a multi-threaded process (which the WebUI is — HTTP server threads + cron worker thread + others) is a well-known footgun:

  • Other threads' lock state (logging, importlib's import lock, internal stdlib locks) is inherited as "held" with no releaser thread in the child.
  • The child does new imports (from cron.scheduler import run_job, from api.profiles import cron_profile_context_for_home). If any other parent thread is mid-import at fork time, the child's import deadlocks.
  • Python 3.12+ already raises a DeprecationWarning for fork-from-thread.

Two ways to fix:

Option A — use spawn:

import multiprocessing as mp

ctx = mp.get_context("spawn")
process = ctx.Process(target=_cron_job_subprocess_main, args=(...))

Target and args are picklable (top-level function, dict, Path, mp.Queue), so spawn works with no other changes. Slightly slower startup (~100ms cold cache), but immune to the fork-thread issues.
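
A quick standalone sanity check of that claim (hedged: _demo_target stands in for the real entrypoint; the argument types mirror the ones listed above):

import multiprocessing as mp
from pathlib import Path

def _demo_target(q, home, spec):
    # spawn re-imports this module in a fresh child, so the target must
    # be a top-level, picklable function, which is the constraint above.
    q.put(("ok", str(home), spec))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_demo_target, args=(q, Path("/tmp"), {"job": "demo"}))
    p.start()
    print(q.get(timeout=10))  # drain before join, per the fix above
    p.join()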

Option B — pre-import in the parent:

Top of api/routes.py:

from cron.scheduler import run_job as _cron_run_job_preimport  # noqa
from api.profiles import cron_profile_context_for_home as _cron_ctx_preimport  # noqa

Now the child re-imports become attribute lookups against already-loaded modules; no import lock acquisition needed.

I'd lean toward Option A for a clean break; it's only an extra get_context call.

What stays clean

The rest of the PR is in good shape — the subprocess boundary is correctly drawn, the parent retains run tracking + persistence + profile-home metadata writes, the _cron_env_lock is correctly NOT held during subprocess execution (Q2 of Opus brief verified), and the per-job profile selector / import invariants still pass.

Once the queue-drain fix lands and there's a regression test for >100 KB output, this is ready to ship. We'll pull it into the v0.51.12 batch as soon as it's updated.

(For context: #1747, #1748, #1750 are shipping in v0.51.11 right now.)

@Michaelyklam (Contributor, Author)

Addressed the queue-drain blocker from #1746 (comment) in follow-up commit 65ff8a9.

What changed:

  • Parent now reads result_queue.get(timeout=...) before process.join(), so large output / final_response payloads drain before the child feeder thread can deadlock at process exit.
  • Added recovery paths for no-result / hung children; the parent then closes and joins the Queue's feeder thread (pattern sketched after this list).
  • Added a regression test using a fake cron package returning 200,000-char output and final response; the old join-before-drain path deadlocked and the new path returns cleanly.
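
In pattern form, the recovery flow looks roughly like this (hypothetical helper and names, not the commit's literal code):

import multiprocessing as mp
import queue

def run_with_recovery(target, args, timeout):
    # Drain-then-join, with recovery for a child that dies or hangs
    # before producing a result.
    q = mp.Queue()
    p = mp.Process(target=target, args=(q, *args))
    p.start()
    try:
        result = q.get(timeout=timeout)  # drain BEFORE join
    except queue.Empty:
        if p.is_alive():                 # hung child: reclaim it
            p.terminate()
        result = ("error", (False, "", "", "cron subprocess produced no result"))
    p.join()
    q.close()
    q.join_thread()
    return result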

Verification:

  • RED: tests/test_issue1574_cron_profile_lock.py::test_manual_cron_subprocess_drains_large_result_before_join failed on the old code with the expected deadlock assertion.
  • env -u HERMES_CONFIG_PATH /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py -q → 3 passed.
  • env -u HERMES_CONFIG_PATH /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1574_cron_profile_lock.py tests/test_scheduled_jobs_profile_isolation.py tests/test_cron_run_job_import.py tests/test_issue617_cron_profile_selector.py tests/test_cron_manual_run_persistence.py -q → 21 passed.
  • git diff --check → clean.
  • env -u HERMES_CONFIG_PATH -u HERMES_WEBUI_HOST /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/ -q → 4594 passed, 2 skipped, 3 xpassed, 1 warning, 8 subtests passed.
  • GitHub Actions for commit 65ff8a9: test (3.11), test (3.12), and test (3.13) all passed.

I kept the multiprocessing context unchanged in this narrow follow-up rather than combining the blocker fix with a spawn migration; the deadlock is fixed by the queue-order change, and a spawn switch would be a separate bootstrap/import-behavior change around the profile-pinned cron module setup.

@nesquena-hermes (Collaborator)

Thanks @Michaelyklam — this shipped in v0.51.12 (commit 34f2243) as part of a 3-PR full-sweep batch release. Stage rebased your branch onto current master, ran the full pre-release gate (4632 pytest, browser tests, Opus advisor verdict SHIP), and merged via release PR #1755.

GitHub didn't auto-close because the merge commit only references the squash-merged stage branch, not your fork's commit directly — closing manually for hygiene.

Live now on https://get-hermes.ai/ and on existing installs after git pull + restart.

Release notes: https://github.com/nesquena/hermes-webui/releases/tag/v0.51.12

pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 6, 2026
…-color meta, quote-strip) + test-isolation hardening (nesquena#1746 deferred)

Constituent PRs:
- nesquena#1747 (@Michaelyklam) — wait for model catalog before opening picker (closes nesquena#1743)
- nesquena#1748 (@nesquena-hermes) — theme-color meta tag for native chrome bridges (nesquena APPROVED)
- nesquena#1750 (@nesquena-hermes) — strip surrounding quotes from Add Space path (nesquena APPROVED)

Deferred to v0.51.12:
- nesquena#1746 — Opus caught multiprocessing.Queue deadlock pattern (parent
  process.join() before queue drain hangs on output >64KB pipe buffer).
  Deferral comment with two specific fix options posted on PR.

Plus 1 in-stage absorbed test-isolation fix:
- test_issue1426 + test_issue1680: skip on detected prefix pollution
  (prong 2 of test-isolation-flake-recipe). Failure rate ~25% in full
  suite from sys.modules pollution; standalone always passes.

Tests: 4596 → 4622 passing (+26). 0 regressions. Stably green.

Pre-release verification:
- 3 PRs CI-green individually + rebased onto master
- pytest 4622 passed, 0 failed
- node -c clean on static/ui.js + static/boot.js
- 11/11 browser API endpoints PASS
- Opus advisor: SHIP nesquena#1747/nesquena#1748/nesquena#1750, MUST-FIX block on nesquena#1746

Closes nesquena#1743.
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 6, 2026
… custom provider routing + session runtime invariants)

Constituent PRs:
- nesquena#1746 (@Michaelyklam) — shorten cron profile lock for manual runs (closes nesquena#1574, RETURNS from v0.51.11 deferral with queue-drain blocker fixed)
- nesquena#1752 (@Michaelyklam) — route custom provider models dict selections (slice of nesquena#1240 umbrella)
- nesquena#1753 (@Michaelyklam) — guard session-owned runtime invariants (refs nesquena#1694)

nesquena#1746 v2 fix: result_queue.get(timeout=...) BEFORE process.join()
(drain-then-join), with queue.Empty recovery + 200,000-char regression test.
Opus stage-306 verified the fix correct + complete; the prior fork→spawn
SHOULD-FIX filed as follow-up issue nesquena#1754 (separate architectural change).

Tests: 4622 → 4632 passing (+10). 0 regressions. Stably green on first try.

Pre-release verification:
- All 3 PRs CI-green individually + rebased onto master with NO conflicts
  (disjoint files: api/config.py + static/messages.js + api/routes.py)
- pytest 4632 passed, 0 failed
- node -c clean on static/messages.js
- 11/11 browser API endpoints PASS
- Opus advisor: SHIP all 3, 0 MUST-FIX, 1 SHOULD-FIX filed as nesquena#1754

Closes nesquena#1574.