Crash recovery: WAL design + integration for in-flight assistant tokens
Background
When hermes-webui is killed (OOM, SIGKILL, container restart, kernel panic) during an active streaming agent run, tokens streamed since the last checkpoint are lost. The session JSON reverts to showing only the user's pending message; the assistant reply that was mid-stream simply vanishes.
The diagnosis from PR #1353 (@JKJameson) is correct on both counts:
_checkpoint_activity is incremented only in on_tool() — so a pure-text stream with zero tool calls is never flagged for periodic checkpointing.
STREAM_PARTIAL_TEXT[stream_id] and STREAM_REASONING_TEXT[stream_id] accumulate the in-flight tokens in memory (introduced in #893 and #1361), but they live only in process memory and are lost on crash.
But there is a further wrinkle: the _checkpoint_activity increment alone wouldn't actually fix the bug. s.messages is never mutated during streaming — it only updates at api/streaming.py:2141 after agent.run_conversation() returns. The periodic 15s checkpoint thread saves s.pending_user_message + s.active_stream_id + s.pending_started_at, all of which are already on disk from the pre-stream s.save() at line 2092. So even with _checkpoint_activity bumping every token, the checkpoint thread persists nothing new during a text-only stream.
The real fix is shadowing the in-memory token accumulators (STREAM_PARTIAL_TEXT, STREAM_REASONING_TEXT) to a sidecar log on disk, then replaying that log on session reload after a crash.
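To make the shape concrete, here is a minimal sketch of the sidecar-log idea. The names (wal_path, append_event, replay) and the storage location are illustrative, not the final api/wal.py surface; the event fields mirror the wire format salvaged from #1353 below.

```python
import json, time
from pathlib import Path

WAL_DIR = Path.home() / ".hermes" / "wal"          # illustrative location

def wal_path(stream_id: str) -> Path:
    return WAL_DIR / f"{stream_id}.jsonl"

def append_event(stream_id: str, type_: str, payload) -> None:
    # One JSON object per line, append-only. No fsync: the durability target
    # is process death, not power loss (see the constraints below).
    WAL_DIR.mkdir(parents=True, exist_ok=True)
    event = {"type": type_, "timestamp": time.time(), "payload": payload}
    with wal_path(stream_id).open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def replay(stream_id: str) -> dict:
    # Reassemble the in-flight assistant reply from the event log.
    text, reasoning = [], []
    for line in wal_path(stream_id).read_text(encoding="utf-8").splitlines():
        ev = json.loads(line)
        if ev["type"] == "token":
            text.append(ev["payload"])
        elif ev["type"] == "reasoning":
            reasoning.append(ev["payload"])
    return {"content": "".join(text), "reasoning": "".join(reasoning)}
```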
User impact
This bug class has been hit by users:
#1217 — "Webui lost my prompts" (closed; the root cause there was different but related, context compaction, but the user-facing symptom of a vanished message is the same one we're addressing here).
The streaming-loss path is silent: there is no error, no toast, no "session was interrupted" indicator. The user simply sends a message, sees the agent start to type, then on next reload the message and reply are gone.
Design constraints
From the closer reading after #1353:
Durability target is "after-crash," not "after-power-loss." Per-event fsync() is overkill — the kernel's page cache will preserve append-only writes through any process death short of an unsynchronized power outage, which is not in scope.
Cannot regress streaming throughput. A model emitting 60 tokens/sec must not see measurable latency increase from the durability layer. This rules out synchronous fsync() per token on the streaming hot path.
Cannot regress disk I/O on shared storage. Some users mount ~/.hermes on NFS or other network filesystems. The durability layer must batch writes — at most one write() syscall per N tokens, where N is large enough that NFS write amplification is bounded.
Must work without a dedicated background thread per session. Spawning a writer thread per active stream is expensive at scale (e.g. multi-tab cron-driven sessions); a single shared writer thread with a queue.Queue of (session_id, event) tuples is the right shape (see the sketch after this list).
Must be opt-in until perf is measured. Ship behind HERMES_WEBUI_WAL_ENABLED=1 env var; default off; flip the default after at least one full release cycle of canary feedback from users running on slow disks / network homedirs.
Recovery side must be cheap on every session load. A Path.exists() check per get_session() is fine; reading and parsing a JSONL file on every session load is not. The existence check should short-circuit the no-WAL fast path within microseconds.
Must compose with existing _repair_stale_pending / _apply_core_sync_or_error_marker. The existing repair flow already handles the no-token-recovered case (puts up an "agent was responding when interrupted" marker). The WAL layer fills in token-level recovery on top, only when a WAL file is present.
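A minimal sketch of the writer shape these constraints point at: one shared daemon thread, a queue of events, batched appends, and the env-var gate. Directory layout and batching details are illustrative, not measured or final.

```python
import json, os, queue, threading
from pathlib import Path

WAL_ENABLED = os.environ.get("HERMES_WEBUI_WAL_ENABLED") == "1"
_QUEUE: "queue.Queue[tuple[str, dict]]" = queue.Queue()
_WRITER_STARTED = threading.Event()
_WAL_DIR = Path.home() / ".hermes" / "wal"          # illustrative location

def put(type_: str, stream_id: str, payload) -> None:
    # Hot-path cost when enabled: build one dict and enqueue it, no disk I/O
    # on the streaming thread. (The constraint above words the tuple as
    # (session_id, event); either key works as long as wal_path uses the same one.)
    if WAL_ENABLED:
        _QUEUE.put((stream_id, {"type": type_, "payload": payload}))

def ensure_writer() -> None:
    # One shared daemon thread for all sessions, started lazily on the first
    # stream; a real version would guard the start with a lock.
    if WAL_ENABLED and not _WRITER_STARTED.is_set():
        _WRITER_STARTED.set()
        threading.Thread(target=_writer_loop, daemon=True).start()

def _writer_loop() -> None:
    while True:
        try:
            stream_id, event = _QUEUE.get(timeout=3.0)
        except queue.Empty:
            continue
        # Drain everything queued behind it so each wakeup issues at most one
        # write() per stream, which bounds NFS write amplification.
        batches = {stream_id: [event]}
        while True:
            try:
                sid, ev = _QUEUE.get_nowait()
            except queue.Empty:
                break
            batches.setdefault(sid, []).append(ev)
        _WAL_DIR.mkdir(parents=True, exist_ok=True)
        for sid, events in batches.items():
            with (_WAL_DIR / f"{sid}.jsonl").open("a", encoding="utf-8") as f:
                f.write("".join(json.dumps(e) + "\n" for e in events))
```

Blocking on get(timeout=3.0) means the thread wakes at least every 3 s even when idle, which is what gives the ≤3 s recovery bound without any _should_flush() threshold logic.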
Salvage map from #1353
Direct salvage candidates with Co-authored-by: Josh Jameson <…> attribution:
The per-event wire format — {type: "token"|"reasoning"|"tool"|"tool_result"|"start"|"end"|"apperror", timestamp, ...payload}. Sound shape, no rework needed.
api/wal.py:replay_wal() — the event accumulation logic (concatenate token text, concatenate reasoning, list tool calls + results, surface had_error). Direct port.
api/wal.py:read_wal() + delete_wal() — straightforward primitives.
api/models.py:_replay_wal_recovery() gating — the four-condition guard (active_stream_id set + stream not in STREAMS + last message is user role + WAL file exists). Sound logic.
_wal_recovered: True marker + UI "Recovered" badge — useful user-facing affordance.
Build-fresh items:
No per-session in-memory buffer. Skip the _write_buffer + _token_counts + _last_flush_time dicts entirely. A single shared writer thread reading from a Queue gives the same ≤3s recovery bound with simpler semantics.
No _should_flush() / _WAL_FLUSH_TOKENS / _WAL_FLUSH_INTERVAL machinery. The shared writer thread drains the queue with queue.Queue.get(timeout=3.0) — every event flushes within min(N, 3s) automatically. No threshold tuning, no dead-code flush helpers.
Wire into the existing STREAM_PARTIAL_TEXT[stream_id] callsite at api/streaming.py:1643 rather than duplicating accumulation in parallel. Pattern (an illustrative sketch follows this list):
No _checkpoint_activity bump in on_token. As noted above, that change alone does nothing useful — the checkpoint thread persists state that's already on disk. The WAL is the load-bearing change; the existing checkpoint thread can stay as it is (it serves a different role, persisting bookkeeping in the success path).
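An illustrative sketch of the on_token wiring; _wal_put is a stand-in for the shared-writer put() above, and the real hook lives in api/streaming.py rather than here.

```python
# Illustrative only; the actual callsite is the existing accumulation in
# api/streaming.py, and _wal_put stands in for the shared-writer put() above.
STREAM_PARTIAL_TEXT: dict[str, str] = {}   # seeded to '' at stream start (line 1442)

def _wal_put(type_: str, stream_id: str, payload) -> None:
    ...   # enqueue only; no disk I/O on the streaming hot path

def on_token(stream_id: str, text) -> None:
    # The in-memory accumulator at api/streaming.py:1643 stays the single
    # source of truth for the live stream ...
    STREAM_PARTIAL_TEXT[stream_id] += str(text)
    # ... and the WAL shadows the same token at the same callsite, so there is
    # no second accumulation path to keep in sync.
    _wal_put("token", stream_id, str(text))
```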
Implementation sketch (for whoever picks this up)
Phase 1 — primitive (no streaming integration yet)
api/wal.py on its own: the salvage pieces listed above (the event wire format, replay_wal(), read_wal(), delete_wal()) plus the shared writer thread and its put() entry point; no callers in streaming code yet.
Phase 2 — streaming integration
5 call sites in api/streaming.py:
Stream start (after STREAM_PARTIAL_TEXT[stream_id] = '' at line 1442): _wal.put('start', stream_id, {'session_id': session_id}) + _wal.ensure_writer().
on_token (after STREAM_PARTIAL_TEXT[stream_id] += str(text) at line 1643): _wal.put('token', stream_id, str(text)).
on_reasoning (after STREAM_REASONING_TEXT[stream_id] += str(text) at line 1656): _wal.put('reasoning', stream_id, str(text)).
on_tool (start + completed paths): _wal.put('tool', stream_id, {...}), _wal.put('tool_result', stream_id, {...}).
Stream end success path (just before the STREAM_PARTIAL_TEXT.pop(stream_id, None) cleanup at line 2778): _wal.put('end', stream_id, {'clean': True}) + _wal.delete_wal_after_drain(stream_id) (helper that waits for the writer thread to drain its queue, then deletes the WAL file).
Critical: the cleanup delete_wal() must only run on the clean success path, not in the finally block. Cancel/error paths leave the WAL for replay.
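A sketch of that control flow, with placeholder names (run_stream, agent_tokens) rather than the actual streaming.py structure:

```python
STREAM_PARTIAL_TEXT: dict[str, str] = {}   # stand-in for the real module dict

def run_stream(stream_id: str, session_id: str, agent_tokens, _wal) -> None:
    # _wal is the facade sketched earlier (put / delete_wal_after_drain).
    _wal.put("start", stream_id, {"session_id": session_id})
    STREAM_PARTIAL_TEXT[stream_id] = ""
    try:
        for token in agent_tokens():
            STREAM_PARTIAL_TEXT[stream_id] += token
            _wal.put("token", stream_id, token)
        # Clean completion only: the reply is about to be persisted into
        # s.messages, so the sidecar log has served its purpose.
        _wal.put("end", stream_id, {"clean": True})
        _wal.delete_wal_after_drain(stream_id)
    finally:
        # Cancel, error, and crash paths reach this cleanup WITHOUT deleting
        # the WAL, leaving it on disk for replay at the next session load.
        STREAM_PARTIAL_TEXT.pop(stream_id, None)
```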
Phase 3 — recovery integration
api/models.py:_replay_wal_recovery() ported from #1353 with minor adjustments:
Hook called from get_session() after the existing _repair_stale_pending flow, so the no-token case still gets the "interrupted" marker.
Gating: active_stream_id set + stream not in STREAMS + last message is user role + wal_path(sid).exists().
Replay → append assistant message with _wal_recovered: True and conditional _partial: True (heuristic: ends in alphanumeric without sentence-ending punctuation).
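A minimal sketch of the gate and the partial-reply heuristic, assuming the wal_path/replay helpers sketched earlier and dict-shaped messages; this is not the ported _replay_wal_recovery() itself.

```python
def maybe_replay_wal(s, streams: dict, wal_path, replay) -> None:
    # Four-condition gate: only fire for a session that was mid-stream when
    # the process died and still has a sidecar log on disk.
    if not s.active_stream_id:
        return
    if s.active_stream_id in streams:            # stream is still live
        return
    if not s.messages or s.messages[-1].get("role") != "user":
        return
    path = wal_path(s.active_stream_id)
    if not path.exists():                        # cheap no-WAL fast path
        return

    recovered = replay(s.active_stream_id)       # {'content': ..., 'reasoning': ...}
    content = recovered["content"]
    if not content:
        return                                   # _repair_stale_pending already marked it
    msg = {"role": "assistant", "content": content, "_wal_recovered": True}
    # Heuristic for _partial: a reply cut off mid-sentence tends to end in an
    # alphanumeric character rather than sentence-ending punctuation.
    if content[-1].isalnum():
        msg["_partial"] = True
    s.messages.append(msg)
```

The path.exists() check is the only cost paid on the common no-WAL load, which keeps the fast path to a single stat() as required in the design constraints.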
Phase 4 — tests
Unit tests for the WAL primitive (tests/test_wal_recovery.py).
Critical: a non-integration regression test that exercises the streaming path with a stub agent and asserts the WAL file gets at least one start event after chat/start returns. This is the test that would have caught the writer-disconnect class of bug in #1353 (Fix/chat history WAL recovery).
Integration test (gated, optional): full crash-and-recover against a live server.
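A sketch of the shape that regression test could take; the two helpers marked as placeholders stand in for whatever fixtures the test suite actually provides.

```python
import json
from pathlib import Path

# Placeholders: a real test would use the app's own fixtures/helpers instead.
def point_wal_at(directory: Path) -> None: ...
def start_chat_with_stub_agent(prompt: str) -> None: ...

def test_stream_writes_wal_start_event(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_WEBUI_WAL_ENABLED", "1")
    point_wal_at(tmp_path)                        # redirect WAL output
    start_chat_with_stub_agent("hello")           # drive one text-only stream

    wal_files = list(tmp_path.glob("*.jsonl"))
    assert wal_files, "streaming path no longer writes a WAL"
    events = [json.loads(line) for line in wal_files[0].read_text().splitlines()]
    assert any(e["type"] == "start" for e in events)
```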
Phase 5 — rollout
Ship Phase 1+2+3+4 with HERMES_WEBUI_WAL_ENABLED=0 default. Document the env var. Solicit feedback from canary users.
After at least one full release cycle of canary feedback, flip default to 1.
Acceptance criteria
HERMES_WEBUI_WAL_ENABLED=1 + kill -9 mid-stream + reload → assistant message appears with _wal_recovered: True, "Recovered" badge visible, _partial flag accurate.
HERMES_WEBUI_WAL_ENABLED=0 (default) → zero WAL files written, zero perf change vs current master.
No regression in p50/p95 streaming latency on enabled mode (measured against a 10K-token text stream).
git grep "_wal\." in api/ shows write call sites in streaming.py and read call sites in models.py. No dead-code wrappers.
Non-integration test asserts WAL file gets a start event during a stub-agent stream — fails if streaming integration is removed.
User-facing docs in docs/ explain the env var and the recovery affordance.
Priority
Medium. Crash-recovery is real but rare in practice — the webui has to be killed mid-stream during a pure-text reply. The cheap part of the fix (UI affordance + repair-on-load) already works via _repair_stale_pending. The WAL is the upgrade from "agent was responding when interrupted" to "here is the actual partial reply." Worth doing carefully, not urgently.