Crash recovery: WAL design + integration for in-flight assistant tokens #1585

@nesquena-hermes

Description


Background

When hermes-webui is killed (OOM, SIGKILL, container restart, kernel panic) during an active streaming agent run, tokens streamed since the last checkpoint are lost. The session JSON reverts to showing only the user's pending message; the assistant reply that was mid-stream simply vanishes.

The diagnosis from PR #1353 (@JKJameson) is correct on both counts:

  1. _checkpoint_activity is incremented only in on_tool() — so a pure-text stream with zero tool calls is never flagged for periodic checkpointing.
  2. There is no token-level durability primitive. STREAM_PARTIAL_TEXT[stream_id] and STREAM_REASONING_TEXT[stream_id] accumulate the in-flight tokens in memory (introduced in Bug: Stop_generation button deletes LLM response #893 and bug(streaming): Stop/Cancel discards AI's already-streamed reasoning + tool output (and possibly user prompt) — paid tokens lost #1361), but they live only in process memory and are lost on crash.

There is a further wrinkle: the _checkpoint_activity increment alone wouldn't actually fix the bug. s.messages is never mutated during streaming; it only updates at api/streaming.py:2141 after agent.run_conversation() returns. The periodic 15s checkpoint thread saves s.pending_user_message + s.active_stream_id + s.pending_started_at, all of which are already on disk from the pre-stream s.save() at line 2092. So even with _checkpoint_activity bumping on every token, the checkpoint thread persists nothing new during a text-only stream.

The real fix is shadowing the in-memory token accumulators (STREAM_PARTIAL_TEXT, STREAM_REASONING_TEXT) to a sidecar log on disk, then replaying that log on session reload after a crash.

User impact

This bug class has been hit by users:

  • Webui lost my prompts #1217 (closed; the root cause there was different but related, namely context compaction, but the user-facing symptom of a vanished message is exactly what we're addressing here).
  • The streaming-loss path is silent: there is no error, no toast, no "session was interrupted" indicator. The user simply sends a message, sees the agent start to type, then on next reload the message and reply are gone.

Design constraints

From the closer reading after #1353:

  1. Durability target is "after-crash," not "after-power-loss." Per-event fsync() is overkill — the kernel's page cache will preserve append-only writes through any process death short of an unsynchronized power outage, which is not in scope.

  2. Cannot regress streaming throughput. A model emitting 60 tokens/sec must not see measurable latency increase from the durability layer. This rules out synchronous fsync() per token on the streaming hot path.

  3. Cannot regress disk I/O on shared storage. Some users mount ~/.hermes on NFS or other network filesystems. The durability layer must batch writes — at most one write() syscall per N tokens, where N is large enough that NFS write amplification is bounded.

  4. Must work without a dedicated background thread per session. Spawning a writer thread per active stream is expensive at scale (e.g. multi-tab cron-driven sessions); a single shared writer thread with a queue.Queue of (session_id, event) tuples is the right shape.

  5. Must be opt-in until perf is measured. Ship behind HERMES_WEBUI_WAL_ENABLED=1 env var; default off; flip the default after at least one full release cycle of canary feedback from users running on slow disks / network homedirs.

  6. Recovery side must be cheap on every session load. A Path.exists() check per get_session() is fine; reading and parsing a JSONL file on every session load is not. The existence check should short-circuit the no-WAL fast path within microseconds.

  7. Must compose with existing _repair_stale_pending / _apply_core_sync_or_error_marker. The existing repair flow already handles the no-token-recovered case (puts up an "agent was responding when interrupted" marker). The WAL layer fills in token-level recovery on top, only when a WAL file is present.

Salvage map from #1353

Direct salvage candidates with Co-authored-by: Josh Jameson <…> attribution:

  • JSONL event format: {type: "token"|"reasoning"|"tool"|"tool_result"|"start"|"end"|"apperror", timestamp, ...payload}. Sound shape, no rework needed.
  • api/wal.py:replay_wal() — the event accumulation logic (concatenate token text, concatenate reasoning, list tool calls + results, surface had_error). Direct port.
  • api/wal.py:read_wal() + delete_wal() — straightforward primitives.
  • api/models.py:_replay_wal_recovery() gating — the four-condition guard (active_stream_id set + stream not in STREAMS + last message is user role + WAL file exists). Sound logic.
  • _wal_recovered: True marker + UI "Recovered" badge — useful user-facing affordance.
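The replay accumulation described in the second bullet can be sketched as follows (function name from #1353; the exact event fields are assumptions based on the format above):

```python
import json
from pathlib import Path

def replay_wal(path: Path) -> dict:
    """Fold a JSONL WAL into recovered stream state (sketch)."""
    text, reasoning, tools, had_error = [], [], [], False
    for line in path.read_text(encoding='utf-8').splitlines():
        if not line.strip():
            continue
        ev = json.loads(line)
        kind, payload = ev.get('type'), ev.get('payload')
        if kind == 'token':
            text.append(payload)          # concatenate assistant text
        elif kind == 'reasoning':
            reasoning.append(payload)     # concatenate reasoning text
        elif kind in ('tool', 'tool_result'):
            tools.append(ev)              # keep tool events in order
        elif kind == 'apperror':
            had_error = True              # surface mid-stream errors
    return {
        'text': ''.join(text),
        'reasoning': ''.join(reasoning),
        'tools': tools,
        'had_error': had_error,
    }
```

Tolerating blank lines (and, in the real port, a trailing truncated line from a crash mid-write) keeps replay robust against a partially flushed final event.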

Build-fresh items:

  • No per-session in-memory buffer. Skip the _write_buffer + _token_counts + _last_flush_time dicts entirely. A single shared writer thread reading from a Queue gives the same ≤3s recovery bound with simpler semantics.
  • No _should_flush() / _WAL_FLUSH_TOKENS / _WAL_FLUSH_INTERVAL machinery. The shared writer thread drains the queue with queue.Queue.get(timeout=3.0) — every event flushes within min(N, 3s) automatically. No threshold tuning, no dead-code flush helpers.
  • Wire into existing STREAM_PARTIAL_TEXT[stream_id] callsite at api/streaming.py:1643 rather than duplicating accumulation in parallel. Pattern:
    if stream_id in STREAM_PARTIAL_TEXT:
        STREAM_PARTIAL_TEXT[stream_id] += str(text)
        _wal.put('token', stream_id, text)   # ← single 1-line addition
  • No _checkpoint_activity bump in on_token. As noted above, that change alone does nothing useful — the checkpoint thread persists state that's already on disk. The WAL is the load-bearing change; the existing checkpoint thread can stay as it is (it serves a different role, persisting bookkeeping in the success path).

Implementation sketch (for whoever picks this up)

Phase 1 — primitive (no streaming integration yet)

# api/wal.py — fresh implementation
import json, os, queue, threading, time
from pathlib import Path
from api.config import SESSION_DIR

_ENABLED = bool(int(os.environ.get('HERMES_WEBUI_WAL_ENABLED', '0')))
_queue: queue.Queue = queue.Queue(maxsize=10000)
_writer_thread: threading.Thread | None = None

def wal_path(sid: str) -> Path:
    return SESSION_DIR / f"{sid}_wal.jsonl"

def put(event_type: str, sid: str, payload):
    if not _ENABLED:
        return
    event = {'type': event_type, 'sid': sid, 'ts': int(time.time()), 'payload': payload}
    try:
        _queue.put_nowait(event)
    except queue.Full:
        pass  # best-effort durability: drop rather than block the streaming hot path

def _writer_loop():
    while True:
        try:
            event = _queue.get(timeout=3.0)
        except queue.Empty:
            continue
        sid = event.pop('sid')
        with open(wal_path(sid), 'a', encoding='utf-8') as f:
            f.write(json.dumps(event, ensure_ascii=False) + '\n')

def ensure_writer():
    global _writer_thread
    if _writer_thread is None or not _writer_thread.is_alive():
        _writer_thread = threading.Thread(target=_writer_loop, daemon=True, name='wal-writer')
        _writer_thread.start()

Phase 2 — streaming integration

5 call sites in api/streaming.py:

  1. Stream start (after STREAM_PARTIAL_TEXT[stream_id] = '' at line 1442): _wal.put('start', stream_id, {'session_id': session_id}) + _wal.ensure_writer().
  2. on_token (after STREAM_PARTIAL_TEXT[stream_id] += str(text) at line 1643): _wal.put('token', stream_id, str(text)).
  3. on_reasoning (after STREAM_REASONING_TEXT[stream_id] += str(text) at line 1656): _wal.put('reasoning', stream_id, str(text)).
  4. on_tool start + completed paths: _wal.put('tool', stream_id, {...}), _wal.put('tool_result', stream_id, {...}).
  5. Stream end success path (just before the STREAM_PARTIAL_TEXT.pop(stream_id, None) cleanup at line 2778): _wal.put('end', stream_id, {'clean': True}) + _wal.delete_wal_after_drain(stream_id) (helper that waits for the writer thread to drain its queue, then deletes the WAL file).

Critical: the cleanup delete_wal() must only run on the clean success path, not in the finally block. Cancel/error paths leave the WAL for replay.
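One possible shape for the drain-then-delete helper (hypothetical signature; it assumes the Phase 1 writer loop is extended to call q.task_done() after handling each event, which the sketch above does not yet do):

```python
import queue
from pathlib import Path

def delete_wal_after_drain(wal_file: Path, q: queue.Queue) -> None:
    """Block until every queued event has been written, then remove the WAL.

    Relies on Queue.join(), so the writer loop must call q.task_done()
    once per consumed event. Note that join() waits on events from *all*
    sessions sharing the queue; that is acceptable for a cleanup path
    that runs once per completed stream.
    """
    q.join()                          # wait for the shared writer to drain
    wal_file.unlink(missing_ok=True)  # clean success path only
```

Because this runs only on the clean end-of-stream path, a few extra milliseconds of blocking while other sessions' events drain is an acceptable trade for not tracking per-session watermarks.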

Phase 3 — recovery integration

api/models.py:_replay_wal_recovery() ported from #1353 with minor adjustments:

  • Hook called from get_session() after the existing _repair_stale_pending flow, so the no-token case still gets the "interrupted" marker.
  • Gating: active_stream_id set + stream not in STREAMS + last message is user role + wal_path(sid).exists().
  • Replay → append assistant message with _wal_recovered: True and conditional _partial: True (heuristic: ends in alphanumeric without sentence-ending punctuation).
  • Clear pending state, save, delete WAL.
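The _partial heuristic and the shape of the appended message can be sketched like this (a sketch only; the real hook also enforces the four-condition gating and mutates session state, which is not reproduced here):

```python
def looks_partial(text: str) -> bool:
    """Heuristic from the design above: a reply whose last character is
    alphanumeric (i.e. no sentence-ending punctuation) is treated as cut
    off mid-sentence."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1].isalnum()

def recover_message(replayed: dict) -> dict:
    # Assistant message appended during recovery; field names follow the
    # _wal_recovered / _partial markers described above.
    msg = {
        'role': 'assistant',
        'content': replayed['text'],
        '_wal_recovered': True,
    }
    if looks_partial(replayed['text']):
        msg['_partial'] = True
    return msg
```

Making _partial conditional keeps the "Recovered" badge honest: a stream that happened to die right after a sentence boundary reads as complete, and flagging it partial would be noise.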

Phase 4 — tests

  • Unit: round-trip event write → read → replay (port from tests/test_wal_recovery.py).
  • Critical: a non-integration regression test that exercises the streaming path with a stub agent and asserts the WAL file gets at least one start event after chat/start returns. This is the test that would have caught the writer-disconnect class of bug in Fix/chat history WAL recovery #1353.
  • Integration test (gated, optional): full crash-and-recover against a live server.
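The unit round-trip in the first bullet can be sketched directly against the JSONL format (helper names here are placeholders, not the #1353 API):

```python
import json
from pathlib import Path

def write_events(path: Path, events) -> None:
    # Append-only JSONL, one event per line, mirroring the writer loop.
    with open(path, 'a', encoding='utf-8') as f:
        for ev in events:
            f.write(json.dumps(ev, ensure_ascii=False) + '\n')

def read_events(path: Path) -> list:
    return [json.loads(line)
            for line in path.read_text(encoding='utf-8').splitlines()
            if line.strip()]
```

The append-mode open is the property worth asserting explicitly: a second write_events call must extend the file, not truncate it, or a mid-stream reopen would silently discard earlier tokens.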

Phase 5 — rollout

  • Ship Phase 1+2+3+4 with HERMES_WEBUI_WAL_ENABLED=0 default. Document the env var. Solicit feedback from canary users.
  • After at least one full release cycle of canary feedback, flip default to 1.

Acceptance criteria

  • HERMES_WEBUI_WAL_ENABLED=1 + kill -9 mid-stream + reload → assistant message appears with _wal_recovered: True, "Recovered" badge visible, _partial flag accurate.
  • HERMES_WEBUI_WAL_ENABLED=0 (default) → zero WAL files written, zero perf change vs current master.
  • No regression in p50/p95 streaming latency on enabled mode (measured against a 10K-token text stream).
  • git grep "_wal\." in api/ shows write call sites in streaming.py and read call sites in models.py. No dead-code wrappers.
  • Non-integration test asserts WAL file gets a start event during a stub-agent stream — fails if streaming integration is removed.
  • User-facing docs in docs/ explain the env var and the recovery affordance.

Priority

Medium. Crash-recovery is real but rare in practice — the user has to be killed mid-stream during a pure-text reply. The cheap part of the fix (UI affordance + repair-on-load) already works via _repair_stale_pending. The WAL is the upgrade from "agent was responding when interrupted" to "here is the actual partial reply." Worth doing carefully, not urgently.
