Crash recovery: WAL design + integration for in-flight assistant tokens #1585

@nesquena-hermes

Description


Background

When hermes-webui is killed (OOM, SIGKILL, container restart, kernel panic) during an active streaming agent run, tokens streamed since the last checkpoint are lost. The session JSON reverts to showing only the user's pending message; the assistant reply that was mid-stream simply vanishes.

The diagnosis from PR #1353 (@JKJameson) is correct on both counts:

  1. _checkpoint_activity is incremented only in on_tool() — so a pure-text stream with zero tool calls is never flagged for periodic checkpointing.
  2. There is no token-level durability primitive. STREAM_PARTIAL_TEXT[stream_id] and STREAM_REASONING_TEXT[stream_id] accumulate the in-flight tokens in memory (introduced in Bug: Stop_generation button deletes LLM response #893 and bug(streaming): Stop/Cancel discards AI's already-streamed reasoning + tool output (and possibly user prompt) — paid tokens lost #1361), but they live only in process memory and are lost on crash.

There is a further wrinkle: the _checkpoint_activity increment alone wouldn't actually fix the bug. s.messages is never mutated during streaming; it only updates at api/streaming.py:2141 after agent.run_conversation() returns. The periodic 15s checkpoint thread saves s.pending_user_message + s.active_stream_id + s.pending_started_at, all of which are already on disk from the pre-stream s.save() at line 2092. So even with _checkpoint_activity bumping on every token, the checkpoint thread persists nothing new during a text-only stream.

The real fix is shadowing the in-memory token accumulators (STREAM_PARTIAL_TEXT, STREAM_REASONING_TEXT) to a sidecar log on disk, then replaying that log on session reload after a crash.

User impact

This bug class has been hit by users:

  • Webui lost my prompts #1217 (closed; the root cause there was different but related, namely context compaction, but the user-facing symptom of a vanished message is exactly what we're addressing here).
  • The streaming-loss path is silent: there is no error, no toast, no "session was interrupted" indicator. The user simply sends a message, sees the agent start to type, then on next reload the message and reply are gone.

Design constraints

From the closer reading after #1353:

  1. Durability target is "after-crash," not "after-power-loss." Per-event fsync() is overkill — the kernel's page cache will preserve append-only writes through any process death short of an unsynchronized power outage, which is not in scope.

  2. Cannot regress streaming throughput. A model emitting 60 tokens/sec must not see measurable latency increase from the durability layer. This rules out synchronous fsync() per token on the streaming hot path.

  3. Cannot regress disk I/O on shared storage. Some users mount ~/.hermes on NFS or other network filesystems. The durability layer must batch writes — at most one write() syscall per N tokens, where N is large enough that NFS write amplification is bounded.

  4. Must work without a dedicated background thread per session. Spawning a writer thread per active stream is expensive at scale (e.g. multi-tab cron-driven sessions); a single shared writer thread with a queue.Queue of (session_id, event) tuples is the right shape.

  5. Must be opt-in until perf is measured. Ship behind HERMES_WEBUI_WAL_ENABLED=1 env var; default off; flip the default after at least one full release cycle of canary feedback from users running on slow disks / network homedirs.

  6. Recovery side must be cheap on every session load. A Path.exists() check per get_session() is fine; reading and parsing a JSONL file on every session load is not. The existence check should short-circuit the no-WAL fast path within microseconds.

  7. Must compose with existing _repair_stale_pending / _apply_core_sync_or_error_marker. The existing repair flow already handles the no-token-recovered case (puts up an "agent was responding when interrupted" marker). The WAL layer fills in token-level recovery on top, only when a WAL file is present.

Salvage map from #1353

Direct salvage candidates with Co-authored-by: Josh Jameson <…> attribution:

  • JSONL event format: {type: "token"|"reasoning"|"tool"|"tool_result"|"start"|"end"|"apperror", timestamp, ...payload}. Sound shape, no rework needed.
  • api/wal.py:replay_wal() — the event accumulation logic (concatenate token text, concatenate reasoning, list tool calls + results, surface had_error). Direct port.
  • api/wal.py:read_wal() + delete_wal() — straightforward primitives.
  • api/models.py:_replay_wal_recovery() gating — the four-condition guard (active_stream_id set + stream not in STREAMS + last message is user role + WAL file exists). Sound logic.
  • _wal_recovered: True marker + UI "Recovered" badge — useful user-facing affordance.
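The replay accumulation described in the second bullet can be sketched as follows (function name from #1353; the exact event fields are assumptions based on the format above):

```python
import json
from pathlib import Path

def replay_wal(path: Path) -> dict:
    """Fold a JSONL WAL into recovered stream state (sketch)."""
    text, reasoning, tools, had_error = [], [], [], False
    for line in path.read_text(encoding='utf-8').splitlines():
        if not line.strip():
            continue
        ev = json.loads(line)
        kind, payload = ev.get('type'), ev.get('payload')
        if kind == 'token':
            text.append(payload)          # concatenate assistant text
        elif kind == 'reasoning':
            reasoning.append(payload)     # concatenate reasoning text
        elif kind in ('tool', 'tool_result'):
            tools.append(ev)              # keep tool events in order
        elif kind == 'apperror':
            had_error = True              # surface mid-stream errors
    return {
        'text': ''.join(text),
        'reasoning': ''.join(reasoning),
        'tools': tools,
        'had_error': had_error,
    }
```

Tolerating blank lines (and, in the real port, a trailing truncated line from a crash mid-write) keeps replay robust against a partially flushed final event.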

Build-fresh items:

  • No per-session in-memory buffer. Skip the _write_buffer + _token_counts + _last_flush_time dicts entirely. A single shared writer thread reading from a Queue gives the same ≤3s recovery bound with simpler semantics.
  • No _should_flush() / _WAL_FLUSH_TOKENS / _WAL_FLUSH_INTERVAL machinery. The shared writer thread drains the queue with queue.Queue.get(timeout=3.0) — every event flushes within min(N, 3s) automatically. No threshold tuning, no dead-code flush helpers.
  • Wire into existing STREAM_PARTIAL_TEXT[stream_id] callsite at api/streaming.py:1643 rather than duplicating accumulation in parallel. Pattern:
    if stream_id in STREAM_PARTIAL_TEXT:
        STREAM_PARTIAL_TEXT[stream_id] += str(text)
        _wal.put('token', stream_id, text)   # ← single 1-line addition
  • No _checkpoint_activity bump in on_token. As noted above, that change alone does nothing useful — the checkpoint thread persists state that's already on disk. The WAL is the load-bearing change; the existing checkpoint thread can stay as it is (it serves a different role, persisting bookkeeping in the success path).

Implementation sketch (for whoever picks this up)

Phase 1 — primitive (no streaming integration yet)

# api/wal.py — fresh implementation
import json, os, queue, threading, time
from pathlib import Path
from api.config import SESSION_DIR

_ENABLED = bool(int(os.environ.get('HERMES_WEBUI_WAL_ENABLED', '0')))
_queue: queue.Queue = queue.Queue(maxsize=10000)
_writer_thread: threading.Thread | None = None

def wal_path(sid: str) -> Path:
    return SESSION_DIR / f"{sid}_wal.jsonl"

def put(event_type: str, sid: str, payload):
    if not _ENABLED:
        return
    event = {'type': event_type, 'sid': sid, 'ts': int(time.time()), 'payload': payload}
    try:
        _queue.put_nowait(event)
    except queue.Full:
        pass  # best-effort durability: drop rather than block the streaming hot path

def _writer_loop():
    while True:
        try:
            event = _queue.get(timeout=3.0)
        except queue.Empty:
            continue
        sid = event.pop('sid')
        with open(wal_path(sid), 'a', encoding='utf-8') as f:
            f.write(json.dumps(event, ensure_ascii=False) + '\n')

def ensure_writer():
    global _writer_thread
    if _writer_thread is None or not _writer_thread.is_alive():
        _writer_thread = threading.Thread(target=_writer_loop, daemon=True, name='wal-writer')
        _writer_thread.start()

Phase 2 — streaming integration

5 call sites in api/streaming.py:

  1. Stream start (after STREAM_PARTIAL_TEXT[stream_id] = '' at line 1442): _wal.put('start', stream_id, {'session_id': session_id}) + _wal.ensure_writer().
  2. on_token (after STREAM_PARTIAL_TEXT[stream_id] += str(text) at line 1643): _wal.put('token', stream_id, str(text)).
  3. on_reasoning (after STREAM_REASONING_TEXT[stream_id] += str(text) at line 1656): _wal.put('reasoning', stream_id, str(text)).
  4. on_tool start + completed paths: _wal.put('tool', stream_id, {...}), _wal.put('tool_result', stream_id, {...}).
  5. Stream end success path (just before the STREAM_PARTIAL_TEXT.pop(stream_id, None) cleanup at line 2778): _wal.put('end', stream_id, {'clean': True}) + _wal.delete_wal_after_drain(stream_id) (helper that waits for the writer thread to drain its queue, then deletes the WAL file).

Critical: the cleanup delete_wal() must only run on the clean success path, not in the finally block. Cancel/error paths leave the WAL for replay.
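One possible shape for the drain-then-delete helper (hypothetical signature; it assumes the Phase 1 writer loop is extended to call q.task_done() after handling each event, which the sketch above does not yet do):

```python
import queue
from pathlib import Path

def delete_wal_after_drain(wal_file: Path, q: queue.Queue) -> None:
    """Block until every queued event has been written, then remove the WAL.

    Relies on Queue.join(), so the writer loop must call q.task_done()
    once per consumed event. Note that join() waits on events from *all*
    sessions sharing the queue; that is acceptable for a cleanup path
    that runs once per completed stream.
    """
    q.join()                          # wait for the shared writer to drain
    wal_file.unlink(missing_ok=True)  # clean success path only
```

Because this runs only on the clean end-of-stream path, a few extra milliseconds of blocking while other sessions' events drain is an acceptable trade for not tracking per-session watermarks.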

Phase 3 — recovery integration

api/models.py:_replay_wal_recovery() ported from #1353 with minor adjustments:

  • Hook called from get_session() after the existing _repair_stale_pending flow, so the no-token case still gets the "interrupted" marker.
  • Gating: active_stream_id set + stream not in STREAMS + last message is user role + wal_path(sid).exists().
  • Replay → append assistant message with _wal_recovered: True and conditional _partial: True (heuristic: ends in alphanumeric without sentence-ending punctuation).
  • Clear pending state, save, delete WAL.
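The _partial heuristic and the shape of the appended message can be sketched like this (a sketch only; the real hook also enforces the four-condition gating and mutates session state, which is not reproduced here):

```python
def looks_partial(text: str) -> bool:
    """Heuristic from the design above: a reply whose last character is
    alphanumeric (i.e. no sentence-ending punctuation) is treated as cut
    off mid-sentence."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1].isalnum()

def recover_message(replayed: dict) -> dict:
    # Assistant message appended during recovery; field names follow the
    # _wal_recovered / _partial markers described above.
    msg = {
        'role': 'assistant',
        'content': replayed['text'],
        '_wal_recovered': True,
    }
    if looks_partial(replayed['text']):
        msg['_partial'] = True
    return msg
```

Making _partial conditional keeps the "Recovered" badge honest: a stream that happened to die right after a sentence boundary reads as complete, and flagging it partial would be noise.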

Phase 4 — tests

  • Unit: round-trip event write → read → replay (port from tests/test_wal_recovery.py).
  • Critical: a non-integration regression test that exercises the streaming path with a stub agent and asserts the WAL file gets at least one start event after chat/start returns. This is the test that would have caught the writer-disconnect class of bug in Fix/chat history WAL recovery #1353.
  • Integration test (gated, optional): full crash-and-recover against a live server.
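The unit round-trip in the first bullet can be sketched directly against the JSONL format (helper names here are placeholders, not the #1353 API):

```python
import json
from pathlib import Path

def write_events(path: Path, events) -> None:
    # Append-only JSONL, one event per line, mirroring the writer loop.
    with open(path, 'a', encoding='utf-8') as f:
        for ev in events:
            f.write(json.dumps(ev, ensure_ascii=False) + '\n')

def read_events(path: Path) -> list:
    return [json.loads(line)
            for line in path.read_text(encoding='utf-8').splitlines()
            if line.strip()]
```

The append-mode open is the property worth asserting explicitly: a second write_events call must extend the file, not truncate it, or a mid-stream reopen would silently discard earlier tokens.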

Phase 5 — rollout

  • Ship Phase 1+2+3+4 with HERMES_WEBUI_WAL_ENABLED=0 default. Document the env var. Solicit feedback from canary users.
  • After at least one full release cycle of canary feedback, flip default to 1.

Acceptance criteria

  • HERMES_WEBUI_WAL_ENABLED=1 + kill -9 mid-stream + reload → assistant message appears with _wal_recovered: True, "Recovered" badge visible, _partial flag accurate.
  • HERMES_WEBUI_WAL_ENABLED=0 (default) → zero WAL files written, zero perf change vs current master.
  • No regression in p50/p95 streaming latency on enabled mode (measured against a 10K-token text stream).
  • git grep "_wal\." in api/ shows write call sites in streaming.py and read call sites in models.py. No dead-code wrappers.
  • Non-integration test asserts WAL file gets a start event during a stub-agent stream — fails if streaming integration is removed.
  • User-facing docs in docs/ explain the env var and the recovery affordance.

Priority

Medium. Crash-recovery is real but rare in practice — the user has to be killed mid-stream during a pure-text reply. The cheap part of the fix (UI affordance + repair-on-load) already works via _repair_stale_pending. The WAL is the upgrade from "agent was responding when interrupted" to "here is the actual partial reply." Worth doing carefully, not urgently.
