
fix(backend): snapshot compaction for AGUI events to prevent OOM #914

Open
Gkrumbach07 wants to merge 1 commit into main from worktree-agui-snapshot-compaction

Conversation

@Gkrumbach07
Contributor

Summary

  • Backend was OOMKilled (512Mi limit) due to loading 36K+ events into memory per SSE client reconnect
  • Implements AG-UI snapshot compaction: collapses finished event streams into MESSAGES_SNAPSHOT events (36K events → ~3 events)
  • Caches compacted snapshots to disk (agui-events-compacted.jsonl) with atomic writes; subsequent reads serve from cache
  • Removes dead compactStreamingEvents delta compaction code (180 lines, replaced by snapshot compaction)
  • Bumps backend memory limit from 512Mi → 768Mi as a safety net

Key changes

  • compactToSnapshots() — assembles TEXT_MESSAGE and TOOL_CALL sequences into Message objects per AG-UI spec
  • loadEventsForReplay() — serves cached snapshots for finished sessions, raw events for active runs
  • Cache invalidation on RUN_STARTED and RUN_ERROR
  • Uses strings.Builder for O(n) delta concatenation (was O(n²) via +=)
  • Reuses existing readJSONLFile helper instead of duplicating
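
To illustrate the assembly step in the list above, here is a minimal sketch of folding TEXT_MESSAGE_START/CONTENT/END events into whole messages with strings.Builder. The event and message types, their field names, and assembleTextMessages are assumptions for illustration; the PR's compactToSnapshots also handles tool calls, RAW passthrough, and metadata, which this sketch omits.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified, hypothetical stand-in for an AG-UI event.
type event struct {
	Type      string
	MessageID string
	Role      string
	Delta     string
}

// message is a simplified, hypothetical stand-in for an assembled message.
type message struct {
	ID      string
	Role    string
	Content string
}

// assembleTextMessages concatenates the deltas of each text message and
// emits one message per TEXT_MESSAGE_END, in the order the messages end.
func assembleTextMessages(events []event) []message {
	type pending struct {
		role string
		buf  strings.Builder // appends are amortized O(1), unlike s += delta
	}
	open := map[string]*pending{}
	var out []message
	for _, ev := range events {
		switch ev.Type {
		case "TEXT_MESSAGE_START":
			open[ev.MessageID] = &pending{role: ev.Role}
		case "TEXT_MESSAGE_CONTENT":
			if p, ok := open[ev.MessageID]; ok {
				p.buf.WriteString(ev.Delta)
			}
		case "TEXT_MESSAGE_END":
			if p, ok := open[ev.MessageID]; ok {
				out = append(out, message{ID: ev.MessageID, Role: p.role, Content: p.buf.String()})
				delete(open, ev.MessageID)
			}
		}
	}
	return out
}

func main() {
	events := []event{
		{Type: "TEXT_MESSAGE_START", MessageID: "m1", Role: "assistant"},
		{Type: "TEXT_MESSAGE_CONTENT", MessageID: "m1", Delta: "Hello, "},
		{Type: "TEXT_MESSAGE_CONTENT", MessageID: "m1", Delta: "world"},
		{Type: "TEXT_MESSAGE_END", MessageID: "m1"},
	}
	for _, m := range assembleTextMessages(events) {
		fmt.Printf("%s (%s): %s\n", m.ID, m.Role, m.Content)
	}
}
```

Messages that never receive a TEXT_MESSAGE_END are dropped in this sketch; whether the real implementation keeps or discards them is not stated in the description.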

Test plan

  • All existing websocket tests pass (31 tests)
  • New TestCompactToSnapshots — verifies text messages, tool calls, RAW passthrough, metadata preservation
  • New TestLoadEventsForReplay — verifies finished/active session handling, cache write/read, cache invalidation
  • go vet, gofmt, go build all clean
  • Deploy to dev cluster and verify SSE reconnect works for finished and active sessions

🤖 Generated with Claude Code

The backend was OOMKilled (512Mi limit) when replaying large event
streams for finished sessions. Multiple concurrent SSE clients each
loaded 36K+ events into memory and ran delta compaction, exceeding
the memory limit within ~44 seconds.

This implements AG-UI snapshot compaction per the serialization spec:
finished sessions are collapsed into MESSAGES_SNAPSHOT events (36K
events → ~3 events), cached to disk, and served from cache on
subsequent reads.

Changes:
- Add compactToSnapshots() using AG-UI MESSAGES_SNAPSHOT pattern
- Add disk caching (agui-events-compacted.jsonl) with atomic writes
- Invalidate cache on RUN_STARTED and RUN_ERROR events
- Use strings.Builder for O(n) delta concatenation (was O(n²))
- Reuse existing readJSONLFile helper instead of duplicating
- Remove dead compactStreamingEvents (180 lines, no longer called)
- Bump backend memory limit from 512Mi to 768Mi as safety net

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 13, 2026

Walkthrough

The changes refactor event handling in the websocket backend by replacing delta-based compaction with a snapshot-based approach. Finished event streams are now converted to MESSAGES_SNAPSHOT events containing fully assembled messages and tool calls. The proxy handler is simplified to use a new loadEventsForReplay function uniformly. Backend memory resource limits are increased.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Event Compaction and Replay: `components/backend/websocket/agui_store.go` | Introduced compactToSnapshots processor that aggregates TEXT_MESSAGE_START/CONTENT/END and TOOL_CALL_START/ARGS/END event sequences into MESSAGES_SNAPSHOT objects. Added loadEventsForReplay that serves cached snapshots or loads raw events and compacts them for finished runs. Added writeCompactedFile for atomic snapshot persistence. Invalidates compacted cache on RunStarted or RunError events. |
| Proxy Handler Simplification: `components/backend/websocket/agui_proxy.go` | Removed conditional finish-vs-active run detection logic and associated event compaction/streaming branches. Now uniformly streams events via SSE after loading through loadEventsForReplay, delegating replay semantics handling to the store layer. |
| Event Handling Tests: `components/backend/websocket/agui_store_test.go` | Added comprehensive tests for compactToSnapshots covering text message collapsing, tool call handling, RAW event passthrough, metadata preservation, and empty input. Added loadEventsForReplay tests validating cache behavior, finished vs. active session handling, and cache invalidation on new RUN_STARTED events. |
| Resource Configuration: `components/manifests/base/backend-deployment.yaml` | Increased backend-api container memory requests from 128Mi to 256Mi and memory limits from 512Mi to 768Mi. |
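
The replay semantics summarized above (cached snapshots for finished runs, raw events for active ones) could look roughly like the following. This is a sketch under stated assumptions, not the PR's loadEventsForReplay: the function names, the loader callbacks, and the rule that a run is active when its latest RUN_STARTED has no later RUN_FINISHED or RUN_ERROR are all hypothetical.

```go
package main

import "fmt"

// event is a simplified, hypothetical stand-in for an AG-UI event.
type event struct{ Type string }

// runActive reports whether the most recent RUN_STARTED has not yet been
// followed by a terminal event (assumed here to be RUN_FINISHED or RUN_ERROR).
func runActive(events []event) bool {
	active := false
	for _, ev := range events {
		switch ev.Type {
		case "RUN_STARTED":
			active = true
		case "RUN_FINISHED", "RUN_ERROR":
			active = false
		}
	}
	return active
}

// eventsForReplay prefers the compacted cache. This relies on the cache being
// invalidated whenever a new run starts, so a present cache implies a
// finished session. On a cache miss it loads raw events and compacts them
// only if no run is active.
func eventsForReplay(
	loadCache, loadRaw func() ([]event, error),
	compact func([]event) []event,
) ([]event, error) {
	if cached, err := loadCache(); err == nil && len(cached) > 0 {
		return cached, nil
	}
	raw, err := loadRaw()
	if err != nil {
		return nil, err
	}
	if runActive(raw) {
		return raw, nil
	}
	return compact(raw), nil
}

func main() {
	noCache := func() ([]event, error) { return nil, nil }
	raw := func() ([]event, error) {
		return []event{{"RUN_STARTED"}, {"TEXT_MESSAGE_START"}, {"RUN_FINISHED"}}, nil
	}
	toSnapshot := func([]event) []event { return []event{{"MESSAGES_SNAPSHOT"}} }

	evs, err := eventsForReplay(noCache, raw, toSnapshot)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(evs), evs[0].Type)
}
```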

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing snapshot compaction for AGUI events to prevent out-of-memory issues. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, clearly explaining the problem, solution, key changes, and test plan. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/backend/websocket/agui_store_test.go`:
- Around line 521-522: The test currently uses a fixed sleep to wait for the
async cache write which is racy; update the test to synchronize
deterministically: either change writeCompactedFile to expose a synchronization
primitive (return a done channel or accept a *sync.WaitGroup) and wait on that
in the test, or replace the time.Sleep in agui_store_test.go with a polling loop
that checks for the file’s existence (os.Stat) with a short interval and overall
timeout (failing the test if timeout elapses). Remove the time.Sleep(100 *
time.Millisecond) and use the chosen synchronization approach around
writeCompactedFile to avoid flakes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 13c147d5-0432-4476-9f09-606d669d7fd3

📥 Commits

Reviewing files that changed from the base of the PR and between 4c40817 and 8de8a27.

📒 Files selected for processing (4)
  • components/backend/websocket/agui_proxy.go
  • components/backend/websocket/agui_store.go
  • components/backend/websocket/agui_store_test.go
  • components/manifests/base/backend-deployment.yaml

@ambient-code ambient-code bot deleted a comment from Gkrumbach07 Mar 17, 2026
@ambient-code
Contributor

ambient-code bot commented Mar 17, 2026

Review Queue Status

| Check | Status | Detail |
|---|---|---|
| CI | FAIL | Backend Unit Tests, summary |
| Conflicts | pass | |
| Reviews | pass | |

Action needed: Fix CI failures

Auto-generated by Review Queue workflow. Updated when PR changes.
