fix(backend): stream event logs to prevent OOMKill #890

Open

Gkrumbach07 wants to merge 1 commit into main from fix/backend-oom-streaming-events-52922
Conversation

@Gkrumbach07 (Contributor) commented Mar 12, 2026

Summary

  • Stream AG-UI event logs with bufio.Scanner instead of os.ReadFile to avoid loading entire JSONL files into memory
  • Apply same streaming fix to readJSONLFile in export.go
  • Cap error response body reads to 1KB with io.LimitReader
  • Increase session map cleanup frequency (10min → 2min interval, 1hr → 10min stale threshold)
  • Add K8s-level pagination (500 items/page) to ListSessions

Root Cause

loadEvents() in agui_store.go used os.ReadFile() to load entire event log files into memory. For long-running sessions, these .jsonl files grow to hundreds of MB. Each GET /agui/events request loaded the full file, causing the backend pod to OOMKill at the 512Mi limit.

Test plan

  • gofmt — clean
  • go vet ./... — clean
  • go build ./... — clean
  • go test ./websocket/... — passing
  • go test ./handlers/... — passing (1 pre-existing failure unrelated to changes)

Fixes: RHOAIENG-52922

Jira: RHOAIENG-52922

The backend was loading entire AG-UI event log files into memory
via os.ReadFile(), causing OOMKill on UAT with 512Mi limit.

- Replace os.ReadFile with bufio.Scanner for streaming event reads
- Apply same streaming fix to readJSONLFile in export.go
- Add io.LimitReader to error response body reads (capped at 1KB)
- Increase session map cleanup frequency (10min -> 2min)
- Reduce stale session threshold (1hr -> 10min)
- Add K8s-level pagination to ListSessions

Fixes: RHOAIENG-52922

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot commented Mar 12, 2026

Walkthrough

The changes reduce backend memory usage: pagination for Kubernetes API calls in session listing, streaming file reads with bounded buffers for event loading and JSONL parsing, and tighter session-staleness detection via shorter thresholds and cleanup intervals.

Changes

Cohort / File(s) Summary
Session Listing Pagination
components/backend/handlers/sessions.go
Switches ListSessions from fetching all items in a single API call to paginated retrieval using a page size of 500. Aggregates results across pages using Limit and Continue tokens with per-page error handling and early termination on failure.
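The paging loop follows the standard Kubernetes Limit/Continue pattern. A self-contained sketch with a stubbed list call — listPage and the string items are stand-ins for the real client-go call with metav1.ListOptions:

```go
package main

import "fmt"

// listPage mimics a Kubernetes List call that returns at most limit items
// plus a continuation point; the real code would set metav1.ListOptions
// Limit and Continue. This stub pages over a fixed slice for illustration.
func listPage(all []string, limit, cont int) (items []string, next int) {
	end := cont + limit
	if end >= len(all) {
		return all[cont:], -1 // -1 means no continue token
	}
	return all[cont:end], end
}

// listAll aggregates results across pages — the pattern ListSessions now uses.
func listAll(all []string, limit int) []string {
	var out []string
	cont := 0
	for {
		items, next := listPage(all, limit, cont)
		out = append(out, items...)
		if next < 0 {
			break
		}
		cont = next
	}
	return out
}

func main() {
	sessions := make([]string, 1234)
	for i := range sessions {
		sessions[i] = fmt.Sprintf("session-%d", i)
	}
	// Three pages at limit 500: 500 + 500 + 234 items.
	fmt.Println(len(listAll(sessions, 500))) // 1234
}
```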
WebSocket Session Management
components/backend/websocket/agui_proxy.go
Adds maxErrorBodyBytes constant to cap error response body reads via io.LimitReader. Reduces staleSessionThreshold from 1 hour to 10 minutes and staleSessionCleanupInterval from 10 minutes to 2 minutes for more aggressive session pruning.
Event and Log Streaming
components/backend/websocket/agui_store.go, components/backend/websocket/export.go
Replaces full-file read operations with streaming via bufio.Scanner for loadEvents and readJSONLFile. Introduces 1MB scanner buffer, deferred file closure, per-line JSON parsing, and error handling to avoid loading entire files into memory.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 71.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check — ✅ Passed: The title accurately captures the main change (replacing file loading with streaming to prevent OOMKill); it's concise and directly related to the core objective of the changeset.
Description check — ✅ Passed: The pull request description accurately reflects all the changes made across the four files and clearly articulates the root cause and motivation for the fixes.


coderabbitai bot left a comment
Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/backend/websocket/agui_store.go (1)

162-205: ⚠️ Potential issue | 🟠 Major

loadEvents still materializes the entire event log into memory, and now fails silently on large individual events.

The change from os.ReadFile to bufio.Scanner avoids one buffer allocation but doesn't address the core issue: every decoded event map is appended to the events slice before being returned and replayed. For long-running sessions, this still causes the same OOM scenario. Additionally, if any single JSONL row exceeds the 1 MiB scanner buffer, Scan() returns false with bufio.ErrTooLong, the loop exits, and HandleAGUIEvents returns a partial event history to the client with no indication of truncation. To fix both issues, the replay logic needs to stream events without first building the complete slice in memory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/backend/websocket/agui_store.go` around lines 162 - 205,
loadEvents currently reads and decodes every JSONL event into an in-memory slice
and can silently fail on lines >1MiB; change it to stream-decoded events instead
of materializing them: replace loadEvents(sessionID string)
[]map[string]interface{} with a streaming variant used by HandleAGUIEvents such
as loadEvents(sessionID string, emit func(map[string]interface{}) error) error
(or return a receive-only channel), keep the existing migration logic
(MigrateLegacySessionToAGUI) and file/open retry, then inside the scanner loop
unmarshal each line and immediately call emit(evt) (handling and propagating
errors from emit), and explicitly detect scanner.Err() and bufio.ErrTooLong to
log/return a clear error instead of returning a truncated partial slice.
components/backend/websocket/export.go (1)

202-227: ⚠️ Potential issue | 🟠 Major

This still materializes large exports in memory and caps individual JSONL lines at 1 MiB.

Although file reading is now streaming via bufio.Scanner, the entire decoded dataset is accumulated into the events slice (line 224), then fully re-marshaled with json.MarshalIndent (lines 120, 144) before being sent to the client. This approach fails to address the original peak-memory concern for large session exports. Additionally, any single JSONL line exceeding 1 MiB will cause the scanner to abort with ErrTooLong (line 211 sets the hard limit), making large events unexportable. To handle large exports safely, stream the JSON array directly to c.Writer without buffering the full decoded dataset or imposing per-line size limits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/backend/websocket/export.go` around lines 202 - 227, The current
readJSONLFile function plus subsequent json.MarshalIndent calls (used earlier
when writing to c.Writer) accumulate the whole export in memory and cap per-line
size via Scanner.Buffer; instead, remove readJSONLFile usage and stream the
JSONL file directly to c.Writer: open the file, set response headers, write the
'[' then iterate over lines using a reader that doesn't impose a 1MiB scanner
limit (e.g., bufio.Reader.ReadBytes('\n') or io.Reader with
json.Decoder/streaming), for each line validate or wrap as json.RawMessage and
write comma-separated entries directly to c.Writer (flushing if using a
streaming context), then write ']' and handle errors—this ensures you never
json.MarshalIndent the full events slice and avoids Scanner's ErrTooLong while
keeping the export streamed to c.Writer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9a3f8d16-70c8-41f0-acc3-373af6ae6499

📥 Commits

Reviewing files that changed from the base of the PR and between 538ccbd and 3a62f0d.

📒 Files selected for processing (4)
  • components/backend/handlers/sessions.go
  • components/backend/websocket/agui_proxy.go
  • components/backend/websocket/agui_store.go
  • components/backend/websocket/export.go

Comment on lines +76 to +79

     const staleSessionThreshold = 10 * time.Minute

     // staleSessionCleanupInterval is how often the cleanup goroutine runs.
    -const staleSessionCleanupInterval = 10 * time.Minute
    +const staleSessionCleanupInterval = 2 * time.Minute

⚠️ Potential issue | 🟠 Major

Don't expire live session caches after 10 minutes.

sessionLastSeen is only refreshed when HandleAGUIEvents and HandleAGUIRunProxy start handling a request (Lines 141 and 266), not while an SSE stream or runner stream is still active. With a 10-minute threshold, cleanup can evict the cached runner port for a session that is still alive, so later /agui/interrupt, /agui/feedback, and similar calls fall back to handlers.DefaultRunnerPort and break non-default runners.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/backend/websocket/agui_proxy.go` around lines 76 - 79, The
staleSessionThreshold of 10 minutes causes live sessions to be evicted because
sessionLastSeen is only updated when HandleAGUIEvents and HandleAGUIRunProxy
start handling a request; fix by ensuring active streams keep the session alive:
either raise staleSessionThreshold to a much larger value (e.g., hours) or,
preferably, update sessionLastSeen periodically while SSE/runner streams are
active inside HandleAGUIEvents and HandleAGUIRunProxy (e.g., touch
sessionLastSeen on each send/heartbeat or in the stream loop), so cleanup won’t
evict cached runner ports and calls that rely on handlers.DefaultRunnerPort
fallback won’t break non-default runners.
