WIP: stop persisting empty assistant messages, tolerate corrupt tool_use args by birdayz · Pull Request #123 · redpanda-data/ai-sdk-go

birdayz · 2026-04-28T11:25:26Z

What

Two related changes that together unwedge sessions hit by the empty-content / corrupt-tool-args failure modes around #116.

1. Agent loop skips persistence when len(resp.Message.Content) == 0. Empty assistant messages no longer land in session state, where Anthropic and most other providers reject every subsequent replay with messages.N.content: Field required.
2. ToolRequest.ArgumentsAsObject() collapses nil / invalid Arguments to an empty object, so request mappers can feed provider tool_use APIs without faceplanting on truncated streaming state. Bedrock and Gemini request mappers adopt it. Anthropic and OpenAI Compatible request mappers still need to switch over; that is the WIP part.

Why

#116 didn't introduce a new bug — it exposed a pre-existing one. Pre-#116, a stream cut off mid-tool-args left partial JSON in Content, so the message was non-empty and the persistence path was fine; the corruption surfaced differently (the JSON-parse-error wedge that #116 already covers). Post-#116, the partial tool_use is correctly dropped, but if it was the only block, Content is empty — and the agent loop appends that empty Message to session state, locking the session forever.

Sources of empty content observed in production:

- max_tokens hit before any content block was emitted
- the only content block was a partial tool_use that stream finalisation correctly dropped (fix: surface truncation instead of corrupting state when tool_use is cut off #116)
- refusal / safety filter with no text emitted

In all of these the FinishReason carries the truth (Length, ContentFilter, etc.) and the agent loop's existing terminal-reason handling fires. The bug was that persistence ran before the terminal-reason check, so the empty Message landed in session state regardless of how cleanly the loop terminated. The session stayed wedged across pod restarts because the empty Message was already persisted.

The corrupt-tool-args side of the same problem: a session that already has truncated JSON in its history needs to be replayable. Erroring at json.Unmarshal time means the model can never recover, even though the paired tool_result carries the original parse error and the model has enough context to retry.

Implementation details

Persistence guard at the session-store boundary (agent/llmagent/llmagent.go). One-line check: skip the append when content is empty. The MessageEvent still fires (observers see what happened) and the FinishReason still propagates (terminal-reason handling fires below). Single chokepoint for "what gets persisted" — guard it there, not in every provider. Same shape as adk-go's AppendEvent honouring Event.Partial.
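
A minimal sketch of the guard described above, not the code in agent/llmagent/llmagent.go. The types `Part`, `Message`, `Response`, `Session` and the helper `recordResponse` are hypothetical stand-ins invented for illustration; only the `len(resp.Message.Content) == 0` condition comes from this PR.

```go
package main

import "fmt"

// Hypothetical stand-ins for the SDK's types; the real definitions may differ.
type Part struct{ Text string }

type Message struct {
	Role    string
	Content []Part
}

type Response struct {
	Message      Message
	FinishReason string
}

// Session is a toy in-memory session store.
type Session struct{ Messages []Message }

// recordResponse notifies observers unconditionally, persists only non-empty
// messages, and returns the finish reason so terminal handling still runs.
func recordResponse(s *Session, resp Response, onMessage func(Message)) string {
	onMessage(resp.Message) // observers still see the empty response

	if len(resp.Message.Content) > 0 { // the guard: skip the append when empty
		s.Messages = append(s.Messages, resp.Message)
	}

	return resp.FinishReason
}

func main() {
	s := &Session{}
	events := 0
	observe := func(Message) { events++ }

	// Truncated before any content block was emitted: nothing is persisted.
	reason := recordResponse(s, Response{
		Message:      Message{Role: "assistant"},
		FinishReason: "length",
	}, observe)
	fmt.Println(len(s.Messages), events, reason)

	// Normal response: must still be persisted (the guard must not over-fire).
	recordResponse(s, Response{
		Message:      Message{Role: "assistant", Content: []Part{{Text: "hi"}}},
		FinishReason: "stop",
	}, observe)
	fmt.Println(len(s.Messages), events)
}
```

Keeping the check next to the append means every provider gets the same behaviour without each mapper having to know about session state.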

Provider response mappers do not error on empty content. Empty content is a real wire state; the FinishReason already encodes it correctly. Erroring at the mapper would swallow the Length / ContentFilter signal that the agent loop needs to terminate cleanly. Earlier iteration of this PR enforced the invariant at the mapper — reverted in favour of the session-store guard.
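
To illustrate why the mapper should pass empty content through, here is a hedged sketch of a response mapper that never errors on an empty block list and always carries the finish reason. `wireResponse`, `Response`, `mapResponse`, `mapFinishReason`, and the string values are all hypothetical; the SDK's real mapper types and constants are not shown in this PR.

```go
package main

import "fmt"

// FinishReason mirrors the terminal reasons named in this PR. The string
// values are illustrative, not the SDK's actual constants.
type FinishReason string

const (
	FinishStop          FinishReason = "stop"
	FinishLength        FinishReason = "length"
	FinishContentFilter FinishReason = "content_filter"
)

// wireResponse is a hypothetical, simplified provider payload.
type wireResponse struct {
	StopReason string
	Texts      []string
}

// Response is a hypothetical mapped result.
type Response struct {
	Content      []string
	FinishReason FinishReason
}

// mapFinishReason translates illustrative wire stop reasons.
func mapFinishReason(stop string) FinishReason {
	switch stop {
	case "max_tokens":
		return FinishLength
	case "refusal":
		return FinishContentFilter
	default:
		return FinishStop
	}
}

// mapResponse converts the wire payload without rejecting empty content, so
// the caller can still act on the finish reason.
func mapResponse(w wireResponse) Response {
	resp := Response{FinishReason: mapFinishReason(w.StopReason)}
	for _, t := range w.Texts {
		if t != "" {
			resp.Content = append(resp.Content, t)
		}
	}
	return resp
}

func main() {
	// max_tokens hit before any block: empty content, but the reason survives.
	resp := mapResponse(wireResponse{StopReason: "max_tokens"})
	fmt.Println(len(resp.Content), resp.FinishReason)
}
```

Had `mapResponse` returned an error here, the caller would only see a generic failure and lose the Length signal it needs to stop cleanly.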

ToolRequest.ArgumentsAsObject() in llm/part.go. Decodes Arguments into a JSON object, falling back to an empty object on nil bytes or invalid JSON. The paired tool_result already carries the original parse error so the model has context to retry instead of every replay dying in mapping.

WIP scope. Anthropic and OpenAI Compatible request mappers still call json.Unmarshal directly on tool arguments and will fail on truncated state. The wedged-session reproducer in providers/anthropic/request_mapper_test.go is currently red and pins the next step.

References

#116: fix: surface truncation instead of corrupting state when tool_use is cut off
docs/empty-content-persistence.md — design notes on where the invariant should live

Sibling case to PR #116. Where #116 dealt with corrupt content (truncated tool args during streaming), this addresses absent content: empty assistant messages landing in session state, where Anthropic and most other providers reject any subsequent replay with `messages.N.content: Field required` and the conversation is permanently wedged.

## Why

#116 didn't introduce a new bug; it exposed a pre-existing one. Pre-#116, a stream cut off mid-tool-args left partial JSON in `Content`, so the message was non-empty and the persistence path was fine; the corruption surfaced differently (the JSON-parse-error wedge that #116 already covers). Post-#116, the partial tool_use is correctly dropped, but if it was the only block, `Content` is empty, and the agent loop appends that empty `Message` to session state, locking the session forever.

Sources of empty content we have observed:

- max_tokens hit before any content block was emitted
- the only content block was a partial tool_use that stream finalisation correctly dropped (#116)
- refusal / safety filter with no text emitted

In all of these the FinishReason carries the truth (Length, ContentFilter, etc.) and the agent loop's existing terminal-reason handling fires. The bug was that the persistence step at line 273 ran *before* the terminal-reason check at line 285, so the empty Message landed in session state regardless of how cleanly the loop otherwise terminated.

Discovered after #116 deployed: an agent session that had been running for ~30 minutes started failing every call with `messages.15.content: Field required`. Stayed wedged across pod restarts because the empty Message was already persisted.

## Fix

One-line guard at the session-store boundary (agent/llmagent/llmagent.go:273): skip persistence when `len(resp.Message.Content) == 0`. The MessageEvent still fires (observers see what happened) and the FinishReason still propagates (terminal-reason handling fires below); only persistence is skipped.

Same shape as adk-go's `AppendEvent` honouring `Event.Partial`. The session-store boundary is the single chokepoint for "what gets persisted": guard it there, don't try to enforce the invariant in every provider.

Provider response mappers are unchanged. Empty content is a real state in the wild (truncation, refusal, dropped partial blocks); the FinishReason already encodes it correctly. Erroring at the provider would swallow the Length / ContentFilter signal that the agent loop needs to terminate cleanly.

## Tests

agent/llmagent/empty_response_test.go locks in:

- empty Message not persisted to session
- non-empty Message still persisted (negative control: guard must not over-fire)
- FinishReason still propagates (Length terminal handling fires)
- MessageEvent still fires (observers see the empty response)

## References

PR #116: surface truncation instead of corrupting state when tool_use is cut off

…ng at providers

Two related changes that together let wedged sessions recover and narrow provider responsibility:

1. ToolRequest.ArgumentsAsObject(): decode Arguments into a JSON object, falling back to an empty object on nil bytes or invalid JSON. Lets request mappers feed provider APIs (Anthropic tool_use.input, Bedrock tool_use.Input, Gemini function_call.args) without faceplanting on truncated streaming state. The paired tool_result already carries the original parse error so the model has context to retry.
2. Drop empty-content error in provider response mappers. Empty content is a real wire state (max_tokens before any block, refusal, dropped partial tool_use). FinishReason already encodes it correctly; erroring at the mapper swallowed Length / ContentFilter signals the agent loop needs. The session-store guard added in the previous commit is the right place for the invariant.

Bedrock and Gemini request mappers switch to ArgumentsAsObject(). Anthropic and OpenAI Compatible request mappers still use raw json.Unmarshal (TODO before lift).

Tests:

- llm/part_test.go: ArgumentsAsObject contract
- providers/anthropic/request_mapper_test.go: wedged-session replay reproducer (currently red; anthropic request mapper not yet switched over)

WIP: anthropic and openaicompat request mappers still need to adopt ArgumentsAsObject for the wedged-session replay path to work end-to-end.

blacksmith-sh · 2026-04-30T07:21:45Z

Found 2 test failures on Blacksmith runners:

Failures

Test	View Logs
`github.com/redpanda-data/ai-sdk-go/plugins/retry/TestRequestMapper_EmptyToolArguments`	View Logs
`github.com/redpanda-data/ai-sdk-go/plugins/retry/TestRequestMapper_WedgedSessionReplay`	View Logs

birdayz force-pushed the jb/empty-content-poisoning branch 8 times, most recently from d88383c to 99088ec, April 28, 2026 12:05

birdayz added 2 commits April 30, 2026 09:17

birdayz force-pushed the jb/empty-content-poisoning branch from 99088ec to bcbf132, April 30, 2026 07:18

birdayz changed the title ~~fix: surface and stop persisting empty-content responses~~ WIP: stop persisting empty assistant messages, tolerate corrupt tool_use args Apr 30, 2026

birdayz marked this pull request as draft April 30, 2026 07:18

birdayz wants to merge 2 commits into main from jb/empty-content-poisoning
