fix(codex-api): complete chat SSE on finish_reason #9000
Closed
+277
−42
Problem
Windows CI intermittently flakes in `codex-app-server` integration tests that use WireMock to stub Chat Completions streaming (`/v1/chat/completions`). The mock expects multiple model requests, but WireMock often sees only the first request and then panics on drop with verification errors like "expected N requests, got 1".
Root cause
Our Chat Completions SSE parser treated the transport sentinel (`DONE`/`[DONE]`) as the primary completion signal. On Windows/WireMock the HTTP connection can stay open after the model has already indicated completion via `finish_reason` (e.g. `tool_calls`), and the sentinel may not be observed promptly. In that case we never emitted `ResponseEvent::Completed`, so core kept waiting and never issued the follow-up model request.
Fix
Treat `finish_reason` as authoritative for end-of-response and emit `ResponseEvent::Completed` immediately:
- `finish_reason == "tool_calls"`: emit all tool call items (and flush any reasoning), then complete.
- `finish_reason == "stop"`: flush assistant/reasoning, then complete.
- `DONE`/`[DONE]` is still accepted as a compatibility completion signal, but correctness no longer depends on it.
To address correctness/risk concerns, the implementation now includes detailed in-code documentation about these edge cases and ordering guarantees.
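For reviewers, the ordering described in the Fix section can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `Frame`, `Event`, and `drive` are hypothetical, simplified stand-ins (the real parser works on an async byte stream and emits `ResponseEvent` values), and reasoning items are left out entirely.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of one parsed SSE frame.
#[derive(Debug, Clone, PartialEq)]
enum Frame {
    Text(String),
    ToolCall { index: usize, name: Option<String>, arguments: String },
    Finish(String),
    DoneSentinel,
}

/// Hypothetical stand-in for the events the parser emits.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    AssistantMessage(String),
    ToolCall { name: String, arguments: String },
    Completed,
}

/// Consumes frames and returns as soon as a terminal signal is seen.
fn drive(frames: impl Iterator<Item = Frame>) -> Vec<Event> {
    let mut events = Vec::new();
    let mut text = String::new();
    // Tool calls arrive as deltas keyed by index; BTreeMap keeps index order.
    let mut calls: BTreeMap<usize, (String, String)> = BTreeMap::new();

    for frame in frames {
        match frame {
            Frame::Text(delta) => text.push_str(&delta),
            Frame::ToolCall { index, name, arguments } => {
                let entry = calls.entry(index).or_default();
                if let Some(name) = name {
                    entry.0 = name;
                }
                entry.1.push_str(&arguments);
            }
            Frame::Finish(reason) => match reason.as_str() {
                "tool_calls" => {
                    // Emit every accumulated tool call, then complete.
                    for (name, arguments) in std::mem::take(&mut calls).into_values() {
                        events.push(Event::ToolCall { name, arguments });
                    }
                    events.push(Event::Completed);
                    return events; // do not wait for the sentinel or for close
                }
                "stop" => {
                    if !text.is_empty() {
                        events.push(Event::AssistantMessage(std::mem::take(&mut text)));
                    }
                    events.push(Event::Completed);
                    return events;
                }
                // Other finish reasons are outside the scope of this sketch.
                _ => {}
            },
            Frame::DoneSentinel => {
                // Compatibility path: flush whatever is pending, then complete.
                if !text.is_empty() {
                    events.push(Event::AssistantMessage(std::mem::take(&mut text)));
                }
                for (name, arguments) in std::mem::take(&mut calls).into_values() {
                    events.push(Event::ToolCall { name, arguments });
                }
                events.push(Event::Completed);
                return events;
            }
        }
    }
    // The stream ended without any terminal signal: no Completed is emitted.
    events
}
```

The early `return` in the `finish_reason` branches is the point of the sketch: once the terminal reason has been seen, the stream is not polled again, so a connection that stays open cannot delay `Completed`.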
Regression coverage
Added `codex-api` unit tests that simulate the problematic condition (a stream that emits a single JSON event with `finish_reason` and then never closes / never sends `DONE`/`[DONE]`), including a multi-choice `tool_calls` chunk. These tests ensure we still emit a terminal `Completed` event and return without relying on sentinel/close.
Background doc: `docs/ci/windows-wiremock-chat-sse-flake.md`.
Tests
- `cargo test -p codex-api --lib`
- `cargo test -p codex-app-server codex_message_processor_flow`
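The condition under Regression coverage (one event carrying `finish_reason`, then a stream that never closes and never sends the sentinel) can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the tests added in this PR: `consume` and `completes_without_sentinel` are hypothetical names, a channel whose sender is kept alive stands in for the open HTTP connection, and `finish_reason` is detected by crude substring matching rather than JSON parsing.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Event {
    Completed,
}

/// Hypothetical consumer: reads `data:` payloads until a terminal signal.
fn consume(payloads: mpsc::Receiver<String>) -> Vec<Event> {
    let mut events = Vec::new();
    // `recv` blocks until a payload arrives or every sender is dropped.
    while let Ok(payload) = payloads.recv() {
        // Substring matching is a simplification for this sketch only.
        let finished = payload.contains("\"finish_reason\":\"tool_calls\"")
            || payload.contains("\"finish_reason\":\"stop\"");
        if finished || payload == "[DONE]" {
            events.push(Event::Completed);
            break; // return without waiting for the stream to close
        }
    }
    events
}

/// Sends one payload, keeps the "connection" open, and reports whether the
/// consumer finished on its own within the timeout.
fn completes_without_sentinel(payload: &str) -> Option<Vec<Event>> {
    let (tx, rx) = mpsc::channel::<String>();
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = done_tx.send(consume(rx));
    });
    tx.send(payload.to_string()).unwrap();
    // `tx` is still alive here, so from the consumer's side the stream is open.
    let result = done_rx.recv_timeout(Duration::from_millis(500)).ok();
    drop(tx); // let the consumer thread exit in the timeout case
    result
}
```

A payload whose `finish_reason` is `"tool_calls"` should yield `Some(vec![Event::Completed])`, while a payload with `"finish_reason":null` should time out and yield `None`, which is the hang the old sentinel-only behaviour produced.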