@joshka-oai joshka-oai commented Jan 9, 2026

Problem

Windows CI intermittently flakes in codex-app-server integration tests that use WireMock to stub Chat Completions streaming (/v1/chat/completions). The mock expects multiple model requests, but WireMock often only sees the first request and then panics on drop with verification errors like "expected N requests, got 1".

Root cause

Our Chat Completions SSE parser treated the transport sentinel (DONE/[DONE]) as the primary completion signal. On Windows/WireMock the HTTP connection can stay open after the model already indicates completion via finish_reason (e.g. tool_calls), and the sentinel may not be observed promptly. That means we didn’t emit ResponseEvent::Completed, so core kept waiting and never issued the follow-up model request.

Fix

Treat finish_reason as authoritative for end-of-response and emit ResponseEvent::Completed immediately:

  • finish_reason == "tool_calls": emit all tool call items (and flush any reasoning), then complete.
  • finish_reason == "stop": flush assistant/reasoning, then complete.
  • Still accept DONE/[DONE] as a compatibility completion signal, but correctness no longer depends on it.

To address correctness/risk concerns, the implementation now includes detailed in-code documentation about these edge cases and ordering guarantees.

Regression coverage

Added codex-api unit tests that simulate the problematic condition (a stream that emits a single JSON event with finish_reason and then never closes / never sends DONE/[DONE]), including a multi-choice tool_calls chunk. These tests ensure we still emit a terminal Completed event and return without relying on sentinel/close.

Background doc: docs/ci/windows-wiremock-chat-sse-flake.md.

Tests

  • cargo test -p codex-api --lib
  • cargo test -p codex-app-server codex_message_processor_flow

@joshka-oai joshka-oai force-pushed the joshka/fix-windows-chat-sse-finish-reason branch 2 times, most recently from a9c4f86 to ba78a38 on January 9, 2026 21:30
Windows CI flaked in codex-app-server integration tests that use WireMock to
stub the Chat Completions streaming endpoint (/v1/chat/completions).

The mock responses include tool calls and a trailing DONE sentinel, but on
Windows/WireMock the HTTP connection can remain open after the model has
already indicated completion via finish_reason (e.g. "tool_calls"). Our SSE
parser emitted FunctionCall items but didn't emit ResponseEvent::Completed until
it observed the DONE/[DONE] sentinel, so core kept waiting for completion and
never issued the follow-up model request. WireMock then failed verification with
"expected N requests, got 1".

Treat finish_reason as authoritative for end-of-response:
- On finish_reason == "tool_calls", flush reasoning/tool calls and emit
  Completed immediately.
- On finish_reason == "stop", flush assistant/reasoning and emit Completed.
- Keep accepting DONE/[DONE] as a terminal sentinel for compatibility.

Added a short doc describing the Windows/WireMock failure mode and rationale.

Tests:
- cargo test -p codex-api --lib
- cargo test -p codex-app-server codex_message_processor_flow
@joshka-oai joshka-oai force-pushed the joshka/fix-windows-chat-sse-finish-reason branch from ba78a38 to b1cd91d on January 9, 2026 21:36
@joshka-oai (Collaborator, Author)

Added better docs and a regression test to show where the problem occurred in our current approach (and to ensure it doesn't happen again).

@celia-oai (Collaborator)

Does this apply to the Responses API too? If so, can we update the PR description?

I'm prob not the best person to review this corner of the codebase - will leave the review to @pakrym-oai

@joshka-oai joshka-oai closed this Jan 9, 2026
@joshka-oai joshka-oai deleted the joshka/fix-windows-chat-sse-finish-reason branch January 9, 2026 23:44