[Bugfix]: missing partial content if openai tool calling is enabled#28122

Open
dr75 wants to merge 3 commits into vllm-project:main from dr75:gpt-oss-tools-parser

Conversation

@dr75
Contributor

@dr75 dr75 commented Nov 5, 2025

Purpose

When enabled, the OpenAIToolParser overrides the regular response parsing path. This breaks partial responses: when the token limit is reached, harmony does not generate a final message and only the parser's current_content remains, which is silently dropped.

To solve this, return the parser's current_content, as is already done in the non-tool-calling case.

The code should probably be refactored so that the logic from parse_chat_output() is not duplicated here; most of the parsing is already done there and repeated here. Leaving that refactoring for a separate PR.
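The fallback described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual vLLM diff: the class and attribute names (Message, messages, current_content) are stand-ins for the harmony parser state.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for the harmony parser state; names are assumptions.
@dataclass
class Message:
    channel: str
    content: str

@dataclass
class FakeParser:
    messages: list = field(default_factory=list)
    current_content: str = ""

def extract_final_content(parser) -> str:
    """Return the final-channel message if harmony produced one; otherwise
    fall back to the parser's partial content, mirroring the
    non-tool-calling path."""
    finals = [m for m in parser.messages if m.channel == "final"]
    if finals:
        return "".join(m.content for m in finals)
    # Generation was cut off (e.g. max_tokens reached) before a final
    # message was emitted: return the accumulated partial content instead
    # of dropping it.
    return parser.current_content
```

The key point is the last line: before the fix, the tool-calling path returned nothing when no final message existed, even though partial content had been generated and paid for.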

Test Plan

  • Manually tested chat requests with limited max_tokens while the parser is enabled.
  • Added test cases.
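The manual test amounts to a chat completion request with a small token limit against a vLLM server with the tool parser enabled. A sketch of such a request payload (model name and endpoint are assumptions; adjust to your deployment):

```python
import json

# Hypothetical request payload; "openai/gpt-oss-20b" is an assumed model name.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [
        {"role": "user",
         "content": "Tell me something about confidential computing."}
    ],
    # Small limit so generation stops inside the reasoning channel.
    "max_completion_tokens": 5,
}

# POST this to a running vLLM server, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d @payload.json
print(json.dumps(payload, indent=2))
```

Before the fix, such a request returned empty content when the limit was hit before the final message; after the fix, the partial content generated so far is returned.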

Test Result

max_completion_tokens=5, stopping in reasoning

--- Request: Tell me something about confidential computing. ---
(Need to)

max_completion_tokens=50, stopping in final message

--- Request: Tell me something about confidential computing. ---
(Need explain concept.)

### What is Confidential Computing?

Confidential computing is a set of hardware‑enforced

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a bug where partial content was being dropped when OpenAI tool calling is enabled and the token limit is reached. The fix correctly captures the parser.current_content for partial final messages, aligning its behavior with the non-tool-calling path. The addition of specific test cases for partial responses, both with and without tool calls, is excellent and ensures the fix is well-verified. The code change is logical and directly solves the described issue.

@bbrowning
Contributor

Do you know if this is also an issue in the streaming tool calling case? Or whether we need a potentially separate fix there?

@dr75
Contributor Author

dr75 commented Nov 5, 2025

This is handled properly in the streaming case: tokens from current_content are sent while streaming, so when generation stops because max tokens is reached, everything streamed so far has already been received by the client.

@dr75
Contributor Author

dr75 commented Nov 5, 2025

@bbrowning, here is a streaming issue fix that is related but different:
#28139

@bbrowning
Contributor

Great, thanks for the clarification!

@dr75
Contributor Author

dr75 commented Nov 28, 2025

@heheda12345, @yeqcharlotte, it would be great if someone could review.

Collaborator

@yeqcharlotte yeqcharlotte left a comment


hey @dr75 thanks for the fixes. wonder if you could also double-confirm the behavior against the official OAI implementation? will they return partial final messages when the limit is reached? also please fix the pre-commit.

@dr75 dr75 force-pushed the gpt-oss-tools-parser branch from 1174eb2 to 738ffd4 on December 10, 2025 at 13:00
@dr75
Contributor Author

dr75 commented Dec 10, 2025

@yeqcharlotte the CI failure was unrelated. I rebased; it seems fine now.

Will try with the OpenAI API to confirm the behaviour.

@dr75
Contributor Author

dr75 commented Dec 10, 2025

Actually, the OpenAI API doesn't specify the behaviour of responses cut off by max_completion_tokens for reasoning models. The actual implementation shows the same behaviour as vLLM (before this change) and is inconsistent:

  • with gpt-4.1 (non-reasoning) I get the incomplete response
  • with gpt-5.1 (reasoning) I get an empty response
  • with gpt-5.1 (reasoning with streaming) I get the incomplete response

Given that a user pays for generating such an incomplete response but doesn't receive the generated tokens, this seems incorrect. For very long responses, such as long summaries of documents, it also renders max_completion_tokens nearly useless: a user risks getting nothing if the limit is too low and therefore has to specify a very high limit. It is also confusing, as the first assumption is that the model spent all its time on reasoning. A workaround is to switch to streaming responses, where the partial response is returned.

Since the problem only appears for reasoning models in the non-streaming case, I think it is an issue in the OpenAI implementation.

I think we should provide a more consistent implementation. Also, the problem does not occur when tool calling is disabled, which makes it even more inconsistent.

@yeqcharlotte , wdyt?

@dr75 dr75 requested a review from yeqcharlotte December 10, 2025 16:32
dr75 added 3 commits December 16, 2025 13:56
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
@dr75 dr75 force-pushed the gpt-oss-tools-parser branch from 738ffd4 to 006b024 on December 16, 2025 at 13:58
@mergify

mergify bot commented Dec 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dr75.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 21, 2025
@mergify mergify bot added the bug Something isn't working label Jan 14, 2026