[Bugfix]: missing partial content if openai tool calling is enabled #28122
dr75 wants to merge 3 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a bug where partial content was dropped when OpenAI tool calling is enabled and the token limit is reached. The fix correctly captures the parser's current_content for partial final messages, aligning its behavior with the non-tool-calling path. The addition of specific test cases for partial responses, both with and without tool calls, ensures the fix is well verified. The code change is logical and directly solves the described issue.
Do you know if this is also an issue in the streaming tool calling case? Or whether we need a potentially separate fix there?

This is properly handled in the streaming case, as tokens from current content are sent while streaming and when

@bbrowning, here is a streaming issue fix that is related but different:

Great, thanks for the clarification!

@heheda12345, @yeqcharlotte, would be great if someone could review.
yeqcharlotte left a comment
hey @dr75 thanks for the fixes. wonder if you could also double-confirm the behavior with the official OAI implementation? will they return partial final messages when the limit is reached? also please fix the pre-commit.
1174eb2 to 738ffd4
@yeqcharlotte the CI failure was unrelated. I rebased, seems fine now. Will try with the OpenAI API to confirm the behaviour.
Actually, the OpenAI API doesn't specify the behaviour of responses cut off due to the token limit.

Given that a user has to pay for generating such an incomplete response but doesn't get the generated tokens, it seems incorrect. Considering very long responses, such as long summaries of documents, this makes the behaviour even more problematic.

As the problem only appears for reasoning models in the non-streaming case, I think it is an issue in the OpenAI implementation and that we should provide a more consistent one. Also, the problem does not occur when tool calling is disabled, making it even more inconsistent. @yeqcharlotte, wdyt?
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
738ffd4 to 006b024
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
The OpenAIToolParser overrides the regular response parsing when enabled. This breaks partial responses when the token limit is reached and harmony does not generate a final message; only current_content is left, which is ignored.

To solve this, return the parser's current_content, as is also done in the non-tool-calling case.

The code should probably be refactored so that the logic from parse_chat_output() is not duplicated here. Actually, most of the parsing is already done there and repeated here. Leaving this refactoring for a separate PR.
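The fallback described above can be sketched as follows. This is a minimal illustration, not the actual vLLM code: the `Parser` stand-in and the `extract_final_content` helper are hypothetical, and only the `current_content` attribute is taken from the PR description.

```python
class Parser:
    """Hypothetical stand-in for the harmony tool-parser state."""

    def __init__(self, messages, current_content):
        self.messages = messages                # fully parsed final messages
        self.current_content = current_content  # partial text when generation is cut off


def extract_final_content(parser):
    """Return final content, falling back to partial content.

    Before the fix: only completed final messages were returned, so a
    response cut off by the token limit yielded empty content.
    After the fix: fall back to the partial current_content, matching
    the non-tool-calling path.
    """
    final = "".join(parser.messages)
    if not final and parser.current_content:
        return parser.current_content
    return final
```

With this fallback, a response that hits the token limit mid-message returns the partial text instead of an empty string.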
Test Plan
Test Result

- max_completion_tokens=5, stopping in reasoning
- max_completion_tokens=50, stopping in final message
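The two test cases above can be sketched as chat-completions payloads. This is a hypothetical reproduction sketch: the model name, prompt, and tool definition are assumptions, not taken from the PR.

```python
def build_request(max_completion_tokens):
    """Build a tool-calling chat-completions payload that forces an early cutoff."""
    return {
        "model": "gpt-oss",  # assumed; any harmony-format model applies
        "messages": [
            {"role": "user", "content": "Summarize the attached document."}
        ],
        "tools": [
            {
                "type": "function",
                "function": {"name": "lookup", "parameters": {"type": "object"}},
            }
        ],
        "max_completion_tokens": max_completion_tokens,
    }


# Case 1: limit hit while still reasoning (no final message started).
case_reasoning = build_request(5)

# Case 2: limit hit inside the final message (partial content exists).
case_final = build_request(50)
```

With the fix, both requests should return whatever partial text was generated rather than an empty content field.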