[Frontend][Bug] allow tool calls in analysis channel #28139
chaunceyjiang merged 9 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a bug where tool calls in the analysis channel were not being handled by the gpt-oss streaming parser. The fix correctly extends the tool call handling logic to include the analysis channel, in addition to the commentary channel. This is a robust solution that aligns with the OpenAI Harmony documentation. The addition of a complex streaming test case that reproduces the issue is a great way to ensure the fix is effective. However, I have one concern regarding the resource requirements of the new test, which I've detailed in a specific comment.
Code Review
This pull request addresses a bug where tool calls in the analysis channel for gpt-oss models were not being handled correctly by the streaming parser. The proposed solution correctly extends the existing tool call parsing logic for the commentary channel to also include the analysis channel. The change in vllm/entrypoints/openai/serving_chat.py is well-targeted and correctly re-prioritizes the logic to check for tool calls in the analysis channel before treating it as reasoning content. A new complex streaming test case has been added in tests/entrypoints/openai/test_serving_chat.py, which effectively validates the fix. The changes are sound and I have not identified any issues of high or critical severity.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Hrm I feel like the issue here is actually that we are putting these tool calls on the analysis channel in the first place. Do you know why that happens, versus it always being on the commentary channel? I couldn't figure it out from a quick look at the code.
See the OpenAI docs and the limited docs and comments in the harmony repo. The problem is that you cannot force where the model puts the recipient, unless you use some guided decoding. For that reason, harmony allows for both. While it usually appears on the commentary channel, rendering previous tool calls to the analysis channel (which is the choice the harmony lib made; see its comments) causes the model to switch as well. But in any case that wouldn't be a solution for every case, as internal tools seem to be emitted on the analysis channel (see the OpenAI docs). As long as you don't get a different answer from OpenAI on the respective Harmony PR, this fix here seems to be the right solution to me. And I guess other frameworks are doing it the same way, but I didn't check.
BTW I completely agree this is a problem in general, I just want to make sure we solve it the right way.
Hrm I'm a bit confused: on the input path, the channel a tool call is rendered onto is decided by our code, which should always be commentary ATM. The more nefarious thing here is that any message to the

I think for now the most reasonable way to do it is that on the input path, if the tool name starts with

I think if recipient is set on the harmony message on the output path, it is fair to treat it as a tool call no matter what; doesn't matter which channel it is on for sure though.
You are right, and my description was not really correct in mixing the rendering with the output. The rendering "issue" (not sure if it's really an issue) is that harmony renders the recipient before the channel token. So when we use a commentary message, it will first render the recipient and then the channel token, as sketched below. This seems to be what leads to the model switching to output tools in the analysis channel (well, I am speculating here, so maybe that's not right) and what this PR is trying to solve.

However, even if we could change that, tools may still appear in any channel and we would want to handle them in any case. So it doesn't affect this PR, I would say; I am only attempting to explain what might be happening here.

Also, I don't know if such a change to the rendering is a good idea, as it may affect model performance (assuming harmony was used during training; maybe negligible impact, maybe not) and prefix caching (if the model constantly outputs x and harmony renders y, then the cache is gone from that location). So a decision whether to change it is more complex and should take that into account.
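For illustration, the two header layouts under discussion might look roughly as follows (the tool name and arguments are made up for this sketch, and the exact spacing may differ from what harmony actually emits):

```
recipient rendered before the channel token:
<|start|>assistant to=functions.get_weather<|channel|>commentary <|constrain|>json<|message|>{"city": "Berlin"}<|call|>

recipient rendered after the channel token:
<|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"city": "Berlin"}<|call|>
```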
Didn't know that, but I think
Sounds reasonable to me.
Makes sense! Do you mean changing the solution here or keeping it as is? Also see the code for non-streaming in OpenAIToolParser, which is what you describe.
Based on my observations from various tests (although I'm not certain), I believe that built-in tool calls and function tool calls are distinguished by whether the tool invocation is written before or after the channel token. For example, the structure '<|start|>assistant to=' calls built-in tools, while the structure '<|channel|>commentary to=' proceeds with function tool calls. It appears that confusion occurs when function tools are provided to the gpt-oss model using the built-in tool call structure.
Ah, thanks @levunet, that might explain things more: the built-in tools are in the analysis channel (as per the docs) and are usually invoked with the recipient before the channel token, as you found out. If harmony now renders all tool calls that way without distinguishing, then the model may also switch to placing the invocation before the channel token and, because of that, also switch to emitting tool calls in the analysis channel. That would mean the ideal tool rendering (for the current model) would also distinguish by channel and place the recipient accordingly, keeping the output from the model as is (avoiding the prefix cache cut-off for that message) and avoiding the model confusion. Probably not a very important fix, but it could be useful. Given that it only affects tool calls, the impact on model performance might be very small or not there at all. In vLLM we might then also make this distinction and generate analysis messages for the internal tools (all being commentary atm, I assume); see the sketch below.
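Hypothetically, that rendering-side distinction could look like the sketch below, assuming the harmony convention that function tools are namespaced under functions. while built-ins such as browser.* or python are not; pick_channel is an invented helper here, not an existing harmony or vLLM API:

```python
# Namespaces of built-in tools; per the OpenAI docs these live on the
# analysis channel. (The exact set is an assumption for this sketch.)
BUILTIN_NAMESPACES = {"browser", "python"}


def pick_channel(recipient: str) -> str:
    """Pick the channel a historical tool call is rendered onto."""
    namespace = recipient.split(".", 1)[0]
    if namespace in BUILTIN_NAMESPACES:
        return "analysis"    # built-in tool invocation
    return "commentary"      # developer-defined function tool
```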
This pull request has merge conflicts that must be resolved before it can be merged.
@dr75 There seem to be some merge conflicts! Also, is this ready for review?
7f68de3 to 73f5121
73f5121 to 4540364
Hey @bbartels! I resolved the conflicts and made some changes:
So it's ready for review.
Great, thanks! I can't review myself, but I asked in the vLLM Slack for a maintainer review!
70ca2ab to ef82054
/cc @yeqcharlotte @qandrew PTAL.
@dr75 could you fix the CI issues? Thanks!
CI failures seem to be unrelated; they all look like this:
ef82054 to 4fcb6d2
This pull request has merge conflicts that must be resolved before it can be merged.
4fcb6d2 to 4c6daaf
This pull request has merge conflicts that must be resolved before it can be merged.
4c6daaf to 6a061f5
The recent test failure is in entrypoints/openai/test_response_api_with_harmony.py::test_function_calling. Locally the test passes, so the function call parsing failure is due to unstable model behaviour.
@qandrew, @chaunceyjiang, @yeqcharlotte, can you please help with merging this PR?
Thanks @dr75 |
Purpose
The gpt-oss streaming parser does not handle tool calls in the analysis channel. However, those calls can happen as explained here.
This especially happens when harmony renders the recipient in previous tool calls before the channel token, "confusing" the model, which then returns tool calls in the analysis channel. However, the harmony parser is built to handle both cases, such that the vLLM parser should also support both.
The fix proposed in the respective harmony PR aims at changing the message rendering in harmony (when converting the history to tokens) to avoid confusing the model. While this fixes the issue, it seems to not address the root cause: even with such a change, the model may emit tool calls in the analysis channel, as described in the OpenAI docs.
To solve this, I propose to also handle tool calls in the analysis channel; a rough sketch of the resulting rule is shown below.
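For illustration only, a minimal sketch of that rule, not the actual vLLM code (HarmonyMessage and classify are invented names for this example):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarmonyMessage:
    # Invented stand-in for a parsed harmony message header.
    channel: str              # "analysis", "commentary", or "final"
    recipient: Optional[str]  # e.g. "functions.get_weather"; None if unset
    content: str


def classify(msg: HarmonyMessage) -> str:
    """Decide how a parsed harmony message should be surfaced."""
    # A set recipient marks a tool call regardless of the channel it
    # arrived on; checking this before the channel check is the
    # re-prioritization this PR applies to the analysis channel.
    if msg.recipient is not None:
        return "tool_call"
    # Without a recipient, analysis content is reasoning content ...
    if msg.channel == "analysis":
        return "reasoning"
    # ... and everything else is regular assistant content.
    return "content"
```

With this ordering, a tool call streamed on the analysis channel surfaces as a tool call instead of being misfiled as reasoning text, while plain analysis messages still come back as reasoning.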
To review the changes, you may start at the first commit, which is the actual fix. The following commits are refactoring and testing changes, as described below.
Test Plan
Moved the streaming harmony handling from serving_chat.py to a new file serving_chat_stream_harmony.py and added a unit test exercising this part in isolation, with a test case for tools in the analysis channel.
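As a rough sketch of what that isolated test case could look like, again using the hypothetical classify helper from above rather than the real parser API:

```python
def test_tool_call_in_analysis_channel():
    # The regression: a tool call emitted on the analysis channel must
    # surface as a tool call, not be swallowed as reasoning content.
    msg = HarmonyMessage(
        channel="analysis",
        recipient="functions.get_weather",
        content='{"city": "Berlin"}',
    )
    assert classify(msg) == "tool_call"


def test_plain_analysis_message_is_reasoning():
    # Analysis content without a recipient is still reasoning content.
    msg = HarmonyMessage(channel="analysis", recipient=None,
                         content="thinking it through...")
    assert classify(msg) == "reasoning"
```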