[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls #25514
mgoin merged 2 commits into vllm-project:main
Conversation
This fixes an issue identified in vllm-project#22337 that was not entirely fixed by vllm-project#22386 around tool call parsing in gpt-oss models in the non-streaming case and the model output returned to the user.

The main change here is to remove the parsed tool call out of the ChatCompletion message `content`, so that the generated tool call only shows up in its parsed form as an entry in `tool_calls` instead of in both `tool_calls` and `content`. A small related change is to ensure we're not sending newline characters in the JSON arguments string of the parsed tool call, since we don't do this for other tool calls.

Together these should get non-streaming tool calls via the Chat Completions API working for gpt-oss models for typical use cases. There may still be some edge cases the openai_tool_parser.py and/or serving_chat.py code need to handle, but this at least closes some known gaps.

A couple of unit tests were tweaked here to test for both of these fixes. I ensured the tests failed before my fixes, and that they now pass with the fixes.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Code Review
This pull request effectively addresses a bug in non-streaming tool call parsing for gpt-oss models by ensuring tool call data is not duplicated in the content field and by normalizing the JSON in tool call arguments. The changes in the tests and application logic are sound. However, I've identified a critical issue in openai_tool_parser.py where the new JSON parsing logic could lead to server crashes due to unhandled exceptions for empty or invalid JSON content. I've provided a suggestion to make the implementation more robust.
This adds a few more edge case tests and more defensive checks in openai_tool_parser.py to ensure we handle cases where the model may produce invalid JSON or not JSON at all. This also adds a test to verify we properly return the final content of the response if one was supplied - that worked before, but there was no test to confirm that. And a few more tests for edge cases like no arguments or empty arguments handling.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
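In spirit, the defensive checks treat the arguments text as untrusted model output and fall back to passing it through rather than raising; a minimal sketch of that idea (the function name is illustrative only, not the actual openai_tool_parser.py code):

```python
import json


def safe_normalize_arguments(raw: str) -> str:
    """Best-effort compaction of a tool call's arguments string.

    If the model produced something that isn't valid JSON (or is empty),
    return it unchanged instead of raising and failing the whole request.
    """
    if not raw or not raw.strip():
        return ""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return raw
    if not isinstance(parsed, dict):
        # A bare string, number, or list isn't a useful arguments object;
        # pass it through untouched as well.
        return raw
    return json.dumps(parsed)


# Examples of the edge cases mentioned above:
print(safe_normalize_arguments('{\n  "city": "Berlin"\n}'))  # {"city": "Berlin"}
print(safe_normalize_arguments(""))                          # empty arguments -> ""
print(safe_normalize_arguments("not json"))                  # passed through as-is
```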
I pushed an additional commit here to add more tests and checks of the openai tool parser, handling cases such as non-JSON content types, invalid JSON, empty arguments, no arguments, and tool_call responses that also return final content.
@bbrowning Non-streaming seems to be working great! But in streaming mode (not sure if this is supposed to fix it) tool calling doesn't work most of the time :-(
@alew3 This PR did not change the streaming code, but I'm more than happy to take a look. Can you point me to an issue or open a new one with what you're seeing on the streaming side of things?
Thanks for this, it is really helpful!! @bbrowning here's an example:

The curl:

The response:

It is weird because sometimes it works and sometimes it doesn't, so I think there is something breaking the parse.
@PedroMiolaSilva Thanks, looking into that. For me, your example consistently works fine with gpt-oss-120b but fails more often than not on gpt-oss-20b. On the surface, that makes me suspect the smaller model is outputting something wrong in the 20b case, but I will dig a bit further to identify exactly what's happening.
Ok, the root of the issue is that with gpt-oss-20b it is sometimes emitting the tool call on the
#24768 should fix this issue, or at least fixes it in my local testing. The root of the problem is that in Chat Completions we are not properly enabling the
@bbrowning thanks a lot, will test it here also! Could you share your command so that I can try to reproduce?
@PedroMiolaSilva Sure - I've been testing this with a very simple script like https://gist.github.com/bbrowning/f892ebd29ebb297643633de47e0c8197 that takes your provided tool definitions, messages, and request params and just sends those to a vLLM running locally. It then collects the streaming content, logging the reasoning content, regular content, and any tool calls (which I collect into a dict, so they output as a dict of function name and args string). When things failed, the tool call was missing or stuffed into the reasoning text. Here's an example of the valid output I look for:
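For readers who don't want to open the gist, here is a rough, hypothetical sketch of that kind of collection loop using the standard openai client against a local vLLM server (the model name, URL, and tool schema below are placeholders, not copied from the gist):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    stream=True,
)

reasoning, content, calls = [], [], {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # vLLM surfaces reasoning text on a non-standard reasoning_content field.
    if getattr(delta, "reasoning_content", None):
        reasoning.append(delta.reasoning_content)
    if delta.content:
        content.append(delta.content)
    for tc in delta.tool_calls or []:
        # Tool call deltas arrive in fragments keyed by index; the name shows
        # up once and the arguments string accumulates across chunks.
        entry = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

print("reasoning:", "".join(reasoning))
print("content:", "".join(content))
print("tool calls:", {c["name"]: c["arguments"] for c in calls.values()})
```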
Purpose
This fixes an issue identified in #22337 that was not entirely fixed by #22386 around tool call parsing in gpt-oss models in the non-streaming case and the model output returned to the user.
The main change here is to remove the parsed tool call out of the ChatCompletion message `content`, so that the generated tool call only shows up in its parsed form as an entry in `tool_calls` instead of in both `tool_calls` and `content`. A small related change is to ensure we're not sending newline characters in the JSON arguments string of the parsed tool call, since we don't do this for other tool calls.
Together these should get non-streaming tool calls via the Chat Completions API working for gpt-oss models for typical use-cases. There may still be some edge cases the openai_tool_parser.py and/or serving_chat.py code need to handle, but this at least closes some known gaps.
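As a rough illustration of the two changes (not the actual vLLM code, just the intended before/after behavior on the example used later in the Test Result):

```python
import json

# The raw text the model produced for the tool call's arguments:
raw_arguments = '{\n  "city": "Berlin"\n}'

# Change 2: re-serialize so the arguments string has no newlines/indentation.
compact_arguments = json.dumps(json.loads(raw_arguments))
assert compact_arguments == '{"city": "Berlin"}'

# Change 1: once the tool call is parsed, the message content no longer
# carries the same tool call text; it stays None (or holds only real final text).
message_content = None
```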
Test Plan
A couple of tests were tweaked here to test for both of these fixes. I ensured the tests failed before my fixes, and that they now pass with the fixes.
Here's how I ran the changed tests:
I also manually ran the example shown in #22337, copied here for completeness' sake:
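A minimal non-streaming reproduction along the same lines might look like this (the server URL, model name, and tool schema are assumptions; the city and function name are taken from the Test Result output below, not verbatim from #22337):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is the current weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

# With the fix, the parsed tool call lives only under tool_calls and content is None.
print(response.choices[0].message)
```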
This example was run from a recent main of vLLM with my changes from this PR, as below:
Test Result
All tests in the test commands described above passed.
In addition, I ran this manual test taken from #22337 and it now returns the expected content:
ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-7567964c8f8447e89e64b23eecbc7931', function=Function(arguments='{"city": "Berlin"}', name='get_weather'), type='function')], reasoning_content='User asks for current weather in Berlin. We have a function get_weather. Use it.')

For reference, here are the old (i.e. wrong) results from before my change:

ChatCompletionMessage(content='{\n "city": "Berlin"\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-f229827765c04578ad151be5fb69c79c', function=Function(arguments='{\n "city": "Berlin"\n}', name='get_weather'), type='function')], reasoning_content='User asks for current weather in Berlin. We have a function get_weather. We need to call it with city "Berlin".')

Note that the wrong result includes the function call within the `content` parameter, which it should not. It also includes newline characters within the JSON `arguments` string, which may be okay but is not consistent with the JSON strings we return from other tool call parsers, which don't include any newlines.
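The kind of assertions the tweaked tests make against a response like the one above can be sketched roughly as follows (a stand-in object is used here; the attribute names follow the openai client objects shown above):

```python
import json
from types import SimpleNamespace

# Stand-in for the fixed ChatCompletionMessage shown above.
message = SimpleNamespace(
    content=None,
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="get_weather", arguments='{"city": "Berlin"}'
            )
        )
    ],
)

# The tool call must not be duplicated into the message content.
assert message.content is None
# The arguments string is compact JSON with no embedded newlines...
args = message.tool_calls[0].function.arguments
assert "\n" not in args
# ...and it still parses to the expected payload.
assert json.loads(args) == {"city": "Berlin"}
```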