[Bug] Fix gpt-oss missing tool content #24954
levunet wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request aims to fix a bug in gpt-oss model tool calls by adding 'analysis' content and with_recipient. The changes are logical and align with the stated purpose. However, I've identified a high-severity issue where the handling of 'analysis' content is not robust for multimodal inputs, which could lead to a runtime crash. I have provided a code suggestion to address this.
vllm/entrypoints/harmony_utils.py
The logic to extract content is not robust. The content of an assistant message can be a list of parts (e.g., for multimodal inputs), not just a string. The current implementation content = chat_msg.get("content") or "" will cause a runtime error if content is a non-empty list, as Message.from_role_and_content expects a string. The code should handle the case where content is a list by extracting the text parts, similar to how it's handled for other message types in this file.
content = chat_msg.get("content")
if isinstance(content, list):
    # Extract text from multimodal content
    content = "\n".join(
        p.get("text", "") for p in content
        if isinstance(p, dict) and p.get("type") == "text")
elif not isinstance(content, str):
    content = ""
analysis_msg = Message.from_role_and_content(Role.ASSISTANT, content)
analysis_msg = analysis_msg.with_channel("analysis")
msgs.append(analysis_msg)

Force-pushed from c3780bf to eed8815.
@alecsolder
Force-pushed from eed8815 to 39d6e8e.
In your test script, I see you're using the streaming completions endpoint, which I don't think uses the harmony_utils method you modified? I just want to double-check that I'm reading it right.

Also, thanks for mentioning that Hugging Face tokenizer change, I hadn't seen it and my snapshot was out of date!

Thank you for checking. I've double-checked the part you mentioned.

When will this PR be merged, guys?
Force-pushed from 39d6e8e to e8e7579.
The changes include adding 'with_recipient' and the assistant's 'analysis' content. Without adding this content, there was an issue where the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.

Signed-off-by: kyt <[email protected]>
Force-pushed from e8e7579 to f13f45b.
I was delayed today due to some issues while resolving git conflicts. The issue was caused by FlashInfer 0.4.0, which was added in recent commits and caused responses to repeat infinitely in the gpt-oss model. It took quite some time to find a workaround: VLLM_USE_FLASHINFER_SAMPLER=0. The problem occurs in the sampling process, and generation works normally when the FlashInfer sampler is disabled.
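For reference, a minimal sketch of applying the workaround from Python before vLLM initializes (offline LLM usage here is just one way to do it; the model name is only an example):

import os

# Disable the FlashInfer sampler before vLLM is initialized; this is the
# VLLM_USE_FLASHINFER_SAMPLER=0 workaround mentioned above.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"

from vllm import LLM, SamplingParams  # import after setting the env var

llm = LLM(model="openai/gpt-oss-20b")  # example gpt-oss checkpoint
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)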
Will the fix need both VLLM_USE_FLASHINFER_SAMPLER=0 and this PR, or just the FlashInfer setting?

I think this setting resolves an issue that occurs in a completely different place from the current PR. After testing multiple times, the same problem seems to exist even without the current PR, so this setting can be used separately.
Any progress here?

Unfortunately, the current changes alone are insufficient to fully resolve the bug; the changes from the harmony repo are essential. Since OpenAI hasn't reviewed that PR yet, should I propose pip-installing from my fork branch in the meantime?
This PR fixes a bug that occurred when using tool calling with gpt-oss models. However, it requires additional changes from the openai/harmony repo, which unfortunately hasn't been reviewed for a month now. Would it be acceptable to temporarily modify the requirements to install from my forked harmony branch until the upstream PR is merged?
@levunet

@levunet, could you give an example where tool calling does not work in v0.11.1? I couldn't find any, so I wonder whether this fix and the harmony fix are actually needed. Basically, #24768 makes tool calling work for me for chat completions in streaming mode. However, there are two remaining issues:
OK, I could reproduce the cases now with your script above and confirm that the issue still happens on v0.11.1 and is fixed with this PR and the harmony change.
This PR isn't actually dependent on openai/harmony#76, is it? The fixes here, to include the reasoning text from the previous turn as an input message on the analysis channel and to include the assistant recipient on the tool response, seem reasonable and necessary even without the changes to the harmony library.
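For context, a rough sketch of the two reconstructed Harmony messages being discussed, using the openai_harmony Message helpers the way harmony_utils.py uses them elsewhere (the literal values are illustrative only, borrowed from the test below):

from openai_harmony import Author, Message, Role

# Prior-turn assistant reasoning goes back in on the analysis channel.
analysis_msg = Message.from_role_and_content(
    Role.ASSISTANT, "We'll call get_weather."
).with_channel("analysis")

# The tool result sits on the commentary channel and is addressed back to the assistant.
tool_msg = (
    Message.from_author_and_content(
        Author.new(Role.TOOL, "functions.get_weather"), "20 degrees Celsius"
    )
    .with_channel("commentary")
    .with_recipient("assistant")
)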
I think you are right; that information seems to be just missing at the moment. This is problematic for the prefix cache and probably for the model's reasoning. But I see the problem that we are missing tests that verify the loop model_output -> parse -> encode -> model_output. Without such tests it is hard to say where we are in terms of correctness. Maybe for this PR it's good enough to verify that it works as expected even without the harmony changes. Together with PR #28139, the test script above should work, and so should the temporary test in PR #28139. Both are not working on main. I don't know if there are any tests that verify complex tool calls.
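As a starting point short of that full loop, a much narrower unit-level check of the change in this PR might look something like this (just a sketch; it assumes parse_input_to_harmony_message keeps the signature shown in the diff further down and appends the analysis message first):

from vllm.entrypoints.harmony_utils import parse_input_to_harmony_message


def test_assistant_tool_call_keeps_prior_reasoning():
    chat_msg = {
        "role": "assistant",
        "content": "We'll call get_weather.",
        "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "Paris"}',
                },
            },
        ],
    }
    msgs = parse_input_to_harmony_message(chat_msg)
    # With this PR, the prior reasoning should be re-injected on the analysis
    # channel ahead of the reconstructed tool-call message.
    assert msgs[0].channel == "analysis"
    assert msgs[0].content[0].text == "We'll call get_weather."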
There's another related problem here as well that I identified by stubbing in a test for this. Our tool response ends up looking like this as a Harmony message:

Note the

Here's the example test I'm using that verifies both the changes in this PR as well as uncovered that additional issue:

diff --git a/tests/entrypoints/openai/test_serving_chat.py b/tests/entrypoints/openai/test_serving_chat.py
index dd10384a7..6ac94160c 100644
--- a/tests/entrypoints/openai/test_serving_chat.py
+++ b/tests/entrypoints/openai/test_serving_chat.py
@@ -728,3 +728,112 @@ async def test_serving_chat_data_parallel_rank_extraction():
     # Verify that data_parallel_rank defaults to None
     assert "data_parallel_rank" in mock_engine.generate.call_args.kwargs
     assert mock_engine.generate.call_args.kwargs["data_parallel_rank"] is None
+
+
+class TestServingChatMakeRequestWithHarmony:
+    @pytest.fixture()
+    def mock_engine(self):
+        mock_engine = MagicMock(spec=AsyncLLM)
+        mock_engine.get_tokenizer.return_value = get_tokenizer(MODEL_NAME)
+        mock_engine.errored = False
+        mock_engine.model_config = MockModelConfig()
+        mock_engine.processor = MagicMock()
+        mock_engine.io_processor = MagicMock()
+        return mock_engine
+
+    @pytest.fixture()
+    def serving_chat(self, mock_engine):
+        return _build_serving_chat(mock_engine)
+
+    @pytest.mark.asyncio
+    async def test_simple_chat(self, serving_chat):
+        messages = [{"role": "user", "content": "what is 1+1?"}]
+        req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
+        output_messages, _, _ = serving_chat._make_request_with_harmony(req)
+        assert len(output_messages) == 3
+        assert output_messages[0].author.role == "system"
+        assert output_messages[1].author.role == "developer"
+        assert output_messages[2].author.role == "user"
+        assert output_messages[2].content[0].text == messages[0]["content"]
+
+    @pytest.mark.asyncio
+    async def test_tool_call_response_with_content(self, serving_chat):
+        tools = [
+            {
+                "type": "function",
+                "function": {
+                    "name": "get_weather",
+                    "description": "Get the weather in a given location",
+                    "parameters": {
+                        "type": "object",
+                        "properties": {
+                            "location": {"type": "string"},
+                        },
+                        "required": ["location"],
+                    },
+                },
+            },
+        ]
+        messages = [
+            {
+                "role": "user",
+                "content": "What's the weather like in Paris today?",
+            },
+            {
+                "role": "assistant",
+                "content": "We'll call get_weather.",
+                "tool_calls": [
+                    {
+                        "id": "call_123",
+                        "type": "function",
+                        "function": {
+                            "name": "get_weather",
+                            "arguments": '{"location": "Paris"}',
+                        },
+                    },
+                ]
+            },
+            {
+                "role": "tool",
+                "tool_call_id": "call_123",
+                "content": "20 degrees Celsius",
+            },
+        ]
+        req = ChatCompletionRequest(model=MODEL_NAME, messages=messages, tools=tools)
+        output_messages, _, _ = serving_chat._make_request_with_harmony(req)
+
+        # temporary debugging output
+        print("Output messages:\n")
+        for m in output_messages:
+            print(m.to_dict())
+
+        assert len(output_messages) == 6
+
+        assert output_messages[0].author.role == "system"
+
+        developer_msg = output_messages[1]
+        assert developer_msg.author.role == "developer"
+        developer_tools = developer_msg.content[0].tools
+        assert developer_tools["functions"].tools[0].name == "get_weather"
+
+        user_msg = output_messages[2]
+        assert user_msg.author.role == "user"
+        assert user_msg.content[0].text == messages[0]["content"]
+
+        analysis_msg = output_messages[3]
+        assert analysis_msg.author.role == "assistant"
+        assert analysis_msg.channel == "analysis"
+        assert analysis_msg.content[0].text == messages[1]["content"]
+
+        assistant_msg = output_messages[4]
+        assert assistant_msg.author.role == "assistant"
+        assert assistant_msg.channel == "commentary"
+        assert assistant_msg.recipient == "functions.get_weather"
+        assert assistant_msg.content[0].text == messages[1]["tool_calls"][0]["function"]["arguments"]
+
+        tool_msg = output_messages[5]
+        assert tool_msg.author.role == "tool"
+        assert tool_msg.author.name == "functions.get_weather"
+        assert tool_msg.channel == "commentary"
+        assert tool_msg.recipient == "assistant"
+        assert tool_msg.content[0].text == messages[2]["content"]

That gives us output like:
@levunet If you want to add a test for this, feel free to borrow mine above. You can comment out the one line checking the

If that's unclear, here's what I mean:

diff --git a/vllm/entrypoints/harmony_utils.py b/vllm/entrypoints/harmony_utils.py
index 7958d0317..3308eb03c 100644
--- a/vllm/entrypoints/harmony_utils.py
+++ b/vllm/entrypoints/harmony_utils.py
@@ -246,6 +246,17 @@ def parse_input_to_harmony_message(chat_msg) -> list[Message]:
     tool_calls = chat_msg.get("tool_calls")
     if role == "assistant" and tool_calls:
         msgs: list[Message] = []
+        content = chat_msg.get("content") or ""
+        if isinstance(content, list):
+            content = "".join(
+                item.get("text", "")
+                for item in content
+                if isinstance(item, dict) and item.get("type") == "text"
+            )
+        if content:
+            analysis_msg = Message.from_role_and_content(Role.ASSISTANT, content)
+            analysis_msg = analysis_msg.with_channel("analysis")
+            msgs.append(analysis_msg)

Then there's the issue of addressing the function name problem, which could be added to this but also deferred to a separate PR just to keep the scope down here. Either way, I'd like to get your fixes in as they definitely clean up some areas where we're not properly reconstructing the input as we prompt the gpt-oss models with tool call results. This should improve things for all tool calling usage of these models over the Chat Completions API.
Thanks for diving deep into this and sharing your thoughts along with the test, @bbrowning! Regarding the first issue:
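Presumably the format in question is a tool result that carries the function name explicitly, something like the following (a hypothetical payload reusing the values from the test above; the name field is an extra, non-spec field):

# Hypothetical tool result that includes the function name directly, so the
# Harmony author name can be set without resolving tool_call_id.
tool_result_with_name = {
    "role": "tool",
    "tool_call_id": "call_123",
    "name": "get_weather",           # extra field, not in the Chat Completions spec
    "content": "20 degrees Celsius",
}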
When sending data in this format, you'll see that function.name works correctly. That said, I agree that implementing tool_call_id resolution would be a better approach for OpenAI API compatibility. (I recall considering this issue previously but opted for the current implementation due to time constraints.) Regarding the second issue:
I'm not aware of any clients that would send tool results in that format? vLLM itself explicitly expects and uses the OpenAI types here to define this:

Because we allow extra fields, sending in the name wouldn't technically throw an error, but that also makes our API inconsistent with the Chat Completions API spec and would require every client and client framework to adapt to this instead of passing in tool output as they normally would using tool_call_id.
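For reference, a sketch of the spec-compliant shape (field names as in openai.types.chat.ChatCompletionToolMessageParam from the openai Python package, to the best of my knowledge):

# A Chat Completions tool message only carries role, tool_call_id, and content;
# the function name has to be resolved from the earlier assistant tool_calls entry.
spec_tool_result = {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "20 degrees Celsius",
}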
Given that this PR isn't actually changing our wrong handling of the tool

/cc @yeqcharlotte PTAL.

I incorporated a version of these changes in #28729 because I needed to be able to do an end-to-end test with the BFCL suite to really evaluate before/after gpt-oss performance on multi-turn tool calling. That PR also fixes a number of other important accuracy issues, and I'll keep an eye on this one and rebase the other if this one goes in.
Purpose
This PR requires the following PRs to be merged first: #24768 and the harmony lib PR (openai/harmony#76).
The changes include adding 'with_recipient' and the Assistant's 'analysis' content.
Without adding this content, there was an issue where the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.
Test Plan
gpt-oss_test.py
messages.txt
Run python3 gpt-oss_test.py about 10 times.
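Since the failure is probabilistic, one convenient way to repeat the run is a small helper like this (a sketch; gpt-oss_test.py is the script attached above):

# Repeat the attached test script several times; the abnormal-token issue does
# not reproduce on every run.
import subprocess

for i in range(10):
    print(f"--- run {i + 1} ---")
    subprocess.run(["python3", "gpt-oss_test.py"], check=True)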
Test Result
(before)

(after applying the harmony lib changes)
