
[Bug] Fix gpt-oss missing tool content #24954

Open
levunet wants to merge 1 commit into vllm-project:main from levunet:feat/gptoss-tool-fix

Conversation

@levunet
Contributor

@levunet levunet commented Sep 16, 2025

Purpose

This PR requires the following PRs to be merged first: #24768 and the harmony lib PR (openai/harmony#76).

The changes add 'with_recipient' to the tool response message and include the assistant's 'analysis' content when rebuilding the conversation.
Without this content, the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.
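
For context, here is a minimal sketch of the two additions, using only the openai_harmony calls that already appear in this thread (Message.from_role_and_content, with_channel, with_recipient); the literal strings are illustrative, and the real code derives them from the incoming chat messages:

    from openai_harmony import Message, Role

    # 1. Preserve the assistant's reasoning text from the previous turn as an
    #    "analysis"-channel message instead of dropping it.
    analysis_msg = Message.from_role_and_content(Role.ASSISTANT, "We'll call get_weather.")
    analysis_msg = analysis_msg.with_channel("analysis")

    # 2. Address the tool result back to the assistant via with_recipient.
    #    (The real code also sets the tool author name, e.g. functions.get_weather.)
    tool_msg = Message.from_role_and_content(Role.TOOL, "20 degrees Celsius")
    tool_msg = tool_msg.with_channel("commentary").with_recipient("assistant")

    harmony_msgs = [analysis_msg, tool_msg]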

Test Plan

gpt-oss_test.py
messages.txt

Run python3 gpt-oss_test.py about 10 times.

Test Result

(before)
image

(after applying the harmony lib changes)
image

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a bug in gpt-oss model tool calls by adding 'analysis' content and with_recipient. The changes are logical and align with the stated purpose. However, I've identified a high-severity issue where the handling of 'analysis' content is not robust for multimodal inputs, which could lead to a runtime crash. I have provided a code suggestion to address this.

@gemini-code-assist
Contributor

Severity: high

The logic to extract content is not robust. The content of an assistant message can be a list of parts (e.g., for multimodal inputs), not just a string. The current implementation content = chat_msg.get("content") or "" will cause a runtime error if content is a non-empty list, as Message.from_role_and_content expects a string. The code should handle the case where content is a list by extracting the text parts, similar to how it's handled for other message types in this file.

        content = chat_msg.get("content")
        if isinstance(content, list):
            # Extract text from multimodal content
            content = "\n".join(
                p.get("text", "") for p in content
                if isinstance(p, dict) and p.get("type") == "text")
        elif not isinstance(content, str):
            content = ""

        analysis_msg = Message.from_role_and_content(Role.ASSISTANT, content)
        analysis_msg = analysis_msg.with_channel("analysis")
        msgs.append(analysis_msg)

@levunet levunet force-pushed the feat/gptoss-tool-fix branch 2 times, most recently from c3780bf to eed8815 on September 16, 2025 at 09:19
@levunet
Contributor Author

levunet commented Sep 16, 2025

@alecsolder
In the harmony library, the ' to' value should have been encoded as id 316, but occasionally it was incorrectly encoded as the two-id combination 220 + 935, which caused the model to output incorrect tokens.
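
If anyone wants to eyeball the tokenization, here is a rough sketch with tiktoken (assuming the harmony encoding shares its base vocabulary with o200k_base; the ids named in the comments come from this discussion, not from running the script):

    import tiktoken

    # Assumption: the harmony encoding builds on the o200k_base vocabulary.
    enc = tiktoken.get_encoding("o200k_base")

    # Per the discussion above, " to" in the rendered prompt should encode as the
    # single id 316; the buggy path emitted the pair 220 + 935 instead.
    print(enc.encode(" to"))
    print(enc.decode([220, 935]))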

@alecsolder
Contributor

alecsolder commented Sep 18, 2025

In your test script, I see you're using the streaming completions endpoint, which I don't think uses the harmony_utils method you modified? I just want to double-check that I'm reading it right.

@alecsolder
Contributor

Also, thanks for mentioning that huggingface tokenizer change, I hadn't seen it and my snapshot was out of date!

@levunet
Contributor Author

levunet commented Sep 19, 2025

Thank you for checking. I've double-checked the part you mentioned.
I can confirm that the parse_chat_input I modified in harmony_utils.py is used without any issues to parse request.messages through the flow /v1/chat/completions -> create_chat_completion -> _make_request_with_harmony -> parse_chat_input.
The streaming endpoint is also correctly using this modified method.
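
For reference, a minimal client call that exercises that path (a sketch; it assumes a local vLLM server on port 8000, and the model name and tool definition are placeholders):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    # Streaming chat completion; on the server side request.messages is parsed
    # by parse_chat_input before the gpt-oss prompt is rendered.
    stream = client.chat.completions.create(
        model="openai/gpt-oss-20b",  # placeholder model name
        messages=[{"role": "user", "content": "What's the weather like in Paris today?"}],
        tools=tools,
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].delta, flush=True)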

@fouadbakkour

When will this PR be merged, guys? We need this fix, please.

@youkaichao
Member

cc @chaunceyjiang @aarnphm

@levunet levunet force-pushed the feat/gptoss-tool-fix branch from 39d6e8e to e8e7579 on October 10, 2025 at 13:12
The changes include adding 'with_recipient' and the Assistant's 'analysis' content.
Without adding this content, there was an issue where the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.

Signed-off-by: kyt <[email protected]>
@levunet levunet force-pushed the feat/gptoss-tool-fix branch from e8e7579 to f13f45b on October 10, 2025 at 13:19
@levunet
Contributor Author

levunet commented Oct 10, 2025

I was delayed today by some issues I hit while resolving git conflicts: FlashInfer 0.4.0, added in recent commits, was causing responses from the gpt-oss model to repeat infinitely. It took quite some time to find a solution.

VLLM_USE_FLASHINFER_SAMPLER=0

The problem occurred in the sampling process, and it works normally when the FlashInfer sampler is disabled.
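
In case it helps others, one way to apply the workaround when launching the server (a sketch; assumes the vllm CLI is on PATH, and the model name is a placeholder):

    import os
    import subprocess

    # Launch vLLM with the FlashInfer sampler disabled. This only works around the
    # repetition issue mentioned above; it is unrelated to the harmony changes in this PR.
    env = dict(os.environ, VLLM_USE_FLASHINFER_SAMPLER="0")
    subprocess.run(["vllm", "serve", "openai/gpt-oss-20b"], env=env, check=True)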

@epark001

I was delayed today by some issues I hit while resolving git conflicts: FlashInfer 0.4.0, added in recent commits, was causing responses from the gpt-oss model to repeat infinitely. It took quite some time to find a solution.

VLLM_USE_FLASHINFER_SAMPLER=0

The problem occurred in the sampling process, and it works normally when the FlashInfer sampler is disabled.

Will the fix need both VLLM_USE_FLASHINFER_SAMPLER=0 and this PR, or just the FlashInfer setting?

@levunet
Contributor Author

levunet commented Oct 10, 2025

VLLM_USE_FLASHINFER_SAMPLER=0

I think this setting resolves an issue that occurs in a completely different place from the current PR. After testing multiple times, the same problem seems to exist even without this PR, so the setting can be applied independently.

@bbrowning bbrowning mentioned this pull request Oct 10, 2025
@ashgold

ashgold commented Oct 21, 2025

Any progress here?
Many people in my department want to use the tool calling feature with gpt-oss, but it seems that feature isn't fully implemented in vLLM yet.

@levunet
Contributor Author

levunet commented Oct 21, 2025

Unfortunately, the current changes alone are insufficient to fully resolve the bug; the changes from the harmony repo are essential. Since OpenAI hasn't reviewed that PR yet, should I propose pip-installing from my fork branch in the meantime?

@levunet
Contributor Author

levunet commented Oct 21, 2025

@aarnphm @chaunceyjiang

This PR fixes a bug that occurred when using tool calling with gpt-oss models. However, it requires additional changes from the openai/harmony repo, which unfortunately hasn't been reviewed for a month now. Would it be acceptable to temporarily modify the requirements to install from my forked harmony branch until the upstream PR is merged?

@ashgold

ashgold commented Oct 23, 2025

@aarnphm @chaunceyjiang

This PR fixes a bug that occurred when using tool calling with gpt-oss models. However, it requires additional changes from the openai/harmony repo, which unfortunately hasn't been reviewed for a month now. Would it be acceptable to temporarily modify the requirements to install from my forked harmony branch until the upstream PR is merged?

@levunet
Could you provide guidance on how to build using the modified version of the openai-harmony module?
I'd like to build the vLLM image with this PR and the modified openai-harmony module, then test the tool calling functionality.
I have experience building the vLLM v0.11.0 image in my environment.

@dr75
Contributor

dr75 commented Oct 31, 2025

@levunet, could you give an example where tool calling does not work in v0.11.1? I couldn't find any, so I wonder whether this fix and the harmony fix are actually needed.

Basically #24768 makes tool calling work for me for chat completions in streaming mode.

However, there are two remaining issues:

  • using --tool-call-parser openai breaks normal non-streaming chat completions, so this flag must not be used right now (tool calling works without it)
  • tool parsing for non-streaming requests doesn't work in v0.11.0; it also doesn't work in v0.11.1, nor with the fix here plus the harmony fix you mentioned.

@dr75
Contributor

dr75 commented Oct 31, 2025

OK, I could reproduce the cases now with your script above and can confirm that they still happen on v0.11.1 and are fixed with this PR and the harmony change.

@bbrowning
Contributor

This PR isn't actually dependent on openai/harmony#76, is it? The fixes here to include the reasoning text from the previous turn as an input message on the analysis channel and to include the assistant recipient on the tool response seem reasonable and necessary, even without the changes to the harmony library.

@dr75
Contributor

dr75 commented Nov 6, 2025

I think you are right; that information just seems to be missing at the moment. This is problematic for the prefix cache and probably for the model's reasoning.

But I do see the problem that we are missing tests verifying the loop model_output -> parse -> encode -> model_output.

Without such tests it is hard to say where we are in terms of correctness.

But maybe for this PR it's good enough to verify that it works as expected even without the harmony changes. Together with PR #28139, the test script above should work, and so should the temporary test in PR #28139. Neither works on main.

I don't know if there are any tests that verify complex tool calls.

@bbrowning
Contributor

There's another related problem here as well, which I identified by stubbing in a test for this. Our tool response ends up looking like this as a Harmony message:

{'role': <Role.TOOL: 'tool'>, 'name': 'functions.', 'content': [{'type': 'text', 'text': '20 degrees Celsius'}], 'channel': 'commentary', 'recipient': 'assistant'}

Note the name there: it's wrong. It should be functions.get_weather (in my example, because this is the response to my get_weather function), but we're not properly extracting the name; we need to use the response's tool_call_id to match it with the tool call the model generated earlier in the message stack and recover the name from there.
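
A rough sketch of what that lookup could look like (not what this PR implements; the helper name and message shapes here are hypothetical):

    def resolve_tool_name(tool_msg: dict, messages: list[dict]) -> str | None:
        """Resolve the function name for a tool result by matching its tool_call_id
        against the tool_calls emitted by an earlier assistant message."""
        call_id = tool_msg.get("tool_call_id")
        for msg in messages:
            if msg.get("role") != "assistant":
                continue
            for call in msg.get("tool_calls") or []:
                if call.get("id") == call_id:
                    return call["function"]["name"]
        return None

    # e.g. resolve_tool_name(messages[2], messages) would return "get_weather",
    # and the Harmony author name would then become "functions.get_weather".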

Here's the example test I'm using that verifies both the changes in this PR as well as uncovered that additional issue:

diff --git a/tests/entrypoints/openai/test_serving_chat.py b/tests/entrypoints/openai/test_serving_chat.py
index dd10384a7..6ac94160c 100644
--- a/tests/entrypoints/openai/test_serving_chat.py
+++ b/tests/entrypoints/openai/test_serving_chat.py
@@ -728,3 +728,112 @@ async def test_serving_chat_data_parallel_rank_extraction():
     # Verify that data_parallel_rank defaults to None
     assert "data_parallel_rank" in mock_engine.generate.call_args.kwargs
     assert mock_engine.generate.call_args.kwargs["data_parallel_rank"] is None
+
+
+class TestServingChatMakeRequestWithHarmony:
+    @pytest.fixture()
+    def mock_engine(self):
+        mock_engine = MagicMock(spec=AsyncLLM)
+        mock_engine.get_tokenizer.return_value = get_tokenizer(MODEL_NAME)
+        mock_engine.errored = False
+        mock_engine.model_config = MockModelConfig()
+        mock_engine.processor = MagicMock()
+        mock_engine.io_processor = MagicMock()
+        return mock_engine
+
+    @pytest.fixture()
+    def serving_chat(self, mock_engine):
+        return _build_serving_chat(mock_engine)
+
+    @pytest.mark.asyncio
+    async def test_simple_chat(self, serving_chat):
+        messages = [{"role": "user", "content": "what is 1+1?"}]
+        req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
+        output_messages, _, _ = serving_chat._make_request_with_harmony(req)
+        assert len(output_messages) == 3
+        assert output_messages[0].author.role == "system"
+        assert output_messages[1].author.role == "developer"
+        assert output_messages[2].author.role == "user"
+        assert output_messages[2].content[0].text == messages[0]["content"]
+
+    @pytest.mark.asyncio
+    async def test_tool_call_response_with_content(self, serving_chat):
+        tools = [
+            {
+                "type": "function",
+                "function": {
+                    "name": "get_weather",
+                    "description": "Get the weather in a given location",
+                    "parameters": {
+                        "type": "object",
+                        "properties": {
+                            "location": {"type": "string"},
+                        },
+                        "required": ["location"],
+                    },
+                },
+            },
+        ]
+        messages = [
+            {
+                "role": "user",
+                "content": "What's the weather like in Paris today?",
+            },
+            {
+                "role": "assistant",
+                "content": "We'll call get_weather.",
+                "tool_calls": [
+                    {
+                        "id": "call_123",
+                        "type": "function",
+                        "function": {
+                            "name": "get_weather",
+                            "arguments": '{"location": "Paris"}',
+                        },
+                    },
+                ]
+            },
+            {
+                "role": "tool",
+                "tool_call_id": "call_123",
+                "content": "20 degrees Celsius",
+            },
+        ]
+        req = ChatCompletionRequest(model=MODEL_NAME, messages=messages, tools=tools)
+        output_messages, _, _ = serving_chat._make_request_with_harmony(req)
+
+        # temporary debugging output
+        print("Output messages:\n")
+        for m in output_messages:
+            print(m.to_dict())
+
+        assert len(output_messages) == 6
+
+        assert output_messages[0].author.role == "system"
+
+        developer_msg = output_messages[1]
+        assert developer_msg.author.role == "developer"
+        developer_tools = developer_msg.content[0].tools
+        assert developer_tools["functions"].tools[0].name == "get_weather"
+
+        user_msg = output_messages[2]
+        assert user_msg.author.role == "user"
+        assert user_msg.content[0].text == messages[0]["content"]
+
+        analysis_msg = output_messages[3]
+        assert analysis_msg.author.role == "assistant"
+        assert analysis_msg.channel == "analysis"
+        assert analysis_msg.content[0].text == messages[1]["content"]
+
+        assistant_msg = output_messages[4]
+        assert assistant_msg.author.role == "assistant"
+        assert assistant_msg.channel == "commentary"
+        assert assistant_msg.recipient == "functions.get_weather"
+        assert assistant_msg.content[0].text == messages[1]["tool_calls"][0]["function"]["arguments"]
+
+        tool_msg = output_messages[5]
+        assert tool_msg.author.role == "tool"
+        assert tool_msg.author.name == "functions.get_weather"
+        assert tool_msg.channel == "commentary"
+        assert tool_msg.recipient == "assistant"
+        assert tool_msg.content[0].text == messages[2]["content"]

That gives us output like:

tests/entrypoints/openai/test_serving_chat.py::TestServingChatMakeRequestWithHarmony::test_tool_call_response_with_content Output messages:

{'role': <Role.SYSTEM: 'system'>, 'name': None, 'content': [{'model_identity': 'You are ChatGPT, a large language model trained by OpenAI.', 'reasoning_effort': <ReasoningEffort.MEDIUM: 'Medium'>, 'conversation_start_date': '2025-11-07', 'knowledge_cutoff': '2024-06', 'channel_config': {'valid_channels': ['analysis', 'commentary', 'final'], 'channel_required': True}, 'type': 'system_content'}]}
{'role': <Role.DEVELOPER: 'developer'>, 'name': None, 'content': [{'tools': {'functions': {'name': 'functions', 'tools': [{'name': 'get_weather', 'description': 'Get the weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string'}}, 'required': ['location']}}]}}, 'type': 'developer_content'}]}
{'role': <Role.USER: 'user'>, 'name': None, 'content': [{'type': 'text', 'text': "What's the weather like in Paris today?"}]}
{'role': <Role.ASSISTANT: 'assistant'>, 'name': None, 'content': [{'type': 'text', 'text': "We'll call get_weather."}], 'channel': 'analysis'}
{'role': <Role.ASSISTANT: 'assistant'>, 'name': None, 'content': [{'type': 'text', 'text': '{"location": "Paris"}'}], 'channel': 'commentary', 'recipient': 'functions.get_weather', 'content_type': 'json'}
{'role': <Role.TOOL: 'tool'>, 'name': 'functions.', 'content': [{'type': 'text', 'text': '20 degrees Celsius'}], 'channel': 'commentary', 'recipient': 'assistant'}
...
        tool_msg = output_messages[5]
        assert tool_msg.author.role == "tool"
>       assert tool_msg.author.name == "functions.get_weather"
E       AssertionError: assert 'functions.' == 'functions.get_weather'
E         
E         - functions.get_weather
E         + functions.

tests/entrypoints/openai/test_serving_chat.py:836: AssertionError

@bbrowning
Contributor

bbrowning commented Nov 7, 2025

@levunet If you want to add a test for this, feel free to borrow mine above. You can comment out the one line checking tool_msg.author.name, remove the temporary debug output, and it should pass with your changes here. It's probably also worth addressing the Gemini comment about how to concatenate content parts, as well as adding a check to skip the analysis message when it has no content.

If that's unclear, here's what I mean:

diff --git a/vllm/entrypoints/harmony_utils.py b/vllm/entrypoints/harmony_utils.py
index 7958d0317..3308eb03c 100644
--- a/vllm/entrypoints/harmony_utils.py
+++ b/vllm/entrypoints/harmony_utils.py
@@ -246,6 +246,17 @@ def parse_input_to_harmony_message(chat_msg) -> list[Message]:
     tool_calls = chat_msg.get("tool_calls")
     if role == "assistant" and tool_calls:
         msgs: list[Message] = []
+        content = chat_msg.get("content") or ""
+        if isinstance(content, list):
+            content = "".join(
+                item.get("text", "")
+                for item in content
+                if isinstance(item, dict) and item.get("type") == "text"
+            )
+        if content:
+            analysis_msg = Message.from_role_and_content(Role.ASSISTANT, content)
+            analysis_msg = analysis_msg.with_channel("analysis")
+            msgs.append(analysis_msg)

Then there's the function name problem, which could be addressed here or deferred to a separate PR just to keep the scope down.

Either way, I'd like to get your fixes in as they definitely clean up some areas where we're not properly reconstructing the input as we prompt the gpt-oss models with tool call results. This should improve things for all tool calling usage of these models over the Chat Completions API.

@levunet
Contributor Author

levunet commented Nov 8, 2025

Thanks for diving deep into this and sharing your thoughts along with the test! @bbrowning

Regarding the first issue:
The current implementation retrieves the function name directly from the name field rather than resolving it through tool_call_id. Therefore, the expected data structure in our current codebase is:

{role: 'tool', name: 'get_weather', content: 'data'}

When sending data in this format, you'll see that function.name works correctly. That said, I agree that implementing tool_call_id resolution would be a better approach for OpenAI API compatibility. (I recall considering this issue previously but opted for the current implementation due to time constraints.)

Regarding the second issue:
About handling array-type content for analysis_msg: since gpt-oss models don't generate array-formatted analysis content, I'm not certain this change is necessary. The current string-only handling should suffice for our specific use case with gpt-oss models.

@bbrowning
Contributor

Regarding the first issue:
The current implementation retrieves the function name directly from the name field rather than resolving it through tool_call_id. Therefore, the expected data structure in our current codebase is:

{role: 'tool', name: 'get_weather', content: 'data'}

When sending data in this format, you'll see that function.name works correctly. That said, I agree that implementing tool_call_id resolution would be a better approach for OpenAI API compatibility. (I recall considering this issue previously but opted for the current implementation due to time constraints.)

I'm not aware of any clients that would send tool results in that format. vLLM itself explicitly expects and uses the OpenAI types here to define this:

https://github.com/openai/openai-python/blob/6574bcd612771186995074846253caa0ff1ba517/src/openai/types/chat/chat_completion_tool_message_param.py#L13-L21

Because we allow extra fields, sending in the name wouldn't technically throw an error, but it makes our API inconsistent with the Chat Completions API spec and would require every client and client framework to adapt to this instead of passing in tool output as they normally would, using tool_call_id.

@bbrowning
Contributor

Given that this PR isn't actually changing our incorrect handling of the tool name parameter, I'm still supportive of merging this as-is and handling the name separately if that's desired 😄

@mgoin mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Nov 8, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Nov 8, 2025
@chaunceyjiang
Collaborator

/cc @yeqcharlotte PTAL.

@bbrowning
Contributor

I incorporated a version of these changes in #28729 because I needed to be able to run an end-to-end test with the BFCL suite to really evaluate before/after gpt-oss performance on multi-turn tool calling. That PR also fixes a number of other important accuracy issues, and I'll keep an eye on this one and rebase the other if this one goes in.


Labels

bug (Something isn't working), frontend, gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Ready


10 participants