feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types for v1/chat/completions
#4244
base: main
Conversation
this one can wait until CI is back, want to make sure this doesn't break engines which don't support the field.
Add support for reasoning fields in OpenAI-compatible chat completion messages to enable compatibility with vLLM reasoning parsers.

Changes:
- Add `reasoning_content` and `reasoning` fields to OpenAIAssistantMessageParam
- Add `reasoning` field to OpenAIChoiceDelta (`reasoning_content` already existed)

Both field names are supported for maximum compatibility:
- `reasoning_content`: used by vLLM ≤ v0.8.4
- `reasoning`: new field name in vLLM ≥ v0.9.x

vLLM documentation recommends migrating to the shorter `reasoning` field name, but maintains backward compatibility with `reasoning_content`. These fields allow reasoning models to return their chain-of-thought process alongside the final answer, which is crucial for transparency and debugging with reasoning models.

References:
- vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
- vLLM Issue #12468: vllm-project/vllm#12468

Signed-off-by: Charlie Doern <[email protected]>
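A rough sketch of what these optional fields could look like, assuming the message types are plain Pydantic models (the field names come from the commit above; the surrounding class shapes are simplified stand-ins, not the actual llama-stack definitions):

```python
from pydantic import BaseModel


class OpenAIAssistantMessageParam(BaseModel):
    """Simplified stand-in for the assistant message type."""

    role: str = "assistant"
    content: str | None = None
    # Older vLLM releases (<= v0.8.4) return the reasoning trace here.
    reasoning_content: str | None = None
    # Newer vLLM releases (>= v0.9.x) prefer the shorter field name.
    reasoning: str | None = None


class OpenAIChoiceDelta(BaseModel):
    """Simplified stand-in for the streaming delta type."""

    content: str | None = None
    reasoning_content: str | None = None  # already existed for streaming
    reasoning: str | None = None  # added for parity with the non-streaming path
```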
✱ Stainless preview builds
⚠️ llama-stack-client-kotlin — studio · code · diff
There was a regression in your SDK: generate ⚠️ → lint ❗ (prev: lint ✅) → test ❗
✅ llama-stack-client-go — studio · code · diff
Your SDK built successfully: generate ⚠️ → lint ❗ → test ❗
go get github.com/stainless-sdks/llama-stack-client-go@64d5dda5cfc86540ad8fdc7fcb937cbe9f93e9ce
⏳ These are partial results; builds are still running.
type: array
- type: 'null'
nullable: true
reasoning_content:
@cdoern can we get away with adding just the reasoning field? While we are doing this, can you think about how we should add Gemini-3's "encrypted thought summaries" field also?
yeah I can probably just use reasoning, I can look into the gemini field
hmmm actually @ashwinb, OpenAIChoiceDelta (used for streaming) already has reasoning_content. So should we mimic that and support both reasoning and reasoning_content for streaming/non-streaming?
added support for thought_signatures b99eb2a
pulled this from gemini docs: https://ai.google.dev/gemini-api/docs/thought-signatures
We have a somewhat non-trivial decision to make here in terms of the API shape for reasoning. Gemini chooses extra_content to transport all new fields, whereas vLLM has clearly gone another way and used the fact that OpenAI's SDK uses "TypedDicts" which are amenable to adding sub-fields directly in other places (reasoning_content). I somewhat like vLLM's decision better and would rather add two sub-fields: (1) reasoning and (2) thought_signature to both the streaming (ChunkDelta) and non-streaming (Chunk) fields.
Do folks who have worked closer to inference have thoughts here @mattf @bbrowning?
yeah, I am fine with either approach here. One half-baked idea I have is that each provider could have a custom version of this class that extends it, e.g. vLLMOpenAIMessageParams(OpenAIMessageParams) or something, and we could add custom content there somehow?
the gemini docs outline their thought_signature support as:
{
  "role": "model",
  "tool_calls": [
    {
      "extra_content": {
        "google": {
          "thought_signature": "<Signature A>"
        }
      },
      "function": {
        "arguments": "{\"flight\":\"AA100\"}",
        "name": "check_flight"
      },
      "id": "function-call-1",
      "type": "function"
    }
  ]
}
so I wonder if this will work without extra_content?
according to gemini docs (https://ai.google.dev/gemini-api/docs/thought-signatures), thought_signature lives in the extra_content of the tool_call; add extra_content to OpenAIChatCompletionToolCall to support this. Signed-off-by: Charlie Doern <[email protected]>
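A hedged sketch of what carrying the Gemini signature on a tool call could look like; the class name OpenAIChatCompletionToolCall comes from the commit message above, while the nested function type and field layout here are simplified assumptions:

```python
from typing import Any

from pydantic import BaseModel


class ToolCallFunction(BaseModel):
    """Simplified stand-in for the tool call's function payload."""

    name: str | None = None
    arguments: str | None = None


class OpenAIChatCompletionToolCall(BaseModel):
    id: str | None = None
    type: str = "function"
    function: ToolCallFunction | None = None
    # Opaque provider extras, e.g. {"google": {"thought_signature": "<Signature A>"}},
    # passed back verbatim on subsequent requests so the provider can validate the call.
    extra_content: dict[str, Any] | None = None
```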
mattf left a comment
few things upfront -
- please clarify the title by mentioning this is for /v1/chat/completions. that will make it more clear to anyone reading the release notes.
- vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request. (aside: it looks like we test against 0.9.2)
that said, i don't think we should do this.
we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
they can convert -
curl http://localhost:8321/v1/chat/completions \
-d '{
"model": "a-reasoning-model",
"messages": [
{
"role": "user",
"content": "what is 0.857142 (repeating) as a fraction?"
}
]
}'
into -
$ curl http://localhost:8321/v1/responses \
-d '{
"reasoning": {"effort": "high", "summary": "detailed"},
"model": "a-reasoning-model",
"input": [
{
"role": "user",
"content": "what is 0.857142 (repeating) as a fraction?"
}
]
}'
this is fair, so does that mean I should only do this for /v1/responses?

I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a concern. Perhaps there is a cleaner way to do this, as I suggested above, with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama Stack. Ben, do you have any opinions here on the importance of this?
in v1/chat/completions we allow users to pass in params that we simply forward to the backend inference provider. we could formally do the same for output. at least the openai-python sdk will let you do this.
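For illustration, a sketch of that pass-through pattern with the openai-python client pointed at a Llama Stack endpoint; the base URL, model name, and the vLLM-style extra request field are placeholders, and whether any extra response field comes back depends entirely on the backend:

```python
from openai import OpenAI

# Placeholder endpoint/credentials; adjust to your Llama Stack deployment.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.chat.completions.create(
    model="a-reasoning-model",
    messages=[
        {"role": "user", "content": "what is 0.857142 (repeating) as a fraction?"}
    ],
    # extra_body is forwarded as-is, so backend-specific knobs can ride along
    # without the client SDK (or the stack) having to model them.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

message = resp.choices[0].message
# openai-python keeps unrecognized response fields on its models, so a
# provider-specific field like reasoning_content can still be read back
# if (and only if) the backend actually returned it.
print(getattr(message, "reasoning_content", None))
```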
Here are two principles I believe are important for widespread Llama Stack adoption in the inference space.

It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models. Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.
those are great principles. (2) especially makes sense for compliant APIs. we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.
While this is true, I don't think people would be broken by these optional extensions. The more I talk about this though, maybe the better solution here is to not have these at the top level of our inference API and instead do some sort of provider-specific inherited version of these types? I can think through what that'd look like if folks would prefer, @mattf @bbrowning?
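One possible shape for that provider-specific idea, purely as a sketch; the subclass names are hypothetical and the base type is a simplified stand-in, not the actual llama-stack model:

```python
from typing import Any

from pydantic import BaseModel


class OpenAIAssistantMessageParam(BaseModel):
    """Simplified stand-in for the shared, provider-neutral message type."""

    role: str = "assistant"
    content: str | None = None


class VLLMAssistantMessageParam(OpenAIAssistantMessageParam):
    # vLLM-specific reasoning fields live only on the vLLM variant.
    reasoning: str | None = None
    reasoning_content: str | None = None


class GeminiAssistantMessageParam(OpenAIAssistantMessageParam):
    # Gemini transports provider extras (e.g. thought signatures) in extra_content.
    extra_content: dict[str, Any] | None = None
```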
This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters. However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.
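A minimal sketch of that loose-validation approach, assuming Pydantic v2; the class here is a simplified stand-in, not the actual llama-stack model:

```python
from pydantic import BaseModel, ConfigDict


class OpenAIAssistantMessageParam(BaseModel):
    """Simplified stand-in that tolerates provider-specific extras."""

    model_config = ConfigDict(extra="allow")

    role: str = "assistant"
    content: str | None = None


msg = OpenAIAssistantMessageParam.model_validate(
    {
        "role": "assistant",
        "content": "6/7",
        "reasoning_content": "0.857142 repeating equals 6/7",
    }
)
# Extra fields are preserved rather than rejected, and can be serialized back out.
print(msg.model_extra)  # {'reasoning_content': '0.857142 repeating equals 6/7'}
```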
spot on. the unvalidated i/o path opens the user to more risk by helping tie apps to specific stack configurations. i'm -0 on this. if someone is going to do it, please take the unvalidated path so we avoid codifying provider-specific implementation details in the public api. hopefully the user is ok moving to /v1/responses.
Summarizing this discussion so far:
- vLLM returns reasoning via reasoning_content (≤ v0.8.4) and reasoning (≥ v0.9.x) on chat completion messages, and Gemini transports thought signatures via extra_content on tool calls.
- This PR adds those as optional fields on the Llama Stack chat completion types.
- Alternatives raised: keep the public API unchanged and rely on loose validation / pass-through of provider-specific fields, use provider-specific subclasses of the message types, or steer users who want separated reasoning output to /v1/responses with reasoning={"summary": ...}.
What does this PR do?
Add support for reasoning fields in OpenAI-compatible chat completion messages to enable compatibility with vLLM reasoning parsers.

Changes:
- Add `reasoning_content` and `reasoning` fields to OpenAIAssistantMessageParam
- Add `reasoning` field to OpenAIChoiceDelta (`reasoning_content` already existed)

Both field names are supported for maximum compatibility:
- `reasoning_content`: used by vLLM ≤ v0.8.4
- `reasoning`: new field name in vLLM ≥ v0.9.x (based on release notes)

vLLM documentation recommends migrating to the shorter `reasoning` field name, but maintains backward compatibility with `reasoning_content`. These fields allow reasoning models to return their chain-of-thought process alongside the final answer, which is crucial for transparency and debugging with reasoning models.

References:
- vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
- `reasoning_content` in API for reasoning models like DeepSeek R1: vllm-project/vllm#12468

Test Plan