
Conversation

@cdoern (Collaborator) commented Nov 26, 2025

What does this PR do?

Add support for reasoning fields in OpenAI-compatible chat completion
messages to enable compatibility with vLLM reasoning parsers.

Changes:

  • Add reasoning_content and reasoning fields to OpenAIAssistantMessageParam
  • Add reasoning field to OpenAIChoiceDelta (reasoning_content already existed)

Both field names are supported for maximum compatibility:

  • reasoning_content: Used by vLLM ≤ v0.8.4
  • reasoning: New field name in vLLM ≥ v0.9.x
    (based on release notes)

vLLM documentation recommends migrating to the shorter reasoning field
name, but maintains backward compatibility with reasoning_content.

These fields allow reasoning models to return their chain-of-thought
process alongside the final answer, which is crucial for transparency
and debugging with reasoning models.
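For illustration, a minimal pydantic sketch of the field additions described above (the real OpenAIAssistantMessageParam and OpenAIChoiceDelta models in llama-stack carry more fields; only the reasoning-related fields are shown here):

from pydantic import BaseModel


class OpenAIAssistantMessageParam(BaseModel):
    # Sketch only: the actual llama-stack model has additional fields.
    role: str = "assistant"
    content: str | None = None
    reasoning_content: str | None = None  # vLLM <= v0.8.4 field name
    reasoning: str | None = None          # vLLM >= v0.9.x field name


class OpenAIChoiceDelta(BaseModel):
    # Sketch only: reasoning_content already existed; reasoning is the addition.
    content: str | None = None
    reasoning_content: str | None = None
    reasoning: str | None = None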

References:

  • vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
  • vLLM Issue #12468: vllm-project/vllm#12468

Test Plan

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --reasoning-parser deepseek_r1
  
llama stack run starter

curl http://localhost:8321/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "messages": [
        {"role": "user", "content": "What is 25 * 4?"}
      ]
    }'

{"id":"chatcmpl-9df9d2a5f849bbe0","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"\n\nTo calculate \\(25 \\times 4\\), follow these easy steps:\n\n1. **Understand the Multiplication:**\n   \n   \\(25 \\times 4\\) means you are adding the number 25 four times.\n   \n   \\[\n   25 + 25 + 25 + 25 = 100\n   \\]\n\n2. **Break Down the Multiplication:**\n   \n   - Multiply 25 by 2:\n     \\[\n     25 \\times 2 = 50\n     \\]\n   - Then multiply the result by 2:\n     \\[\n     50 \\times 2 = 100\n     \\]\n\n3. **Final Answer:**\n   \n   \\[\n   \\boxed{100}\n   \\]","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"reasoning":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n","reasoning_content":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n"},"stop_reason":null,"token_ids":null}],"created":1764187386,"model":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":356,"prompt_tokens":14,"total_tokens":370,"completion_tokens_details":null,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":[{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089063Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"prompt_tokens","value":14,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089072Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"completion_tokens","value":356,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089075Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"total_tokens","value":370,"unit":"tokens"}]}%

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 26, 2025
@cdoern marked this pull request as ready for review November 26, 2025 20:06
@cdoern (Collaborator, Author) commented Nov 26, 2025

this one can wait until CI is back; want to make sure this doesn't break engines which don't support the field.

Add support for reasoning fields in OpenAI-compatible chat completion
messages to enable compatibility with vLLM reasoning parsers.

Changes:
- Add `reasoning_content` and `reasoning` fields to OpenAIAssistantMessageParam
- Add `reasoning` field to OpenAIChoiceDelta (reasoning_content already existed)

Both field names are supported for maximum compatibility:
- `reasoning_content`: Used by vLLM ≤ v0.8.4
- `reasoning`: New field name in vLLM ≥ v0.9.x

vLLM documentation recommends migrating to the shorter `reasoning` field
name, but maintains backward compatibility with `reasoning_content`.

These fields allow reasoning models to return their chain-of-thought
process alongside the final answer, which is crucial for transparency
and debugging with reasoning models.

References:
- vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
- vLLM Issue #12468: vllm-project/vllm#12468

Signed-off-by: Charlie Doern <[email protected]>
@github-actions bot (Contributor) commented Dec 1, 2025

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat: add reasoning and reasoning_content fields to OpenAI message types


⚠️ llama-stack-client-node studio · code · diff

There was a regression in your SDK.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/70bd51e76d5f76cbc45e6601ae9bca4c451e5ea0/dist.tar.gz
New diagnostics (5 warnings)

⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputListOpenAIResponseMessageUnionOpenAIResponseInputFunctionToolCallOutputOpenAIResponseMessageInput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `DataOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
Please rename it using the 'renameValue' transform.

⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
Please rename it using the 'renameValue' transform.

⚠️ llama-stack-client-kotlin studio · code · diff

There was a regression in your SDK.
generate ⚠️ · lint ❗ (prev: lint ✅) → test ❗

llama-stack-client-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@64d5dda5cfc86540ad8fdc7fcb937cbe9f93e9ce
llama-stack-client-python studio · code · diff

generate ⚠️ · build ⏳ · lint ⏳ · test ⏳

⏳ These are partial results; builds are still running.



Review thread on the `reasoning_content` property in the generated OpenAPI spec:
Contributor:

@cdoern can we get away with adding just the reasoning field? while we are doing this, can you think about how we should add Gemini-3's "encrypted thought summaries" field also?

cdoern (Collaborator, Author):

yeah I can probably just use reasoning, I can look into the gemini field

cdoern (Collaborator, Author):

hmmm actually @ashwinb, OpenAIChoiceDelta (used for streaming) already has reasoning_content. so should we mimic that and support both reasoning and reasoning_content for streaming/non-streaming?

cdoern (Collaborator, Author):

added support for thought_signatures b99eb2a

pulled this from gemini docs: https://ai.google.dev/gemini-api/docs/thought-signatures

Contributor:

We have a somewhat non-trivial decision to make here in terms of the API shape for reasoning. Gemini chooses extra_content to transport all new fields, whereas vLLM has clearly gone another way and used the fact that OpenAI's SDK uses "TypedDicts" which are amenable to adding sub-fields directly in other places (reasoning_content). I somewhat like vLLM's decision better and would rather add two sub-fields: (1) reasoning and (2) thought_signature to both the streaming (ChunkDelta) and non-streaming (Chunk) fields.

Do folks who have worked closer to inference have thoughts here @mattf @bbrowning?
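For concreteness, the two shapes under discussion side by side, written as illustrative payload fragments (Python dicts; the content and reasoning strings are made up, and neither shape is a committed llama-stack design):

# vLLM-style: reasoning travels as sibling fields on the assistant message.
vllm_style_message = {
    "role": "assistant",
    "content": "The answer is 100.",
    "reasoning": "25 * 2 = 50, and 50 * 2 = 100.",
}

# Gemini-style: provider data is nested under extra_content, here on a tool
# call, mirroring the thought_signature example shown below in this thread.
gemini_style_tool_call = {
    "id": "function-call-1",
    "type": "function",
    "function": {"name": "check_flight", "arguments": '{"flight":"AA100"}'},
    "extra_content": {"google": {"thought_signature": "<Signature A>"}},
}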

@cdoern (Collaborator, Author) commented Dec 1, 2025:

yeah, I am fine with either approach here. One half-baked idea I have is that each provider could have a custom version of this class that extends it, e.g. vLLMOpenAIMessageParams(OpenAIMessageParams), and we could add custom content there somehow?

the gemini docs outline their thought_signature support as:

{
  "role": "model",
  "tool_calls": [
    {
      "extra_content": {
        "google": {
          "thought_signature": "<Signature A>"
        }
      },
      "function": {
        "arguments": "{\"flight\":\"AA100\"}",
        "name": "check_flight"
      },
      "id": "function-call-1",
      "type": "function"
    }
  ]
}

so I wonder if this will work without extra_content?

according to gemini docs: https://ai.google.dev/gemini-api/docs/thought-signatures

thought_signature lives in the extra_content of the tool_call; add extra_content to OpenAIChatCompletionToolCall to support this

Signed-off-by: Charlie Doern <[email protected]>
@cdoern changed the title from "feat: add reasoning and reasoning_content fields to OpenAI message types" to "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types" Dec 1, 2025
@mattf (Collaborator) left a comment:

few things upfront -

  • please clarify the title by mentioning this is for /v1/chat/completions. that will make it more clear to anyone reading the release notes.
  • vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request. (aside: it looks like we test against 0.9.2)

that said, i don't think we should do this.

we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}

they can convert -

curl http://localhost:8321/v1/chat/completions \
  -d '{
    "model": "a-reasoning-model",
    "messages": [
      {
        "role": "user",
        "content": "what is 0.857142 (repeating) as a fraction?"
      }
    ]
  }'

into -

$ curl http://localhost:8321/v1/responses \
  -d '{ 
    "reasoning": {"effort": "high", "summary": "detailed"},
    "model": "a-reasoning-model",
    "input": [
      {
        "role": "user",
        "content": "what is 0.857142 (repeating) as a fraction?"
      }
    ]
  }'

@cdoern (Collaborator, Author) commented Dec 2, 2025

@mattf:

> vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request

this is fair, so does that mean I should only do reasoning here if we do this?

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}

I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

@cdoern changed the title from "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types" to "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types for v1/chat/completions" Dec 2, 2025
@mattf (Collaborator) commented Dec 2, 2025

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

we allow users to pass in params that we simply forward to the backend inference provider. we could formally do the same for output. at least the openai-python sdk will let you do this.

$ nc -l 8000 <<EOF
HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "fake-123",
  "object": "chat.completion",
  "model": "ignored",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello from the fake model!"}, "finish_reason": "stop"}],
  "extra_info": {"foo": "bar", "latency_ms": 12}
}
EOF
$ uv run --with openai python -c 'from openai import OpenAI; response = OpenAI(base_url="http://127.0.0.1:8000", api_key="dummy").chat.completions.create(model="ignored", messages=[{"role": "user", "content": "Hello?"}]); print(response.extra_info)'
{'foo': 'bar', 'latency_ms': 12}

@bbrowning (Collaborator) commented:

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.
>
> Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:

  1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.

  2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.

It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.

Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.
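To make the multi-turn point concrete, a hypothetical second-turn Chat Completions payload that echoes the first turn's reasoning back to the model (the field name follows the vLLM convention discussed above, and the messages are invented; whether a given backend actually consumes the field is provider-dependent):

followup_request = {
    "model": "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {"role": "user", "content": "What is 25 * 4?"},
        {
            "role": "assistant",
            "content": "25 * 4 = 100.",
            # Reasoning returned in turn one, sent back for turn two.
            "reasoning_content": "25 * 2 = 50, and 50 * 2 = 100.",
        },
        {"role": "user", "content": "Now divide that by 8."},
    ],
}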

@mattf (Collaborator) commented Dec 2, 2025

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
> >
> > I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.
> >
> > Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?
>
> Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:
>
> 1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.
>
> 2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.
>
> It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.
>
> Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.

those are great principles. (2) especially makes sense for compliant APIs.

we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

@cdoern (Collaborator, Author) commented Dec 2, 2025

> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api

While this is true, I don't think people would be broken by these optional extensions. Like if ollama doesn't support reasoning, a user does not need to pass it in or receive the output but can still use this OpenAI message type.

The more I talk about this though, maybe the better solution here is to not have these at the top level of our inference API and instead do some sort of provider-specific inherited version of these types? I can think through what that'd look like if folks would prefer @mattf @bbrowning?

@bbrowning (Collaborator) commented:

> those are great principles. (2) especially makes sense for compliant APIs.
>
> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.

However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.
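As a rough sketch of what "loose enough" could mean with pydantic v2 (this is not the actual llama-stack model definition, just an illustration of tolerating extras instead of enumerating them):

from pydantic import BaseModel, ConfigDict


class OpenAIAssistantMessageParam(BaseModel):
    # Sketch only: accept provider-specific extras so fields like reasoning,
    # reasoning_content, or extra_content pass through unvalidated.
    model_config = ConfigDict(extra="allow")

    role: str = "assistant"
    content: str | None = None


msg = OpenAIAssistantMessageParam.model_validate(
    {"role": "assistant", "content": "100", "reasoning": "25 * 4 = 100"}
)
print(msg.model_extra)  # {'reasoning': '25 * 4 = 100'}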

@mattf (Collaborator) commented Dec 3, 2025

> > those are great principles. (2) especially makes sense for compliant APIs.
> >
> > we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.
>
> This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.
>
> However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.

spot on. the unvalidated i/o path opens the user to more risk by helping tie apps to a specific stack configuration.

i'm -0 on this. if someone is going to do it, please take the unvalidated path so we avoid codifying provider-specific implementation details in the public api.

hopefully the user is ok moving to /v1/responses.

@ashwinb (Contributor) commented Dec 3, 2025

Summarizing this discussion so far:

  1. we should not update our public API to add reasoning details in chat completions
  2. while still maintaining (1), we should allow for the transport of provider-specific fields (for example, to vLLM and Gemini) in both directions.
