
Conversation

@cdoern (Collaborator) commented Nov 26, 2025

What does this PR do?

Add support for reasoning fields in OpenAI-compatible chat completion
messages to enable compatibility with vLLM reasoning parsers.

Changes:

  • Add reasoning_content and reasoning fields to OpenAIAssistantMessageParam
  • Add reasoning field to OpenAIChoiceDelta (reasoning_content already existed)

Both field names are supported for maximum compatibility:

  • reasoning_content: Used by vLLM ≤ v0.8.4
  • reasoning: New field name in vLLM ≥ v0.9.x
    (based on release notes)

vLLM documentation recommends migrating to the shorter reasoning field
name, but maintains backward compatibility with reasoning_content.

These fields allow reasoning models to return their chain-of-thought
process alongside the final answer, which is crucial for transparency
and debugging with reasoning models.
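For illustration, a minimal pydantic sketch of the field additions described above (the real OpenAIAssistantMessageParam and OpenAIChoiceDelta models in llama-stack carry more fields; only the reasoning-related fields are shown here):

from pydantic import BaseModel


class OpenAIAssistantMessageParam(BaseModel):
    # Sketch only: the actual llama-stack model has additional fields.
    role: str = "assistant"
    content: str | None = None
    reasoning_content: str | None = None  # vLLM <= v0.8.4 field name
    reasoning: str | None = None          # vLLM >= v0.9.x field name


class OpenAIChoiceDelta(BaseModel):
    # Sketch only: reasoning_content already existed; reasoning is the addition.
    content: str | None = None
    reasoning_content: str | None = None
    reasoning: str | None = None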

References:

  • vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
  • vLLM Issue #12468: vllm-project/vllm#12468

Test Plan

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --reasoning-parser deepseek_r1
  
llama stack run starter

curl http://localhost:8321/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "messages": [
        {"role": "user", "content": "What is 25 * 4?"}
      ]
    }'

{"id":"chatcmpl-9df9d2a5f849bbe0","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"\n\nTo calculate \\(25 \\times 4\\), follow these easy steps:\n\n1. **Understand the Multiplication:**\n   \n   \\(25 \\times 4\\) means you are adding the number 25 four times.\n   \n   \\[\n   25 + 25 + 25 + 25 = 100\n   \\]\n\n2. **Break Down the Multiplication:**\n   \n   - Multiply 25 by 2:\n     \\[\n     25 \\times 2 = 50\n     \\]\n   - Then multiply the result by 2:\n     \\[\n     50 \\times 2 = 100\n     \\]\n\n3. **Final Answer:**\n   \n   \\[\n   \\boxed{100}\n   \\]","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"reasoning":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n","reasoning_content":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n"},"stop_reason":null,"token_ids":null}],"created":1764187386,"model":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":356,"prompt_tokens":14,"total_tokens":370,"completion_tokens_details":null,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":[{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089063Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"prompt_tokens","value":14,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089072Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"completion_tokens","value":356,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089075Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"total_tokens","value":370,"unit":"tokens"}]}%

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 26, 2025
@cdoern marked this pull request as ready for review November 26, 2025 20:06
@cdoern (Collaborator, Author) commented Nov 26, 2025

this one can wait until CI is back; want to make sure this doesn't break engines which don't support the field.

Add support for reasoning fields in OpenAI-compatible chat completion
messages to enable compatibility with vLLM reasoning parsers.

Changes:
- Add `reasoning_content` and `reasoning` fields to OpenAIAssistantMessageParam
- Add `reasoning` field to OpenAIChoiceDelta (reasoning_content already existed)

Both field names are supported for maximum compatibility:
- `reasoning_content`: Used by vLLM ≤ v0.8.4
- `reasoning`: New field name in vLLM ≥ v0.9.x

vLLM documentation recommends migrating to the shorter `reasoning` field
name, but maintains backward compatibility with `reasoning_content`.

These fields allow reasoning models to return their chain-of-thought
process alongside the final answer, which is crucial for transparency
and debugging with reasoning models.

References:
- vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
- vLLM Issue #12468: vllm-project/vllm#12468

Signed-off-by: Charlie Doern <[email protected]>
@github-actions bot (Contributor) commented Dec 1, 2025

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat: add reasoning and reasoning_content fields to OpenAI message types


⚠️ llama-stack-client-node studio · code · diff

There was a regression in your SDK.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/70bd51e76d5f76cbc45e6601ae9bca4c451e5ea0/dist.tar.gz
New diagnostics (5 warnings)

⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputListOpenAIResponseMessageUnionOpenAIResponseInputFunctionToolCallOutputOpenAIResponseMessageInput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `DataOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
Please rename it using the 'renameValue' transform.

⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
Please rename it using the 'renameValue' transform.

⚠️ llama-stack-client-kotlin studio · code · diff

There was a regression in your SDK.
generate ⚠️ · lint ❗ (prev: lint ✅) → test ❗

llama-stack-client-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@64d5dda5cfc86540ad8fdc7fcb937cbe9f93e9ce
llama-stack-client-python studio · code · diff

generate ⚠️ · build ⏳ · lint ⏳ · test ⏳

⏳ These are partial results; builds are still running.



Review thread on the `reasoning_content` property in the generated OpenAPI spec:
Contributor:

@cdoern can we get away with adding just the reasoning field? while we are doing this, can you think about how we should add Gemini-3's "encrypted thought summaries" field also?

cdoern (Collaborator, Author):

yeah I can probably just use reasoning, I can look into the gemini field

cdoern (Collaborator, Author):

hmmm actually @ashwinb, OpenAIChoiceDelta (used for streaming) already has reasoning_content. so should we mimic that and support both reasoning and reasoning_content for streaming/non-streaming?

cdoern (Collaborator, Author):

added support for thought_signatures b99eb2a

pulled this from gemini docs: https://ai.google.dev/gemini-api/docs/thought-signatures

Contributor:

We have a somewhat non-trivial decision to make here in terms of the API shape for reasoning. Gemini chooses extra_content to transport all new fields, whereas vLLM has clearly gone another way and used the fact that OpenAI's SDK uses "TypedDicts" which are amenable to adding sub-fields directly in other places (reasoning_content). I somewhat like vLLM's decision better and would rather add two sub-fields: (1) reasoning and (2) thought_signature to both the streaming (ChunkDelta) and non-streaming (Chunk) fields.

Do folks who have worked closer to inference have thoughts here @mattf @bbrowning?
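For concreteness, the two shapes under discussion side by side, written as illustrative payload fragments (Python dicts; the content and reasoning strings are made up, and neither shape is a committed llama-stack design):

# vLLM-style: reasoning travels as sibling fields on the assistant message.
vllm_style_message = {
    "role": "assistant",
    "content": "The answer is 100.",
    "reasoning": "25 * 2 = 50, and 50 * 2 = 100.",
}

# Gemini-style: provider data is nested under extra_content, here on a tool
# call, mirroring the thought_signature example shown below in this thread.
gemini_style_tool_call = {
    "id": "function-call-1",
    "type": "function",
    "function": {"name": "check_flight", "arguments": '{"flight":"AA100"}'},
    "extra_content": {"google": {"thought_signature": "<Signature A>"}},
}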

@cdoern (Collaborator, Author) commented Dec 1, 2025:

yeah, I am fine with either approach here. One half-baked idea I have is that each provider could have a custom version of this class that extends it, e.g. vLLMOpenAIMessageParams(OpenAIMessageParams), and we could add custom content there somehow?

the gemini docs outline their thought_signature support as:

{
  "role": "model",
  "tool_calls": [
    {
      "extra_content": {
        "google": {
          "thought_signature": "<Signature A>"
        }
      },
      "function": {
        "arguments": "{\"flight\":\"AA100\"}",
        "name": "check_flight"
      },
      "id": "function-call-1",
      "type": "function"
    }
  ]
}

so I wonder if this will work without extra_content?

according to gemini docs: https://ai.google.dev/gemini-api/docs/thought-signatures

thought_signature lives in the extra_content of the tool_call; add extra_content to OpenAIChatCompletionToolCall to support this

Signed-off-by: Charlie Doern <[email protected]>
@cdoern changed the title from "feat: add reasoning and reasoning_content fields to OpenAI message types" to "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types" Dec 1, 2025
@mattf (Collaborator) left a comment:

few things upfront -

  • please clarify the title by mentioning this is for /v1/chat/completions. that will make it more clear to anyone reading the release notes.
  • vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request. (aside: it looks like we test against 0.9.2)

that said, i don't think we should do this.

we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}

they can convert -

curl http://localhost:8321/v1/chat/completions \
  -d '{
    "model": "a-reasoning-model",
    "messages": [
      {
        "role": "user",
        "content": "what is 0.857142 (repeating) as a fraction?"
      }
    ]
  }'

into -

$ curl http://localhost:8321/v1/responses \
  -d '{ 
    "reasoning": {"effort": "high", "summary": "detailed"},
    "model": "a-reasoning-model",
    "input": [
      {
        "role": "user",
        "content": "what is 0.857142 (repeating) as a fraction?"
      }
    ]
  }'

@cdoern (Collaborator, Author) commented Dec 2, 2025

@mattf:

> vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request

this is fair, so does that mean I should only do reasoning here if we do this?

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}

I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

@cdoern changed the title from "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types" to "feat: add reasoning, reasoning_content, extra_content fields to OpenAI message types for v1/chat/completions" Dec 2, 2025
@mattf (Collaborator) commented Dec 2, 2025

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

we allow users to pass in params that we simply forward to the backend inference provider. we could formally do the same for output. at least the openai-python sdk will let you do this.

$ nc -l 8000 <<EOF
HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "fake-123",
  "object": "chat.completion",
  "model": "ignored",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello from the fake model!"}, "finish_reason": "stop"}],
  "extra_info": {"foo": "bar", "latency_ms": 12}
}
EOF
$ uv run --with openai python -c 'from openai import OpenAI; response = OpenAI(base_url="http://127.0.0.1:8000", api_key="dummy").chat.completions.create(model="ignored", messages=[{"role": "user", "content": "Hello?"}]); print(response.extra_info)'
{'foo': 'bar', 'latency_ms': 12}

@bbrowning (Collaborator) commented:

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.
>
> Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:

  1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.

  2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.

It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.

Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.
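To make the multi-turn point concrete, a hypothetical second-turn Chat Completions payload that echoes the first turn's reasoning back to the model (the field name follows the vLLM convention discussed above, and the messages are invented; whether a given backend actually consumes the field is provider-dependent):

followup_request = {
    "model": "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {"role": "user", "content": "What is 25 * 4?"},
        {
            "role": "assistant",
            "content": "25 * 4 = 100.",
            # Reasoning returned in turn one, sent back for turn two.
            "reasoning_content": "25 * 2 = 50, and 50 * 2 = 100.",
        },
        {"role": "user", "content": "Now divide that by 8."},
    ],
}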

@mattf (Collaborator) commented Dec 2, 2025

> > we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
> >
> > I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.
> >
> > Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?
>
> Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:
>
> 1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.
>
> 2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.
>
> It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.
>
> Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.

those are great principles. (2) especially makes sense for compliant APIs.

we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

@cdoern (Collaborator, Author) commented Dec 2, 2025

> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api

While this is true, I don't think people would be broken by these optional extensions. Like if ollama doesn't support reasoning, a user does not need to pass it in or receive the output but can still use this OpenAI message type.

The more I talk about this though, maybe the better solution here is to not have these at the top level of our inference API and instead do some sort of provider-specific inherited version of these types? I can think through what that'd look like if folks would prefer @mattf @bbrowning?

@bbrowning (Collaborator) commented:

> those are great principles. (2) especially makes sense for compliant APIs.
>
> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.

However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.
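As a rough sketch of what "loose enough" could mean with pydantic v2 (this is not the actual llama-stack model definition, just an illustration of tolerating extras instead of enumerating them):

from pydantic import BaseModel, ConfigDict


class OpenAIAssistantMessageParam(BaseModel):
    # Sketch only: accept provider-specific extras so fields like reasoning,
    # reasoning_content, or extra_content pass through unvalidated.
    model_config = ConfigDict(extra="allow")

    role: str = "assistant"
    content: str | None = None


msg = OpenAIAssistantMessageParam.model_validate(
    {"role": "assistant", "content": "100", "reasoning": "25 * 4 = 100"}
)
print(msg.model_extra)  # {'reasoning': '25 * 4 = 100'}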

@mattf (Collaborator) commented Dec 3, 2025

> > those are great principles. (2) especially makes sense for compliant APIs.
> >
> > we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.
>
> This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.
>
> However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.

spot on. the unvalidated i/o path opens the user to more risk by helping tie apps to a specific stack configuration.

i'm -0 on this. if someone is going to do it, please take the unvalidated path so we avoid codifying provider-specific implementation details in the public api.

hopefully the user is ok moving to /v1/responses.

@ashwinb (Contributor) commented Dec 3, 2025

Summarizing this discussion so far:

  1. we should not update our public API to add reasoning details in chat completions
  2. while still maintaining (1), we should allow for the transport of provider-specific fields (for example, to vLLM and Gemini) in both directions.
