
Conversation

@Rocketknight1 (Member) commented Sep 15, 2025

This PR is a replacement for #39609. The idea is that models can include a message schema, allowing model output to be parsed into a structured form. The original plan was to allow parsing of the entire chat history, essentially the inverse operation of apply_chat_template, but the schemas involved were too complex and there was no realistic hope that users would be able to write them!

This PR simplifies things - we focus only on parsing the output generated by the model. This is mainly relevant for tool calling and chain of thought models, both of which emit structured output that often needs manual handling before it can be appended to the chat.

The output schema is stored as a key on the tokenizer. It consists of a JSON schema representing the structure of messages emitted by the model, with additional `x-` keys that indicate how parsing should be performed. Parsing is mostly done through regexes, but there is also support for common tool call formats like JSON to be parsed directly, without you having to write an entire JSON regex parser 😅
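To give a rough sense of the shape involved, here is an illustrative sketch only; the exact `x-` key names used here (`x-regex`, `x-parser`) are placeholders and may not match the final spec:

```python
# Illustrative response schema sketch; key names and structure are assumptions, not the final spec.
response_schema = {
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        # Hypothetical x-regex: pull the reasoning text out of <think>...</think> tags
        "thinking": {"type": "string", "x-regex": r"<think>(.*?)</think>"},
        # Hypothetical x-regex: everything after the closing </think> tag is the visible reply
        "content": {"type": "string", "x-regex": r"</think>\s*(.*)$"},
        # Hypothetical x-parser: parse each <tool_call> body as JSON rather than with a handwritten regex
        "tool_calls": {
            "type": "array",
            "x-regex": r"<tool_call>\s*(.*?)\s*</tool_call>",
            "x-parser": "json",
        },
    },
}
```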

Work to do:

  • Actually move parsing onto the tokenizer
  • Add support to TextGenerationPipeline
  • Add load/saving of output schemas
  • Document, document, document (50% done)
  • Figure out what we're calling it ("chat parsing"?)
  • Make sure extra fields don't break older versions
  • Support parsing in Processor classes too
  • Support parsing in ImageTextToText pipeline
  • Write schemas for some popular models
    • GPT-OSS
    • Cohere
    • ERNIE
    • Deepseek (tool calling isn't working in their template, will fix after)
    • SmolLM3
    • Qwen3
    • Qwen3-coder (requires xml support)

Documentation to do:

  • Expand parse_response explanation and show how it works with TextGenerationPipeline
  • Move the writing guide to a separate doc, and expand it?
  • Write the reference for the allowed x- fields

Open questions:

  • Fold chat_parsing_utils.py into chat_template_utils.py?

@Rocketknight1 mentioned this pull request Sep 15, 2025

@Rocketknight1 marked this pull request as ready for review September 17, 2025 14:33
@Rocketknight1 force-pushed the chat_schemas branch 2 times, most recently from 1dfdb73 to 6610d38 on September 17, 2025 16:27
@Rocketknight1 (Member Author):

cc @zucchini-nlp do you know what the popular VLMs using reasoning or tool calls are right now? I'd like to add support + testing in the processor too.

@zucchini-nlp (Member):

From the ones we already have in the library, the Ovis2 chat template has some parts for tool usage. Other than that, I haven't seen explicit reasoning/tools in templates.

@Rocketknight1 (Member Author):

Thank you! I'll take a look at Ovis2; if nothing comes of it, I can worry about adding Processor schemas when we actually need them.

@Rocketknight1 changed the title from "Chat output schemas" to "Chat response parsing" Sep 22, 2025
@Rocketknight1 force-pushed the chat_schemas branch 2 times, most recently from b8c0a94 to ab99161 on September 23, 2025 16:28
"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
Member:

FYI vLLM and LiteLLM use reasoning_content in their parsers. Do we want to use that convention in your examples?

Member Author:

Hmm, I think we often use `thinking` because that's the key that chat templates use in their input! The idea here is that the returned dict should be ready to append to the chat history without further user intervention.
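As a sketch of what that looks like in practice (assuming a `tokenizer` with a response schema set, `decoded_output` from `generate`, and `messages` holding the current chat):

```python
parsed = tokenizer.parse_response(decoded_output)
# e.g. {"role": "assistant", "thinking": "...", "content": "...", "tool_calls": [...]}
messages.append(parsed)  # no manual cleanup needed before the next apply_chat_template call
```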

Member:

I see, that makes sense - is that just for gpt-oss or have you seen other models adopt thinking too in their chat templates?

To clarify, my question was about whether returning reasoning_content would provide a drop-in replacement for the vLLM reasoning parsers. No strong opinion either way :)

Member Author:

Now that you mention it, a lot of LLMs drop the thinking block entirely in their template, because they don't render thinking blocks from past turns. We could probably switch to reasoning_content without too much pain!

Contributor:

Also +1 for `reasoning_content`. But there is also a chance here to standardize, maybe along the lines of https://standardcompletions.org/.

@qgallouedec (Member) left a comment:

This is very useful for huggingface/trl#4115.


def parse_response(self, response: str, schema: Optional[Union[list, dict]] = None):
    if schema is None:
        if getattr(self, "response_schema", None) is None:
Member:

nit

Suggested change:
- if getattr(self, "response_schema", None) is None:
+ if not hasattr(self, "response_schema"):

Member Author:

I think this fails in the case where `self.response_schema = None`, which we might set at init!
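A minimal illustration of the difference (hypothetical class, just to show the two checks):

```python
class Tok:
    def __init__(self):
        self.response_schema = None  # attribute exists, but no schema was provided

tok = Tok()
print(hasattr(tok, "response_schema"))                # True -> `not hasattr(...)` would never trigger
print(getattr(tok, "response_schema", None) is None)  # True -> the original check handles this case
```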

@Rocketknight1 (Member Author) commented Sep 24, 2025

Made a quick demo so reviewers can try this out. Just switch to the chat_schemas branch and then:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's 6 * 8?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1])

If it works, you should see the output correctly split into thinking and content blocks!
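The exact text will vary with sampling, but the parsed last message should have roughly this shape (values illustrative):

```python
# Illustrative shape only; the actual thinking/content text depends on the model's output.
{
    "role": "assistant",
    "thinking": "The user wants 6 * 8, which is 48...",
    "content": "6 * 8 = 48.",
}
```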

And here's a quick demo for tool calling:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
    location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, tools=[get_current_weather], max_new_tokens=512)
print(out[0]["generated_text"][-1])

The tool call should be correctly parsed as a key in the response dict.
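Concretely, the last message should come back looking roughly like this (argument values illustrative), matching the standard tool-call format discussed further down:

```python
# Illustrative shape only.
{
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris"},
            },
        }
    ],
}
```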

@alecsolder:

Hey, I've recently been working on the tool call handling in vLLM and came across this. Lots of really cool stuff going on here!

A couple questions I have:

  • Would it be possible to support using a library instead of a regex for parsing the output from the model? The Harmony library from OpenAI is already pretty complicated, and replicating its logic in a regex would be pretty tough.
  • Would it be possible to implement this at the token level instead of after decoding? Doing it at the token level would remove the risk of using text versions of special tokens to break parsing.

@bbrowning:

  • Would it be possible to implement this at the token level instead of after decoding? Doing it at the token level would remove the risk of using text versions of special tokens to break parsing.

To this point, the Harmony library expects to deal directly with token ids, right? So if we allow integration with model provider libraries for this parsing, we may have to support token ids.

@aarnphm (Contributor) left a comment:

Hi there, I help with structured outputs and tool calling on vLLM.

I left some comments here. Let me know if there is anything we can help with.

There is also a WIP tool call parser on our end that depends largely on the XGrammar structural tag, so I just want to make sure we aren't duplicating any of the work here.

"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also +1 for reasoning_content. But there is also a chance here to standardize

Maybe along the line of https://standardcompletions.org/

Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
Contributor:

My 2 cents: I do think it might be beneficial to keep the parser implementation behind `tokenizer.parse_response` in huggingface/tokenizers (i.e. the Rust implementation).

For the openai/harmony format, they also seem very performant.

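As an aside, here is a minimal usage sketch of the workflow described in the excerpt above (the checkpoint name, `my_schema`, and `generated_text` are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/model")  # placeholder checkpoint
tokenizer.response_schema = my_schema                    # any valid response schema dict

parsed = tokenizer.parse_response(generated_text)        # ready-to-append message dict
tokenizer.save_pretrained("my-model-with-schema")        # the schema is saved alongside the tokenizer
```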

## Developers: Complex schemas

Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser

## Developers: Understanding a simple response schema

Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents

How should we handle some tool_calls that are in XML format?
For example, Qwen3-Coder.

Member Author:

In general, our inputs/outputs are in JSON schema format, even when models render them in a different format. We expect the input to a chat template to be JSON schema, or equivalent Python, and the decoded output with chat parsing would be as well. This was to enable a consistent API across models.

This is true even when the model does something totally different, like rendering tool calls in XML! In that case, the chat template and parser should translate the standard API to XML and back.
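For example (purely illustrative; these XML tags are made up and don't correspond to any particular model's real format):

```python
# Standard-API form, used in the chat history on both the input and output side:
tool_call = {
    "type": "function",
    "function": {"name": "get_current_weather", "arguments": {"location": "Paris"}},
}

# A hypothetical XML rendering that a chat template might produce for an XML-style model;
# the response parser would be responsible for mapping this back to the dict above.
xml_render = (
    "<tool_call>\n"
    "  <name>get_current_weather</name>\n"
    '  <arg name="location">Paris</arg>\n'
    "</tool_call>"
)
```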

Member Author:

It's likely (assuming we go with this feature and don't replace it with something more like Structural Tag) that we'd add an xml parser to the spec as well, like the json parser that already exists.

@Ubospica left a comment:

Thanks for the reference @aarnphm! This is a really cool and useful PR! We’ve also had discussions with the vLLM team about how to build a unified tool calling parser interface, and I think this will be very helpful as well.

In XGrammar, we previously implemented a Structural Tag whose goal is to describe various kinds of structures. With guided decoding, it ensures the output strictly follows the defined structure, while also leaving room for potential support of parsing output in the future. This will be merged into the vLLM/SGLang/TensorRT main branches within the next two weeks. Its docs can be found at https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html. It also supports constraining output in XML format.

I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.

@aarnphm (Contributor) commented Sep 25, 2025

> I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.

I would also like to see a push to upstream this to xgrammar here as well.

@Rocketknight1 (Member Author):

@Ubospica yes, Structural Tag looks very similar to this! Constrained generation and output parsing obviously have a lot of overlap, since they both define an output schema of some kind. Do you know how output parsing was intended to be implemented for Structural Tag?

@Rocketknight1 (Member Author):

Actually, a better question: If we want to align with XGrammar, should we try to extend the Structural Tag spec to allow output parsing, or should we have an output schema for parsing that's separate from the Structural Tag?

@Rocketknight1 (Member Author) commented Sep 26, 2025

I thought about it for a while and here's what I have:

The main thing response parsing needs that structural tag doesn't have is a way to map segments of the output to paths in the parsed message dict. In particular, it is very common for LLMs to render tool calls or tool definitions as JSON schema, but not in the standard OpenAI format our API expects for tool calls. For example, they may rename "arguments" to "parameters", or they may leave out the "function" key, etc.

The goal of chat parsing is that the model should return a message dictionary that is ready to be appended to the chat without any further parsing from the user, which means it must be in the standard API format. This means we need a way to map "segments" of the structural tag schema to segments of the output dict. If the model emits:

<tool_call>
{
   "name": "get_current_weather",
   "parameters": {
       "location": "Paris"
    }
}
</tool_call>

The output we want from parsing is:

{
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {
                    "location": "Paris"
                }
            }
        }
    ]
}

It's easy to express the model output above as a structural tag schema, but it's trickier to do the mapping from there to the output format we want. It may even be strictly impossible in some cases - some of the structural tag keys like regex and particularly grammar don't seem like they have a good way to separate internal segments. With regex we could have named capturing groups, maybe, but with EBNF grammars it won't work, without some nonstandard additions to the EBNF spec. Even if we could make it work, it seems like it would clutter the structural tag spec, which right now is clean and usable.

Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?
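For what it's worth, here's a rough sketch of the named-capturing-group idea for the regex case (illustrative only, and it doesn't help with the EBNF case):

```python
import json
import re

text = '<tool_call>\n{"name": "get_current_weather", "parameters": {"location": "Paris"}}\n</tool_call>'

# A named group lets the regex label a segment of the output...
pattern = re.compile(r"<tool_call>\s*(?P<call>\{.*?\})\s*</tool_call>", re.DOTALL)
raw = json.loads(pattern.search(text).group("call"))

# ...but the remapping to the standard API format still has to be specified somewhere.
tool_call = {"type": "function", "function": {"name": raw["name"], "arguments": raw["parameters"]}}
```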


@qgallouedec (Member):

For information, multiple tool calls aren't supported yet:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
    location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris and London?"
messages = [
    {"role": "user", "content": prompt}
]

tools = [get_current_weather]
out = pipe(messages, tools=tools, max_new_tokens=512)
print(out[0]["generated_text"][-1])
Traceback (most recent call last):
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 114, in recursive_parse
    parsed_json = json.loads(node_content)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/json/decoder.py", line 341, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 3 column 1 (char 69)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/fsx/qgallouedec/trl/parsing.py", line 22, in <module>
    out = pipe(messages, tools=tools, max_new_tokens=512)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/text_generation.py", line 325, in __call__
    return super().__call__(Chat(text_inputs), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/base.py", line 1289, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/base.py", line 1297, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/text_generation.py", line 530, in postprocess
    assistant_message = self.tokenizer.parse_response(all_text)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/tokenization_utils_base.py", line 1803, in parse_response
    return recursive_parse(response, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 173, in recursive_parse
    parsed_schema[key] = recursive_parse(node_content[key], child_node)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 116, in recursive_parse
    raise ValueError(
ValueError: Node has JSON parser but could not parse its contents as JSON: 
{"name": "get_current_weather", "arguments": {"location": "Paris"}}
</tool_call>

<tool_call>
{"name": "get_current_weather", "arguments": {"location": "London"}}

Error: Extra data: line 3 column 1 (char 69)
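For context, the failure is `json.loads` being handed the first JSON object plus everything up to the last closing tag; here is a minimal sketch of the situation and of per-block parsing (illustrative only, not the actual fix in this PR):

```python
import json
import re

# Roughly the node content from the error above: the first JSON object, followed by extra text.
node_content = """{"name": "get_current_weather", "arguments": {"location": "Paris"}}
</tool_call>

<tool_call>
{"name": "get_current_weather", "arguments": {"location": "London"}}"""

# json.loads(node_content) raises "Extra data" because text remains after the first JSON object.
# Parsing each <tool_call>...</tool_call> block separately works fine:
full_text = f"<tool_call>\n{node_content}\n</tool_call>"
calls = [json.loads(c) for c in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", full_text, re.DOTALL)]
print(calls)  # two parsed tool calls
```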

@Ubospica commented Sep 29, 2025

> Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?

Thanks for the great discussion here! Previously xgrammar had a proposal for parsing some output text with the structural tag at mlc-ai/xgrammar#303. The API is like

def parse_with_structural_tag(input_str, structural_tag) -> StructuralTagResult: pass

It's not finished yet, but it should be easy to pick it back up.

This maps the output text to the structure of the StructuralTag. To further use it in tool calling (and also parallel tool calling), we still need to map it into the OpenAI format (the "tool_calls": format you mentioned).

I agree with you that with regex/EBNF the output is harder to parse. It's possible with XGrammar's built-in Earley parser, but it may produce multiple possible parse results, and we would need to determine which one to use. Maybe we can restrict the parser to StructuralTags without any regex/EBNF content.

One possible solution is to have an extra layer of abstraction for tool calling, lower that into a structural tag, use the structural tag parser to handle it, and then convert the result to the OpenAI tool calling format. The benefits could be: 1) it's easier to handle the different tool calling formats of different models; 2) we can unify constrained decoding (or guided decoding) and tool call parsing.

@Rocketknight1 (Member Author):

Quick update here: I experimented with dropping this PR and instead adding parsing support to XGrammar constrained generation schemas. However, after experimentation, I think there's a big difference between the schemas needed for constrained generation and the schemas needed for output parsing. In particular, constrained generation schemas need a lot of knowledge about tool argument names and types if they want to ensure that all tool calls are valid, but a parser can skip all this and simply record the model output with a generic schema that works for all tools.

I think constrained generation schemas are still very useful! But I'm much less confident that it's a good idea to overload them with both tasks; there's actually a lot of friction between them. It probably makes more sense for models to have both a "response schema" and a "generation schema", and this PR will remain focused on the first.
