
Conversation

@Rocketknight1 (Member) commented Sep 15, 2025

This PR is a replacement for #39609. The idea is that models can include a message schema, allowing model output to be parsed into a structured form. The original plan was to allow parsing of the entire chat history, essentially the inverse operation of apply_chat_template, but the schemas involved were too complex and there was no realistic hope that users would be able to write them!

This PR simplifies things - we focus only on parsing the output generated by the model. This is mainly relevant for tool calling and chain of thought models, both of which emit structured output that often needs manual handling before it can be appended to the chat.

The output schema is stored as a key on the tokenizer. It consists of a JSON schema representing the structure of messages emitted by the model, with additional `x-` keys that indicate how parsing should be performed. Parsing is mostly done through regexes, but there is also support for common tool call formats like JSON to be parsed directly, without you having to write an entire JSON regex parser 😅
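To give a rough sense of the shape involved, here is an illustrative sketch only; the exact `x-` key names used here (`x-regex`, `x-parser`) are placeholders and may not match the final spec:

```python
# Illustrative response schema sketch; key names and structure are assumptions, not the final spec.
response_schema = {
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        # Hypothetical x-regex: pull the reasoning text out of <think>...</think> tags
        "thinking": {"type": "string", "x-regex": r"<think>(.*?)</think>"},
        # Hypothetical x-regex: everything after the closing </think> tag is the visible reply
        "content": {"type": "string", "x-regex": r"</think>\s*(.*)$"},
        # Hypothetical x-parser: parse each <tool_call> body as JSON rather than with a handwritten regex
        "tool_calls": {
            "type": "array",
            "x-regex": r"<tool_call>\s*(.*?)\s*</tool_call>",
            "x-parser": "json",
        },
    },
}
```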

Work to do:

  • Actually move parsing onto the tokenizer
  • Add support to TextGenerationPipeline
  • Add load/saving of output schemas
  • Document, document, document (50% done)
  • Figure out what we're calling it ("chat parsing"?)
  • Make sure extra fields don't break older versions
  • Support parsing in Processor classes too
  • Support parsing in ImageTextToText pipeline
  • Write schemas for some popular models
    • GPT-OSS
    • Cohere
    • ERNIE
    • Deepseek (tool calling isn't working in their template, will fix after)
    • SmolLM3
    • Qwen3
    • Qwen3-coder (requires xml support)

Documentation to do:

  • Expand parse_response explanation and show how it works with TextGenerationPipeline
  • Move the writing guide to a separate doc, and expand it?
  • Write the reference for the allowed x- fields

Open questions:

  • Fold chat_parsing_utils.py into chat_template_utils.py?

@Rocketknight1 mentioned this pull request Sep 15, 2025

@Rocketknight1 marked this pull request as ready for review September 17, 2025 14:33
@Rocketknight1 force-pushed the chat_schemas branch 2 times, most recently from 1dfdb73 to 6610d38 on September 17, 2025 16:27
@Rocketknight1 (Member Author):

cc @zucchini-nlp do you know what the popular VLMs using reasoning or tool calls are right now? I'd like to add support + testing in the processor too.

@zucchini-nlp (Member):

From the ones we already have in the library, the Ovis2 chat template has some parts for tool usage. Other than that, I haven't seen explicit reasoning/tools in templates.

@Rocketknight1 (Member Author):

Thank you! I'll take a look at Ovis2; if nothing comes of it, I can worry about adding Processor schemas when we actually need them.

@Rocketknight1 changed the title from "Chat output schemas" to "Chat response parsing" Sep 22, 2025
@Rocketknight1 force-pushed the chat_schemas branch 2 times, most recently from b8c0a94 to ab99161 on September 23, 2025 16:28
"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
Member:

FYI vLLM and LiteLLM use reasoning_content in their parsers. Do we want to use that convention in your examples?

Member Author:

Hmm, I think we often use `thinking` because that's the key that chat templates use in their input! The idea here is that the returned dict should be ready to append to the chat history without further user intervention.
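As a sketch of what that looks like in practice (assuming a `tokenizer` with a response schema set, `decoded_output` from `generate`, and `messages` holding the current chat):

```python
parsed = tokenizer.parse_response(decoded_output)
# e.g. {"role": "assistant", "thinking": "...", "content": "...", "tool_calls": [...]}
messages.append(parsed)  # no manual cleanup needed before the next apply_chat_template call
```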

Member:

I see, that makes sense - is that just for gpt-oss or have you seen other models adopt thinking too in their chat templates?

To clarify, my question was about whether returning reasoning_content would provide a drop-in replacement for the vLLM reasoning parsers. No strong opinion either way :)

Member Author:

Now that you mention it, a lot of LLMs drop the thinking block entirely in their template, because they don't render thinking blocks from past turns. We could probably switch to reasoning_content without too much pain!

Contributor:

Also +1 for `reasoning_content`. But there is also a chance here to standardize, maybe along the lines of https://standardcompletions.org/.

@qgallouedec (Member) left a comment:

This is very useful for huggingface/trl#4115.


def parse_response(self, response: str, schema: Optional[Union[list, dict]] = None):
    if schema is None:
        if getattr(self, "response_schema", None) is None:
Member:

nit

Suggested change:
- if getattr(self, "response_schema", None) is None:
+ if not hasattr(self, "response_schema"):

Member Author:

I think this fails in the case where `self.response_schema = None`, which we might set at init!
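A minimal illustration of the difference (hypothetical class, just to show the two checks):

```python
class Tok:
    def __init__(self):
        self.response_schema = None  # attribute exists, but no schema was provided

tok = Tok()
print(hasattr(tok, "response_schema"))                # True -> `not hasattr(...)` would never trigger
print(getattr(tok, "response_schema", None) is None)  # True -> the original check handles this case
```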

@Rocketknight1 (Member Author) commented Sep 24, 2025

Made a quick demo so reviewers can try this out. Just switch to the chat_schemas branch and then:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's 6 * 8?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1])

If it works, you should see the output correctly split into thinking and content blocks!
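The exact text will vary with sampling, but the parsed last message should have roughly this shape (values illustrative):

```python
# Illustrative shape only; the actual thinking/content text depends on the model's output.
{
    "role": "assistant",
    "thinking": "The user wants 6 * 8, which is 48...",
    "content": "6 * 8 = 48.",
}
```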

And here's a quick demo for tool calling:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
    location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, tools=[get_current_weather], max_new_tokens=512)
print(out[0]["generated_text"][-1])

The tool call should be correctly parsed as a key in the response dict.
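Concretely, the last message should come back looking roughly like this (argument values illustrative), matching the standard tool-call format discussed further down:

```python
# Illustrative shape only.
{
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris"},
            },
        }
    ],
}
```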

@alecsolder:

Hey, I've recently been working on the tool call handling in vLLM and came across this. Lots of really cool stuff going on here!

A couple questions I have:

  • Would it be possible to support using a library instead of a regex for parsing the output from the model? The Harmony library from OpenAI is already pretty complicated, and replicating its logic in a regex would be pretty tough.
  • Would it be possible to implement this at the token level instead of after decoding? Doing it at the token level would remove the risk of using text versions of special tokens to break parsing.

@bbrowning:

  • Would it be possible to implement this at the token level instead of after decoding? Doing it at the token level would remove the risk of using text versions of special tokens to break parsing.

To this point, the Harmony library expects to deal directly with token ids, right? So if we allow integration with model provider libraries for this parsing, we may have to support token ids.

@aarnphm (Contributor) left a comment:

Hi there, I help with structured outputs and tool calling on vLLM.

I left some comments here. Let me know if there is anything we can help with.

There is also a WIP tool call parser on our end that depends largely on the XGrammar structural tag, so I just want to make sure we aren't duplicating any of the work here.

"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also +1 for reasoning_content. But there is also a chance here to standardize

Maybe along the line of https://standardcompletions.org/

Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
Contributor:

My 2 cents: I do think it might be beneficial to keep the parser implementation behind `tokenizer.parse_response` in huggingface/tokenizers (i.e. the Rust implementation).

For the openai/harmony format, they also seem very performant.

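As an aside, here is a minimal usage sketch of the workflow described in the excerpt above (the checkpoint name, `my_schema`, and `generated_text` are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/model")  # placeholder checkpoint
tokenizer.response_schema = my_schema                    # any valid response schema dict

parsed = tokenizer.parse_response(generated_text)        # ready-to-append message dict
tokenizer.save_pretrained("my-model-with-schema")        # the schema is saved alongside the tokenizer
```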

## Developers: Complex schemas

Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser

## Developers: Understanding a simple response schema

Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents

How should we handle some tool_calls that are in XML format?
For example, Qwen3-Coder.

Member Author:

In general, our inputs/outputs are in JSON schema format, even when models render them in a different format. We expect the input to a chat template to be JSON schema, or equivalent Python, and the decoded output with chat parsing would be as well. This was to enable a consistent API across models.

This is true even when the model does something totally different, like rendering tool calls in XML! In that case, the chat template and parser should translate the standard API to XML and back.
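For example (purely illustrative; these XML tags are made up and don't correspond to any particular model's real format):

```python
# Standard-API form, used in the chat history on both the input and output side:
tool_call = {
    "type": "function",
    "function": {"name": "get_current_weather", "arguments": {"location": "Paris"}},
}

# A hypothetical XML rendering that a chat template might produce for an XML-style model;
# the response parser would be responsible for mapping this back to the dict above.
xml_render = (
    "<tool_call>\n"
    "  <name>get_current_weather</name>\n"
    '  <arg name="location">Paris</arg>\n'
    "</tool_call>"
)
```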

Member Author:

It's likely (assuming we go with this feature and don't replace it with something more like Structural Tag) that we'd add an xml parser to the spec as well, like the json parser that already exists.

@Ubospica left a comment:

Thanks for the reference @aarnphm! This is a really cool and useful PR! We’ve also had discussions with the vLLM team about how to build a unified tool calling parser interface, and I think this will be very helpful as well.

In XGrammar, we previously implemented a Structural Tag whose goal is to describe various kinds of structures. With guided decoding, it ensures the output strictly follows the defined structure, while also leaving room for potential support of parsing output in the future. This will be merged into the vLLM/SGLang/TensorRT main branches within the next two weeks. Its docs can be found at https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html. It also supports constraining output in XML format.

I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.

@aarnphm (Contributor) commented Sep 25, 2025

> I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.

I would also like to see a push to upstream this to xgrammar here as well.

@Rocketknight1 (Member Author):

@Ubospica yes, Structural Tag looks very similar to this! Constrained generation and output parsing obviously have a lot of overlap, since they both define an output schema of some kind. Do you know how output parsing was intended to be implemented for Structural Tag?

@Rocketknight1 (Member Author):

Actually, a better question: If we want to align with XGrammar, should we try to extend the Structural Tag spec to allow output parsing, or should we have an output schema for parsing that's separate from the Structural Tag?

@Rocketknight1 (Member Author) commented Sep 26, 2025

I thought about it for a while and here's what I have:

The main thing response parsing needs that structural tag doesn't have is a way to map segments of the output to paths in the parsed message dict. In particular, it is very common for LLMs to render tool calls or tool definitions as JSON schema, but not in the standard OpenAI format our API expects for tool calls. For example, they may rename "arguments" to "parameters", or they may leave out the "function" key, etc.

The goal of chat parsing is that the model should return a message dictionary that is ready to be appended to the chat without any further parsing from the user, which means it must be in the standard API format. This means we need a way to map "segments" of the structural tag schema to segments of the output dict. If the model emits:

<tool_call>
{
   "name": "get_current_weather",
   "parameters": {
       "location": "Paris"
    }
}
</tool_call>

The output we want from parsing is:

{
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {
                    "location": "Paris"
                }
            }
        }
    ]
}

It's easy to express the model output above as a structural tag schema, but it's trickier to do the mapping from there to the output format we want. It may even be strictly impossible in some cases - some of the structural tag keys like regex and particularly grammar don't seem like they have a good way to separate internal segments. With regex we could have named capturing groups, maybe, but with EBNF grammars it won't work, without some nonstandard additions to the EBNF spec. Even if we could make it work, it seems like it would clutter the structural tag spec, which right now is clean and usable.

Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?
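For what it's worth, here's a rough sketch of the named-capturing-group idea for the regex case (illustrative only, and it doesn't help with the EBNF case):

```python
import json
import re

text = '<tool_call>\n{"name": "get_current_weather", "parameters": {"location": "Paris"}}\n</tool_call>'

# A named group lets the regex label a segment of the output...
pattern = re.compile(r"<tool_call>\s*(?P<call>\{.*?\})\s*</tool_call>", re.DOTALL)
raw = json.loads(pattern.search(text).group("call"))

# ...but the remapping to the standard API format still has to be specified somewhere.
tool_call = {"type": "function", "function": {"name": raw["name"], "arguments": raw["parameters"]}}
```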


@qgallouedec (Member):

For information, multiple tool calls aren't supported yet:

from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
    location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris and London?"
messages = [
    {"role": "user", "content": prompt}
]

tools = [get_current_weather]
out = pipe(messages, tools=tools, max_new_tokens=512)
print(out[0]["generated_text"][-1])
Traceback (most recent call last):
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 114, in recursive_parse
    parsed_json = json.loads(node_content)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/json/decoder.py", line 341, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 3 column 1 (char 69)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/fsx/qgallouedec/trl/parsing.py", line 22, in <module>
    out = pipe(messages, tools=tools, max_new_tokens=512)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/text_generation.py", line 325, in __call__
    return super().__call__(Chat(text_inputs), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/base.py", line 1289, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/base.py", line 1297, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/pipelines/text_generation.py", line 530, in postprocess
    assistant_message = self.tokenizer.parse_response(all_text)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/tokenization_utils_base.py", line 1803, in parse_response
    return recursive_parse(response, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 173, in recursive_parse
    parsed_schema[key] = recursive_parse(node_content[key], child_node)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/utils/chat_parsing_utils.py", line 116, in recursive_parse
    raise ValueError(
ValueError: Node has JSON parser but could not parse its contents as JSON: 
{"name": "get_current_weather", "arguments": {"location": "Paris"}}
</tool_call>

<tool_call>
{"name": "get_current_weather", "arguments": {"location": "London"}}

Error: Extra data: line 3 column 1 (char 69)
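For context, the failure is `json.loads` being handed the first JSON object plus everything up to the last closing tag; here is a minimal sketch of the situation and of per-block parsing (illustrative only, not the actual fix in this PR):

```python
import json
import re

# Roughly the node content from the error above: the first JSON object, followed by extra text.
node_content = """{"name": "get_current_weather", "arguments": {"location": "Paris"}}
</tool_call>

<tool_call>
{"name": "get_current_weather", "arguments": {"location": "London"}}"""

# json.loads(node_content) raises "Extra data" because text remains after the first JSON object.
# Parsing each <tool_call>...</tool_call> block separately works fine:
full_text = f"<tool_call>\n{node_content}\n</tool_call>"
calls = [json.loads(c) for c in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", full_text, re.DOTALL)]
print(calls)  # two parsed tool calls
```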

@Ubospica commented Sep 29, 2025

> Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?

Thanks for the great discussion here! Previously xgrammar had a proposal for parsing some output text with the structural tag at mlc-ai/xgrammar#303. The API is like

def parse_with_structural_tag(input_str, structural_tag) -> StructuralTagResult: pass

It's not finished yet, but it should be easy to pick it back up.

This maps the output text to the structure of the StructuralTag. To further use it in tool calling (and also parallel tool calling), we still need to map it into the OpenAI format (the "tool_calls": format you mentioned).

I agree with you that with regex/EBNF the output is harder to parse. It's possible with XGrammar's built-in Earley parser, but it may produce multiple possible parse results, and we would need to determine which one to use. Maybe we can restrict the parser to StructuralTags without any regex/EBNF content.

One possible solution is to have an extra layer of abstraction for tool calling, lower that into a structural tag, use the structural tag parser to handle it, and then convert the result to the OpenAI tool calling format. The benefits could be: 1) it's easier to handle the different tool calling formats of different models; 2) we can unify constrained decoding (or guided decoding) and tool call parsing.

@Rocketknight1 (Member Author):

Quick update here: I experimented with dropping this PR and instead adding parsing support to XGrammar constrained generation schemas. However, after experimentation, I think there's a big difference between the schemas needed for constrained generation and the schemas needed for output parsing. In particular, constrained generation schemas need a lot of knowledge about tool argument names and types if they want to ensure that all tool calls are valid, but a parser can skip all this and simply record the model output with a generic schema that works for all tools.

I think constrained generation schemas are still very useful! But I'm much less confident that it's a good idea to overload them with both tasks; there's actually a lot of friction between them. It probably makes more sense for models to have both a "response schema" and a "generation schema", and this PR will remain focused on the first.
