63 commits
bc5418e
Initial commit
Rocketknight1 Jul 21, 2025
33fc019
Adding more tests, bugfixes, starting tool tests
Rocketknight1 Jul 23, 2025
a78147f
Add support for JSON parsers and some tool tests
Rocketknight1 Jul 24, 2025
21f430f
stash commit
Rocketknight1 Jul 28, 2025
61814ad
stash commit
Rocketknight1 Aug 5, 2025
4edbd58
stash commit
Rocketknight1 Aug 20, 2025
d7e1a96
stash commit
Rocketknight1 Aug 22, 2025
35cd82e
stash commit
Rocketknight1 Aug 26, 2025
ca2545d
Fix cohere schema, fix a lot of the recursive parser code
Rocketknight1 Aug 27, 2025
5b4d8a2
GPT-OSS passing too!
Rocketknight1 Aug 27, 2025
72097c8
Update tests
Rocketknight1 Aug 28, 2025
b1ff13d
make fixup
Rocketknight1 Aug 28, 2025
8b307c7
Offset tracking partially done
Rocketknight1 Aug 29, 2025
67a8cda
stash commit
Rocketknight1 Aug 29, 2025
067bee2
stash commit
Rocketknight1 Aug 29, 2025
8e84d69
Assistant masking Just Works
Rocketknight1 Aug 29, 2025
a84b933
make fixup
Rocketknight1 Aug 29, 2025
05807f0
stash commit
Rocketknight1 Sep 3, 2025
67193a3
stash commit
Rocketknight1 Sep 3, 2025
b72afe6
JMESPath approach
Rocketknight1 Sep 10, 2025
7063b0e
stash commit before i rip this PR apart
Rocketknight1 Sep 12, 2025
c2484e8
Remove broken offset code
Rocketknight1 Sep 12, 2025
72dc308
Remove broken offset code
Rocketknight1 Sep 12, 2025
558b6cf
Update chat parsing code and add tests for Ernie + fix Cohere tests f…
Rocketknight1 Sep 15, 2025
8f6f897
Implement tokenizer method
Rocketknight1 Sep 16, 2025
0551396
jmespath dependency handling
Rocketknight1 Sep 16, 2025
cd2b6f2
Completed TODOs
Rocketknight1 Sep 16, 2025
4d531a3
Add support to TextGenerationPipeline
Rocketknight1 Sep 16, 2025
84f73fd
Update GPT-OSS schema and test cases
Rocketknight1 Sep 17, 2025
0deff35
make fixup
Rocketknight1 Sep 17, 2025
a32712e
Fix typing (??)
Rocketknight1 Sep 17, 2025
8fd283f
missing future import
Rocketknight1 Sep 17, 2025
505b044
Use old typing in tokenization_utils_base.py
Rocketknight1 Sep 17, 2025
cebde25
put jmespath in various extras
Rocketknight1 Sep 17, 2025
60d4b86
Remove accidental newline
Rocketknight1 Sep 17, 2025
fc556fe
Guard tests correctly
Rocketknight1 Sep 17, 2025
7b76324
Remove require_jinja on the schema tests since we don't actually appl…
Rocketknight1 Sep 17, 2025
8925935
make fixup
Rocketknight1 Sep 17, 2025
5b54b47
fix some bad linter changes
Rocketknight1 Sep 18, 2025
8e13d15
Fix docstring
Rocketknight1 Sep 18, 2025
808a628
Push draft documentation
Rocketknight1 Sep 18, 2025
ad5d2a4
Extend tests, more documentation
Rocketknight1 Sep 19, 2025
968cc6d
make fixup
Rocketknight1 Sep 19, 2025
0321b93
docs docs docs
Rocketknight1 Sep 22, 2025
ce43e68
Add Processor support
Rocketknight1 Sep 22, 2025
3107fe5
Add to toctree
Rocketknight1 Sep 22, 2025
cca3216
Flag markdown correctly
Rocketknight1 Sep 22, 2025
d1808fb
Remove double backslashes in docs for simplicity
Rocketknight1 Sep 22, 2025
e98629b
Simplify node-regex-to-dict
Rocketknight1 Sep 22, 2025
05aa04e
Add support to ImageTextToTextPipeline
Rocketknight1 Sep 23, 2025
2c0b076
Add support to ImageTextToTextPipeline and save/loading support in Pr…
Rocketknight1 Sep 23, 2025
bd11548
Begin reworking docs to start fitting in response parsing
Rocketknight1 Sep 23, 2025
5d9e87b
Fix rebase
Rocketknight1 Sep 23, 2025
d04febf
Expand documentation further
Rocketknight1 Sep 23, 2025
e0c892e
Expand documentation further
Rocketknight1 Sep 23, 2025
f17465c
Refactor x-regex-to-dict to x-regex-key-value, update the parser logi…
Rocketknight1 Sep 24, 2025
4229ced
Refactor x-regex-to-dict to x-regex-key-value, update the parser logi…
Rocketknight1 Sep 24, 2025
89daa6a
More docs update
Rocketknight1 Sep 24, 2025
06c1782
Update TextGenerationPipeline to support tools properly
Rocketknight1 Sep 24, 2025
b3921b5
Some rebase fixes
Rocketknight1 Oct 8, 2025
41ad8a8
Re-add is_jmespath_available
Rocketknight1 Oct 8, 2025
2587fa7
Re-add is_jmespath_available
Rocketknight1 Oct 8, 2025
58d8965
Add Qwen3 parser and test, add maybe-json support
Rocketknight1 Oct 10, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -88,6 +88,8 @@
       title: Tool use
     - local: chat_templating_writing
       title: Writing a chat template
+    - local: chat_response_parsing
+      title: Response parsing
     title: Chat with models
   - sections:
     - local: serving
229 changes: 229 additions & 0 deletions docs/source/en/chat_response_parsing.md
@@ -0,0 +1,229 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Response Parsing

It is increasingly common for chat models to generate structured outputs, rather than just a single reply string.
The most common uses for structured outputs are [tool calling](./chat_extras) and [reasoning models](https://huggingface.co/reasoning-course).
Tool calling models can output tool calls, containing the name of the tool to call and any arguments to be passed to it,
while reasoning models often output reasoning steps as a "chain of thought". Some recent models even use both of these,
and may output reasoning and/or one or more tool calls before their final answer.

Models with structured outputs pose a challenge for chat templating, because the output needs to be parsed before it
can be appended to the chat. For a concrete example, let's say we ask [GPT-OSS](https://huggingface.co/openai/gpt-oss-120b)
what the weather is like, and it thinks and decides to call a tool. Here's what the raw model output might look like:

```
<|start|><|assistant|><|channel|>analysis<|message|>The user asks: "What is the weather like in SF?" We need to get the location of the user? The user explicitly asks about SF (San Francisco).
So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data.
So we should call get_current_weather with location "San Francisco, CA". Let's do that.

We will call function get_current_weather.<|end|><|start|>assistant<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{
"location": "San Francisco, CA"
}
```

And here's what that output would look like as a chat message dict:

```json
{
    "role": "assistant",
    "thinking": "The user asks: \"What is the weather like in SF?\" We need to get the location of the user? The user explicitly asks about SF (San Francisco). So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data. So we should call get_current_weather with location \"San Francisco, CA\". Let's do that.",
    "tool_calls": [
        {
            "name": "get_current_weather",
            "arguments": {
                "location": "San Francisco, CA"
            }
        }
    ]
}
```

Chat **templates** give us a way to turn messages into formatted input for a model, but we need something else to
parse model output back into a standard message dict. This is what chat **parsing** is for.

## The `parse_response` method

Parsing a chat response on a model that supports it is straightforward. Simply take the raw, decoded output from
`generate()`, and pass it to the tokenizer's `parse_response` method:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": "Hey! Can you summarize the end of the Cold War as briefly as possible? Like, comically briefly. It should really leave out almost all of the relevant information."
    }
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
parsed = tokenizer.parse_response(out_text)
print(parsed)
```

And that's all you need to start using response parsing! `parse_response` should return a complete message dict that is ready to be appended to the chat history.
When a tokenizer does not support response parsing, `parse_response` will raise an error. We hope to add support
to more tokenizers over time.
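Because the parsed output is a standard message dict, you can append it straight to the conversation and keep going. A minimal sketch continuing the example above:

```python
# The parsed dict is a regular chat message, so it can go directly into the history
messages.append(parsed)
messages.append({"role": "user", "content": "Great! Now do the French Revolution, same style."})

# Template and generate the next turn exactly as before
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)
```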

## Developers: Understanding a simple response schema

Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents the structure of the output message dict.

How should we handle some tool_calls that are in XML format?
For example, Qwen3-Coder.

Member Author:
In general, our inputs/outputs are in JSON schema format, even when models render them in a different format. We expect the input to a chat template to be JSON schema, or equivalent Python, and the decoded output with chat parsing would be as well. This was to enable a consistent API across models.

This is true even when the model does something totally different, like rendering tool calls in XML! In that case, the chat template and parser should translate the standard API to XML and back.

Member Author:
It's likely (assuming we go with this feature and don't replace it with something more like Structural Tag) that we'd add an xml parser to the spec as well, like the json parser that already exists.

The schema is augmented with additional fields that indicate how the
output message string should be parsed into the expected format. Let's take a look at the schema for a SmolLM response,
excluding tool calls for now:

```python
{
    "x-regex": r"(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"}
    }
}
```

Member:
FYI vLLM and LiteLLM use reasoning_content in their parsers. Do we want to use that convention in your examples?

Member Author:
Hmmn, I think we often use thinking because that's the key that chat templates use in their input! The idea here is that the returned dict should be ready to append to the chat history without further user intervention.

Member:
I see, that makes sense - is that just for gpt-oss or have you seen other models adopt thinking too in their chat templates?

To clarify, my question was about whether returning reasoning_content would provide a drop-in replacement for the vLLM reasoning parsers. No strong opinion either way :)

Member Author:
Now that you mention it, a lot of LLMs drop the thinking block entirely in their template, because they don't render thinking blocks from past turns. We could probably switch to reasoning_content without too much pain!

Contributor:
Also +1 for reasoning_content. But there is also a chance here to standardize, maybe along the lines of https://standardcompletions.org/


We can see that the schema describes a JSON "object" (a `dict`, in other words) with three keys: `role`, `content`, and `thinking`.
Because all assistant responses have the role "assistant", the `role` key is a `const`(ant). The other two keys are strings, extracted
from the named groups in the regex in the `x-regex` field.
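To see the extraction in action, here is roughly what happens at this node, sketched with `re` directly (the real parser wraps this in more machinery):

```python
import re

# The regex from the schema above
pattern = r"(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)"
raw = "<think>\nThe user wants a one-line summary.\n</think>\nIt ended. The wall fell.<|im_end|>"

match = re.search(pattern, raw, re.DOTALL)
# The named groups map directly onto the "thinking" and "content" properties
print(match.groupdict())
# {'thinking': 'The user wants a one-line summary.', 'content': 'It ended. The wall fell.'}
```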

Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, the schema is saved with the tokenizer, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
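In practice, that looks something like this minimal sketch, where `response_schema` stands for the schema dict shown above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# `response_schema` is assumed to be the schema dict from the example above
tokenizer.response_schema = response_schema
parsed = tokenizer.parse_response("<think>\nPondering...\n</think>\nHello!<|im_end|>")

# Like a chat template, the schema travels with the tokenizer when you save it
tokenizer.save_pretrained("my-tokenizer-with-parsing")
```
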
Contributor:
My 2 cents: I do think it might be beneficial to keep the parser implementation of `tokenizer.parse_response` in huggingface/tokenizers (i.e. the Rust implementation).

For the openai/harmony format, they do seem very performant as well.


## Developers: Complex schemas

Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
internals. For this, we'll use the `GPT-OSS` schema. GPT-OSS emits both tool calls and thinking blocks, and it uses
an unusual format where model responses are tagged with one of three "channels": `commentary` for things like
tool calls, `analysis` for chain of thought blocks, and `final` for messages intended to be sent to the user.
A full message where the model calls a tool named `get_current_weather` might look like this, with some extra linebreaks added for clarity:

```text
<|channel|>analysis<|message|>
The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA.
We need to call get_current_weather function. So we should call get_current_weather with location "San Francisco, CA".
<|end|>
<|start|>assistant<|channel|>commentary
to=functions.get_current_weather <|constrain|>json<|message|>
{
"location": "San Francisco, CA"
}
<|call|>
```

Parsing proceeds recursively; the output of a regex (or other parser) at one level becomes the input to the nodes below it.
In other words, don't feel like you have to parse the entire output in one enormous regex! Instead, start with the schema,
and then add regexes to extract the relevant chunks as you go. Here's a schema that will parse it, with some
explanatory comments:

```python
{
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        # "content" and "thinking" are both similar to the previous example, and just extract a single string.
        # However, rather than using a single regex with named groups to extract both, we use a regex in each subkey.
        # When an object node has no parser/regex, the entire input string is passed to all of its children, so
        # parsing can either be done with named groups at the object level, or with separate regexes at the property level.
        "content": {"type": "string", "x-regex": r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)"},
        "thinking": {"type": "string", "x-regex": r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>"},
        "tool_calls": {
            # "x-regex-iterator" uses re.findall to find multiple possible matches, and returns them as an
            # array/list. You don't need to worry about array handling, though - each item in the array will be
            # parsed by the `items` schema, so just write the schema for a single item.
            "x-regex-iterator": r"<\|channel\|>commentary (to=functions\..*?<\|message\|>.*?)(?:<\|call\|>|$)",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    # A const property is a fixed value, and the input has no effect on it.
                    "type": {"const": "function"},
                    # Here, we wrap the entire tool call dict in a `{"function": ...}` block. The input string is passed through to it unchanged.
                    "function": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "x-regex": r"^to=functions\.(\w+)"},
                            "arguments": {
                                "type": "object",
                                "x-regex": r"<\|message\|>(.*)",
                                # The "x-parser" field indicates that the extracted string should be parsed as JSON.
                                # The output is then passed to the schema nodes below and recursive parsing continues.
                                "x-parser": "json",
                                "additionalProperties": {"type": "any"},
                            },
                        },
                    },
                },
            },
        },
    },
}
```
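Applied to the sample message above, this schema should produce a dict along these lines (the analysis text is abbreviated here, and there is no `content` key because the sample contains no `final` channel message):

```python
{
    "role": "assistant",
    "thinking": 'The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA. ...',
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "San Francisco, CA"},
            },
        }
    ],
}
```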

## Developers: Understanding the parser logic

The parser follows a few simple rules:

1. Each level of the schema receives input from the level above, applies any regex or parser it has, and then passes the output to its children.
2. The root level receives the entire decoded model output string as input.
3. If a node has structured content after parsing (for example, if the regex has named groups and returns a dict, or if the parser returns a dict or list),
then that structured content is mapped to the node's children, and each child node receives its corresponding value as input.
4. If an `object` (dict) node has unstructured (string) output, then the entire string is passed to all of its children. This allows child nodes
to handle parsing individually rather than requiring a single parent regex to extract all keys at once.
5. If an `array` (list) node has unstructured (string) output, then this throws an error.

There is a small set of allowable `x-` keys that indicate how parsing should be done at each node:
- `x-regex`: A regex string to apply to the input. If the regex has named groups, the output is a dict of group names to values. Named groups should only be used in `object` nodes.
Otherwise, the regex must have exactly one unnamed capturing group, and the output is the value of that group as a string.
- `x-regex-iterator`: A regex string to apply to the input using `re.findall()`. The output is a list of all matches.
This should only be used in `array` nodes, and the regex must have exactly one unnamed capturing group. The output is distributed to
the node's `items` schema.
- `x-parser`: Calls a built-in parser to apply to the input. Currently, the only supported parser is `json`, which parses the input string as JSON.
The output is passed to the child nodes for further parsing. Note that the `json` parser can return deeply nested output - in this case, the output
will be progressively unwrapped as it is passed through child nodes. The child nodes do not need additional `x-parser` or `x-regex` fields in this case,
but their structure must match the structure of the parsed JSON.
- `x-parser-args`: Only allowed in conjunction with `x-parser`. This is a dict of additional arguments that control parsing. Right now, the only supported
argument is `transform`, which specifies a `jmespath` transformation to apply to the output. This is useful when the JSON parser returns a structure
that needs to be modified to match the schema.
- `x-regex-key-value`: This is rarely necessary, but it can be useful when parsing key-value pairs in non-JSON format where the names of the keys are not known
in advance, such as when a model emits XML tool calls with arbitrary argument names. The regex must have exactly two named capturing groups,
`key` and `value`, and the output is a dict mapping keys to values. This should only be used in `object` nodes.
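As a purely hypothetical illustration of `x-regex-key-value`, a node for parsing invented XML-style arguments like `<arg name="location">San Francisco, CA</arg>` could look like this:

```python
{
    "type": "object",
    # Each match contributes one key/value pair to the output dict,
    # so the argument names don't need to be known in advance
    "x-regex-key-value": r'<arg name="(?P<key>\w+)">(?P<value>.*?)</arg>',
    "additionalProperties": {"type": "string"},
}
```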

In general, multiple regexes/parsers cannot be combined at the same level. The exception is that `x-regex`, returning a single string, can be combined with the other parsers. In this case,
`x-regex` is applied first, and then the output is passed to the other parser, either `x-regex-iterator`, `x-parser`, or `x-regex-key-value`.

Putting these ideas together, you can see that the input flows through the schema, being parsed at each level and then distributed to child nodes. Each level
only needs to extract the input content that is relevant for that part of the schema, and can then let its child nodes handle the rest. Internally, this is handled
with a parser function that receives input, applies any regexes/parsers at the current level, then maps the result to its child nodes before recursively calling itself on each of them.
Recursion terminates when it reaches leaf nodes, usually primitive types like `string` or `number`, which simply return the input they receive.
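
To make the recursion concrete, here is a heavily simplified sketch of that parser function. It only covers `x-regex`, `x-regex-iterator`, and the `json` parser, and it omits the validation and error handling that a real implementation needs:

```python
import json
import re


def parse_node(schema: dict, text):
    """Apply this node's regex/parser, then recurse into its children."""
    value = text
    if "x-regex" in schema:
        match = re.search(schema["x-regex"], value, re.DOTALL)
        if match is None:
            return None
        # Named groups yield a dict of group names to values; otherwise
        # we take the regex's single unnamed capturing group
        value = match.groupdict() or match.group(1)
    if "x-regex-iterator" in schema:
        value = re.findall(schema["x-regex-iterator"], value, re.DOTALL)
    if schema.get("x-parser") == "json":
        value = json.loads(value)

    if "const" in schema:
        # Const nodes ignore their input entirely
        return schema["const"]
    if schema.get("type") == "array":
        # Each element is parsed by the `items` schema
        return [parse_node(schema["items"], item) for item in value]
    if schema.get("type") == "object" and "properties" in schema:
        result = {}
        for key, child_schema in schema["properties"].items():
            # Structured output is mapped to children by key; unstructured
            # (string) output is passed to every child unchanged
            child_input = value.get(key) if isinstance(value, dict) else value
            child_value = parse_node(child_schema, child_input)
            if child_value is not None:
                result[key] = child_value
        return result
    # Leaf nodes (and objects with only additionalProperties) return their input
    return value
```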
4 changes: 4 additions & 0 deletions docs/source/en/chat_templating_multimodal.md
@@ -109,6 +109,10 @@ print(processor.decode(out[0]))

The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.

## Response parsing

Processors support response parsing in the same way as tokenizers: if the processor has a `response_schema` set, its `parse_response` method converts raw decoded model output back into a standard message dict. See the [Response Parsing](./chat_response_parsing) guide for details on how schemas work.
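A minimal sketch, assuming the processor for this checkpoint ships a response schema, and where `inputs` stands for the tokenized prompt used in the generation call above:

```python
# Decode only the newly generated tokens, then parse them into a message dict
generated = out[0][inputs["input_ids"].shape[1]:]
parsed = processor.parse_response(processor.decode(generated))
print(parsed)
```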

## Video inputs

Some vision models also support video inputs. The message format is very similar to the format for [image inputs](#image-inputs).
3 changes: 2 additions & 1 deletion setup.py
@@ -118,6 +118,7 @@
     "importlib_metadata",
     "ipadic>=1.0.0,<2.0",
     "jinja2>=3.1.0",
+    "jmespath>=1.0.1",
     "kenlm",
     "kernels>=0.10.2,<0.11",
     "librosa",
@@ -297,7 +298,7 @@ def run(self):
 extras["sentencepiece"] = deps_list("sentencepiece", "protobuf")
 extras["tiktoken"] = deps_list("tiktoken", "blobfile")
 extras["mistral-common"] = deps_list("mistral-common[opencv]")
-extras["chat_template"] = deps_list("jinja2")
+extras["chat_template"] = deps_list("jinja2", "jmespath")
 extras["testing"] = (
     deps_list(
         "pytest",
1 change: 1 addition & 0 deletions src/transformers/dependency_versions_table.py
@@ -27,6 +27,7 @@
     "importlib_metadata": "importlib_metadata",
     "ipadic": "ipadic>=1.0.0,<2.0",
     "jinja2": "jinja2>=3.1.0",
+    "jmespath": "jmespath>=1.0.1",
     "kenlm": "kenlm",
     "kernels": "kernels>=0.10.2,<0.11",
     "librosa": "librosa",
8 changes: 5 additions & 3 deletions src/transformers/pipelines/image_text_to_text.py
@@ -502,9 +502,11 @@ def postprocess(
                 ]
             else:
                 # When we're not starting from a prefill, the output is a new assistant message
-                generated_text = list(prompt_text.messages) + [
-                    {"role": "assistant", "content": generated_text}
-                ]
+                if self.processor.response_schema:
+                    assistant_message = self.processor.parse_response(generated_text)
+                else:
+                    assistant_message = {"role": "assistant", "content": generated_text}
+                generated_text = list(prompt_text.messages) + [assistant_message]
             full_texts.append(generated_text)
         generated_texts = full_texts

19 changes: 18 additions & 1 deletion src/transformers/pipelines/text_generation.py
@@ -152,6 +152,8 @@ def _sanitize_parameters(
         continue_final_message=None,
         skip_special_tokens=None,
         tokenizer_encode_kwargs=None,
+        tools=None,
+        documents=None,
         **generate_kwargs,
     ):
         # preprocess kwargs
@@ -170,6 +172,11 @@
             preprocess_params["max_length"] = max_length
             generate_kwargs["max_length"] = max_length
 
+        if tools is not None:
+            preprocess_params["tools"] = tools
+        if documents is not None:
+            preprocess_params["documents"] = documents
+
         if prefix is not None:
             preprocess_params["prefix"] = prefix
         if prefix:
@@ -335,6 +342,8 @@ def preprocess(
         max_length=None,
         continue_final_message=None,
         tokenizer_encode_kwargs=None,
+        tools=None,
+        documents=None,
         **generate_kwargs,
     ):
         # Only set non-None tokenizer kwargs, so as to rely on the tokenizer's defaults
@@ -359,6 +368,8 @@
                 continue_final_message=continue_final_message,
                 return_dict=True,
                 return_tensors="pt",
+                tools=tools,
+                documents=documents,
                 **tokenizer_kwargs,
             )
         else:
@@ -506,6 +517,7 @@ def postprocess(
             continue_final_message = prompt_text.messages[-1]["role"] == "assistant"
             if continue_final_message:
                 # With assistant prefill, concat onto the end of the last message
+                # TODO Chat decoding when continue_final_message is set
                 all_text = list(prompt_text.messages)[:-1] + [
                     {
                         "role": prompt_text.messages[-1]["role"],
@@ -514,7 +526,12 @@
                 ]
             else:
                 # When we're not starting from a prefill, the output is a new assistant message
-                all_text = list(prompt_text.messages) + [{"role": "assistant", "content": all_text}]
+                if self.tokenizer.response_schema:
+                    assistant_message = self.tokenizer.parse_response(all_text)
+                else:
+                    # If there's no schema, then we have to assume it's all content
+                    assistant_message = {"role": "assistant", "content": all_text}
+                all_text = list(prompt_text.messages) + [assistant_message]
             record = {"generated_text": all_text}
             for key, values in split_keys.items():
                 record[key] = values[idx]