
Commit 264cce9

Rocketknight1, qgallouedec, and gante authored
Chat response parsing (#40894)
* Initial commit
* Adding more tests, bugfixes, starting tool tests
* Add support for JSON parsers and some tool tests
* stash commit
* stash commit
* stash commit
* stash commit
* stash commit
* Fix cohere schema, fix a lot of the recursive parser code
* GPT-OSS passing too!
* Update tests
* make fixup
* Offset tracking partially done
* stash commit
* stash commit
* Assistant masking Just Works
* make fixup
* stash commit
* stash commit
* JMESPath approach
* stash commit before i rip this PR apart
* Remove broken offset code
* Remove broken offset code
* Update chat parsing code and add tests for Ernie + fix Cohere tests for new format
* Implement tokenizer method
* jmespath dependency handling
* Completed TODOs
* Add support to TextGenerationPipeline
* Update GPT-OSS schema and test cases
* make fixup
* Fix typing (??)
* missing future import
* Use old typing in tokenization_utils_base.py
* put jmespath in various extras
* Remove accidental newline
* Guard tests correctly
* Remove require_jinja on the schema tests since we don't actually apply chat templates there
* make fixup
* fix some bad linter changes
* Fix docstring
* Push draft documentation
* Extend tests, more documentation
* make fixup
* docs docs docs
* Add Processor support
* Add to toctree
* Flag markdown correctly
* Remove double backslashes in docs for simplicity
* Simplify node-regex-to-dict
* Add support to ImageTextToTextPipeline
* Add support to ImageTextToTextPipeline and save/loading support in Processors
* Begin reworking docs to start fitting in response parsing
* Fix rebase
* Expand documentation further
* Expand documentation further
* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section
* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section
* More docs update
* Update TextGenerationPipeline to support tools properly
* Some rebase fixes
* Re-add is_jmespath_available
* Re-add is_jmespath_available
* Add Qwen3 parser and test, add maybe-json support
* Rollback processor changes - we'll wait for legacy saving to be deprecated
* Make fixup
* Revert ImageTextToText changes for now
* Add pipeline test
* make fixup
* Resolve a todo
* Resolve more TODOs and clean up the spec a little
* Add ref in the tools doc
* Update docs/source/en/chat_response_parsing.md (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update src/transformers/utils/chat_parsing_utils.py (Co-authored-by: Joao Gante <[email protected]>)
* Add a docstring for parse_response
* Add function docstring and reference it in the docs
* Fix generate link
* Revert Processor changes for now
* Use updated GPT-OSS format
* Print the dict keys instead of the whole dict so the example doesn't become too big

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
1 parent 3f2db2c commit 264cce9

File tree

13 files changed, +911 −4 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -88,6 +88,8 @@
       title: Tool use
     - local: chat_templating_writing
       title: Writing a chat template
+    - local: chat_response_parsing
+      title: Response parsing
     title: Chat with models
 - sections:
   - local: serving
```

docs/source/en/chat_extras.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -95,9 +95,12 @@ print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
 
 The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
 
-A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history.
+A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history. For
+models that support [response parsing](./chat_response_parsing), the response parsing will be handled automatically, and you can just use
+[`~PreTrainedTokenizer.parse_response`] to extract the tool call. For other models, you'll need to manually translate the output
+string into a tool call dict.
 
-Hold the call in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
+Regardless of the approach you use, the tool call should go in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
 
 > [!WARNING]
 > Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
```
docs/source/en/chat_response_parsing.md

Lines changed: 233 additions & 0 deletions (new file)
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Response Parsing

It is increasingly common for chat models to generate structured outputs, rather than just a single reply string.
The most common uses for structured outputs are [tool calling](./chat_extras) and [reasoning models](https://huggingface.co/reasoning-course).
Tool calling models can output tool calls, containing the name of the tool to call and any arguments to be passed to it,
while reasoning models often output reasoning steps as a "chain of thought". Some recent models even use both of these,
and may output reasoning and/or one or more tool calls before their final answer.
Models with structured outputs pose a challenge for chat templating, because the output needs to be parsed before it
can be appended to the chat. For a concrete example, let's say we ask [GPT-OSS](https://huggingface.co/openai/gpt-oss-120b)
what the weather is like, and it thinks and decides to call a tool. Here's what the raw model output might look like:

```txt
<|start|>analysis<|message|>The user asks: "What is the weather like in SF?" We need to get the location of the user? The user explicitly asks about SF (San Francisco).
So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data.
So we should call get_current_weather with location "San Francisco, CA". Let's do that.
We will call function get_current_weather.<|end|><|start|>commentary to=functions.get_current_weather<|channel|>commentary <|constrain|>json<|message|>{"location":"San Francisco, CA"}<|call|>
```
But if you want to append this to a chat, you'll need to format it as a chat message dict, like this:

```json
{
    "role": "assistant",
    "thinking": "The user asks: \"What is the weather like in SF?\" We need to get the location of the user? The user explicitly asks about SF (San Francisco). So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data. So we should call get_current_weather with location \"San Francisco, CA\". Let's do that.",
    "tool_calls": [
        {
            "name": "get_current_weather",
            "arguments": {
                "location": "San Francisco, CA"
            }
        }
    ]
}
```

Chat **templates** give us a way to turn messages into formatted input for a model, but we need something else to
parse model output back into a standard message dict. This is what chat **parsing** is for.
## The [`~PreTrainedTokenizerBase.parse_response`] method

Parsing a chat response on a model that supports it is straightforward. Simply take the raw, decoded output from
[`~generation.GenerationMixin.generate`], and pass it to the tokenizer's [`~PreTrainedTokenizerBase.parse_response`] method:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": "Hey! Can you summarize the end of the Cold War as briefly as possible? Like, comically briefly. It should really leave out almost all of the relevant information."
    }
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
parsed = tokenizer.parse_response(out_text)
print(parsed.keys())
```

And you should get:

```text
dict_keys(['thinking', 'content'])
```
And that's all you need to start using response parsing! `parse_response` should return a complete message dict that is ready to be appended to the chat history.
When the tokenizer does not support response parsing, `parse_response` will throw an error. We hope to add support
to more tokenizers over time.
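Because the parsed output is a standard message dict, you can append it directly and keep the conversation going. A minimal sketch, continuing the example above:

```python
# Append the parsed assistant message and ask a follow-up question.
messages.append(parsed)
messages.append({"role": "user", "content": "Great, now do the fall of Rome."})

# Re-apply the chat template to continue the conversation with the parsed history.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)
```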
## Developers: Understanding a simple response schema

Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
the structure of the output message dict. The schema is augmented with additional fields that indicate how the
output message string should be parsed into the expected format. Let's take a look at the schema for a SmolLM response,
excluding tool calls for now:

```python
{
    "x-regex": r"(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"}
    }
}
```

We can see that the schema describes a JSON "object" (a `dict`, in other words) with three keys: `role`, `content`, and `thinking`.
Because all assistant responses have the role "assistant", the `role` key is a `const`(ant). The other two keys are strings, extracted
from the named groups in the regex in the `x-regex` field.

Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, this schema will be saved with the tokenizer, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
## Developers: Complex schemas

Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
internals. For this, we'll use the `GPT-OSS` schema. GPT-OSS emits both tool calls and thinking blocks, and it uses
an unusual format where model responses are tagged with one of three "channels": `commentary` for things like
tool calls, `analysis` for chain of thought blocks, and `final` for messages intended to be sent to the user.
A full message where the model calls a tool named `get_current_weather` might look like this, with some extra linebreaks added for clarity:

```text
<|channel|>analysis<|message|>
The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA.
We need to call get_current_weather function. So we should call get_current_weather with location "San Francisco, CA".
<|end|>
<|start|>assistant<|channel|>commentary
to=functions.get_current_weather <|constrain|>json<|message|>
{
    "location": "San Francisco, CA"
}
<|call|>
```
Parsing proceeds recursively; the output of a regex (or other parser) at one level becomes the input to the nodes below it.
In other words, don't feel like you have to parse the entire output in one enormous regex! Instead, start with the schema,
and then add regexes to extract the relevant chunks as you go. Here's a schema that will parse it, with some
explanatory comments:

```python
{
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        # "content" and "thinking" are both similar to the previous example, and just extract a single string.
        # However, rather than using a single regex with named groups to extract both, we use a regex in each subkey.
        # When an object node has no parser/regex, the entire input string is passed to all of its children, so
        # parsing can either be done with named groups at the object level, or with separate regexes at the property level.
        "content": {"type": "string", "x-regex": r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)"},
        "thinking": {"type": "string", "x-regex": r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>"},
        "tool_calls": {
            # "x-regex-iterator" uses re.findall to find multiple possible matches, and returns them as an
            # array/list. You don't need to worry about array handling, though - each item in the array will be
            # parsed by the `items` schema, so just write the schema for a single item.
            "x-regex-iterator": r"<\|channel\|>commentary (to=functions\..*?<\|message\|>.*?)(?:<\|call\|>|$)",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    # A const property is a fixed value, and the input has no effect on it.
                    "type": {"const": "function"},
                    # Here, we wrap the entire tool call dict in a `{"function": ...}` block. The input string is passed through to it unchanged.
                    "function": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "x-regex": r"^to=functions\.(\w+)"},
                            "arguments": {
                                "type": "object",
                                "x-regex": r"<\|message\|>(.*)",
                                # The "x-parser" field indicates that the extracted string should be parsed as JSON.
                                # The output is then passed to the schema nodes below and recursive parsing continues.
                                "x-parser": "json",
                                "additionalProperties": {"type": "any"},
                            },
                        },
                    },
                },
            },
        },
    },
}
```
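To see this schema in action, here's a hedged sketch that feeds the example message above through `parse_response`. The variable name `schema` is ours (it holds the dict from the block above), and the exact GPT-OSS checkpoint may already ship its own schema:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
tokenizer.response_schema = schema  # the dict from the example above

raw = (
    '<|channel|>analysis<|message|>The user asks: "What is the weather like in SF?" ...<|end|>'
    '<|start|>assistant<|channel|>commentary to=functions.get_current_weather '
    '<|constrain|>json<|message|>{"location": "San Francisco, CA"}<|call|>'
)
parsed = tokenizer.parse_response(raw)
print(parsed["tool_calls"][0]["function"]["name"])  # expected: get_current_weather
```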
## Developers: Understanding the parser logic

The parser follows a few simple rules:

1. Each level of the schema receives input from the level above, applies any regex or parser it has, and then passes the output to its children.
2. The root level receives the entire decoded model output string as input.
3. If a node has structured content after parsing (for example, if the regex has named groups and returns a dict, or if the parser returns a dict or list),
   then that structured content is mapped to the node's children, and each child node receives its corresponding value as input.
4. If an `object` (dict) node has unstructured (string) output, then the entire string is passed to all of its children. This allows child nodes
   to handle parsing individually rather than requiring a single parent regex to extract all keys at once.
5. If an `array` (list) node has unstructured (string) output, then this throws an error.
There is a small set of allowable `x-` keys that indicate how parsing should be done at each node:

- `x-regex`: A regex string to apply to the input. If the regex has named groups, the output is a dict of group names to values. Named groups should only be used in `object` nodes.
  Otherwise, the regex must have exactly one unnamed capturing group, and the output is the value of that group as a string.
- `x-regex-iterator`: A regex string to apply to the input using `re.findall()`. The output is a list of all matches.
  This should only be used in `array` nodes, and the regex must have exactly one unnamed capturing group. The output is distributed to
  the node's `items` schema.
- `x-parser`: Calls a built-in parser to apply to the input. Currently, the only supported parser is `json`, which parses the input string as JSON.
  The output is passed to the child nodes for further parsing. Note that the `json` parser can return deeply nested output - in this case, the output
  will be progressively unwrapped as it is passed through child nodes. The child nodes do not need additional `x-parser` or `x-regex` fields in this case,
  but their structure must match the structure of the parsed JSON.
- `x-parser-args`: Only allowed in conjunction with `x-parser`. This is a dict of additional arguments that control parsing. Right now, the only supported
  argument is `transform`, which specifies a `jmespath` transformation to apply to the output. This is useful when the JSON parser returns a structure
  that needs to be modified to match the schema.
- `x-regex-key-value`: This is rarely necessary, but it can be useful when parsing key-value pairs in non-JSON format where the names of the keys are not known
  in advance, such as when a model emits XML tool calls with arbitrary argument names. The regex must have exactly two named capturing groups,
  `key` and `value`, and the output is a dict mapping keys to values. This should only be used in `object` nodes. A short sketch of this key in use follows the list below.
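As a sketch of that last key (the XML-ish tag format here is invented purely for illustration), a node that parses `<arg name="location">San Francisco</arg>`-style pairs might look like:

```python
arguments_schema = {
    "type": "object",
    # Each match contributes one key/value pair, e.g. {"location": "San Francisco"},
    # even though the argument names aren't known when the schema is written.
    "x-regex-key-value": r'<arg name="(?P<key>\w+)">(?P<value>.*?)</arg>',
    "additionalProperties": {"type": "string"},
}
```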
In general, multiple regexes/parsers cannot be combined at the same level. The exception is that `x-regex`, returning a single string, can be combined with the other parsers. In this case,
`x-regex` is applied first, and then the output is passed to the other parser, either `x-regex-iterator`, `x-parser`, or `x-regex-key-value`.
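For example, here's a hedged fragment of our own (the `<args>` tag format is invented) that chains `x-regex` with `x-parser` and a `jmespath` transform:

```python
arguments_schema = {
    "type": "object",
    # First, extract the JSON substring from the surrounding tags...
    "x-regex": r"<args>(.*?)</args>",
    # ...then parse it as JSON...
    "x-parser": "json",
    # ...and optionally reshape the parsed structure with a jmespath expression,
    # here pulling out a hypothetical "parameters" key.
    "x-parser-args": {"transform": "parameters"},
    "additionalProperties": {"type": "any"},
}
```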
Putting these ideas together, you can see that the input flows through the schema, being parsed at each level and then distributed to child nodes. Each level
only needs to extract the input content that is relevant for that part of the schema, and can then let its child nodes handle the rest. Internally, this is handled
with a parser function that receives input, applies any regexes/parsers at the current level, then maps the result to its child nodes before recursively calling itself on each of them.
Recursion terminates when it reaches leaf nodes, usually primitive types like `string` or `number`, which simply return the input they receive.
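To make that recursion concrete, here is a deliberately simplified sketch of our own (not the actual Transformers source, which handles many more cases):

```python
import json
import re


def parse_node(schema: dict, value):
    """Toy version of the recursive descent described above."""
    # Rule 1: apply any regex/parser present at this level.
    if isinstance(value, str):
        if "x-regex" in schema:
            match = re.search(schema["x-regex"], value, re.DOTALL)
            if match is None:
                return None
            # Named groups give structured (dict) output; otherwise take group 1.
            value = match.groupdict() or match.group(1)
        if "x-regex-iterator" in schema:
            value = re.findall(schema["x-regex-iterator"], value, re.DOTALL)
        if schema.get("x-parser") == "json":
            value = json.loads(value)
    # Constants ignore their input entirely.
    if "const" in schema:
        return schema["const"]
    # Rules 3 and 4: distribute the result to child nodes.
    if schema.get("type") == "object":
        out = {}
        for key, child_schema in schema.get("properties", {}).items():
            # Structured output maps keys to children; an unparsed string goes to every child.
            child_input = value.get(key) if isinstance(value, dict) else value
            child_value = parse_node(child_schema, child_input)
            if child_value is not None:
                out[key] = child_value
        return out
    if schema.get("type") == "array":
        # Rule 5: arrays must receive structured input.
        if isinstance(value, str):
            raise TypeError("array nodes cannot receive unstructured input")
        return [parse_node(schema["items"], item) for item in value]
    # Leaf nodes (string/number/etc.) return their input unchanged.
    return value
```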

setup.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -117,6 +117,7 @@
     "importlib_metadata",
     "ipadic>=1.0.0,<2.0",
     "jinja2>=3.1.0",
+    "jmespath>=1.0.1",
     "kenlm",
     "kernels>=0.10.2,<0.11",
     "librosa",
@@ -294,7 +295,7 @@ def run(self):
 extras["sentencepiece"] = deps_list("sentencepiece", "protobuf")
 extras["tiktoken"] = deps_list("tiktoken", "blobfile")
 extras["mistral-common"] = deps_list("mistral-common[opencv]")
-extras["chat_template"] = deps_list("jinja2")
+extras["chat_template"] = deps_list("jinja2", "jmespath")
 extras["testing"] = (
     deps_list(
         "pytest",
```

src/transformers/dependency_versions_table.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -27,6 +27,7 @@
     "importlib_metadata": "importlib_metadata",
     "ipadic": "ipadic>=1.0.0,<2.0",
     "jinja2": "jinja2>=3.1.0",
+    "jmespath": "jmespath>=1.0.1",
     "kenlm": "kenlm",
     "kernels": "kernels>=0.10.2,<0.11",
     "librosa": "librosa",
```

src/transformers/pipelines/text_generation.py

Lines changed: 17 additions & 1 deletion

```diff
@@ -152,6 +152,8 @@ def _sanitize_parameters(
         continue_final_message=None,
         skip_special_tokens=None,
         tokenizer_encode_kwargs=None,
+        tools=None,
+        documents=None,
         **generate_kwargs,
     ):
         # preprocess kwargs
@@ -170,6 +172,11 @@ def _sanitize_parameters(
             preprocess_params["max_length"] = max_length
             generate_kwargs["max_length"] = max_length
 
+        if tools is not None:
+            preprocess_params["tools"] = tools
+        if documents is not None:
+            preprocess_params["documents"] = documents
+
         if prefix is not None:
             preprocess_params["prefix"] = prefix
         if prefix:
@@ -335,6 +342,8 @@ def preprocess(
         max_length=None,
         continue_final_message=None,
         tokenizer_encode_kwargs=None,
+        tools=None,
+        documents=None,
         **generate_kwargs,
     ):
         # Only set non-None tokenizer kwargs, so as to rely on the tokenizer's defaults
@@ -359,6 +368,8 @@ def preprocess(
                 continue_final_message=continue_final_message,
                 return_dict=True,
                 return_tensors="pt",
+                tools=tools,
+                documents=documents,
                 **tokenizer_kwargs,
             )
         else:
@@ -514,7 +525,12 @@ def postprocess(
                 ]
             else:
                 # When we're not starting from a prefill, the output is a new assistant message
-                all_text = list(prompt_text.messages) + [{"role": "assistant", "content": all_text}]
+                if self.tokenizer.response_schema:
+                    assistant_message = self.tokenizer.parse_response(all_text)
+                else:
+                    # If there's no schema, then we have to assume it's all content
+                    assistant_message = {"role": "assistant", "content": all_text}
+                all_text = list(prompt_text.messages) + [assistant_message]
         record = {"generated_text": all_text}
         for key, values in split_keys.items():
             record[key] = values[idx]
```
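With these changes, tools flow through the pipeline and parsed responses land directly in the returned chat. A minimal sketch (the tool function and prompt are our own; any model whose tokenizer ships a `response_schema` will return a parsed assistant message):

```python
from transformers import pipeline

def get_current_weather(location: str):
    """
    Gets the current weather.

    Args:
        location: The city and state, e.g. San Francisco, CA
    """
    return "sunny"

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B")
chat = [{"role": "user", "content": "What's the weather like in Paris?"}]
out = pipe(chat, tools=[get_current_weather], max_new_tokens=512)

# The last message is parsed via parse_response when a response schema is available,
# so it may contain "thinking" or "tool_calls" keys rather than just "content".
print(out[0]["generated_text"][-1])
```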

src/transformers/testing_utils.py

Lines changed: 8 additions & 0 deletions

```diff
@@ -102,6 +102,7 @@
     is_huggingface_hub_greater_or_equal,
     is_ipex_available,
     is_jinja_available,
+    is_jmespath_available,
     is_jumanpp_available,
     is_kernels_available,
     is_levenshtein_available,
@@ -508,6 +509,13 @@ def require_jinja(test_case):
     return unittest.skipUnless(is_jinja_available(), "test requires jinja")(test_case)
 
 
+def require_jmespath(test_case):
+    """
+    Decorator marking a test that requires jmespath. These tests are skipped when jmespath isn't installed.
+    """
+    return unittest.skipUnless(is_jmespath_available(), "test requires jmespath")(test_case)
+
+
 def require_onnx(test_case):
     return unittest.skipUnless(is_onnx_available(), "test requires ONNX")(test_case)
```
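Test authors can then gate jmespath-dependent tests the same way as the other `require_*` decorators. A hypothetical usage sketch:

```python
import unittest

from transformers.testing_utils import require_jmespath


class ChatParsingTest(unittest.TestCase):
    @require_jmespath
    def test_schema_with_jmespath_transform(self):
        # Hypothetical test body; runs only when jmespath is installed.
        ...
```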
