Chat response parsing #40894
Conversation
cc @zucchini-nlp do you know what the popular VLMs using reasoning or tool calls are right now? I'd like to add support + testing in the processor too.
From the ones we already have in the library, Ovis2 chat templates have some parts for tool usage. Other than that, I haven't seen explicit reasoning/tools in templates.
Thank you! I'll take a look at Ovis2, if not I can worry about adding
"properties": { | ||
"role": {"const": "assistant"}, | ||
"content": {"type": "string"}, | ||
"thinking": {"type": "string"} |
Hmmn, I think we often use `thinking` because that's the key that chat templates use in their input! The idea here is that the returned dict should be ready to append to the chat history without further user intervention.
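To illustrate the intent (a hypothetical sketch, not code from this PR; the exact keys depend on the model's chat template):

```python
# Hypothetical: the parsed dict uses the same keys the chat template expects,
# so it can be appended to the conversation with no further edits.
parsed = {"role": "assistant", "thinking": "6 * 8 = 48", "content": "The answer is 48."}
messages.append(parsed)  # ready for the next apply_chat_template call
```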
I see, that makes sense - is that just for gpt-oss or have you seen other models adopt `thinking` too in their chat templates?
To clarify, my question was about whether returning `reasoning_content` would provide a drop-in replacement for the vLLM reasoning parsers. No strong opinion either way :)
Now that you mention it, a lot of LLMs drop the `thinking` block entirely in their template, because they don't render thinking blocks from past turns. We could probably switch to `reasoning_content` without too much pain!
Also +1 for `reasoning_content`. But there is also a chance here to standardize, maybe along the lines of https://standardcompletions.org/
this is very useful for huggingface/trl#4115
```python
    def parse_response(self, response: str, schema: Optional[Union[list, dict]] = None):
        if schema is None:
            if getattr(self, "response_schema", None) is None:
```
nit

```diff
- if getattr(self, "response_schema", None) is None:
+ if not hasattr(self, "response_schema"):
```
I think this fails in the case where `self.response_schema = None`, which we might set at init!
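A minimal sketch of the difference (using a hypothetical `Tok` class, not code from the PR):

```python
class Tok:
    def __init__(self):
        self.response_schema = None  # schema explicitly set to None at init

tok = Tok()
print(not hasattr(tok, "response_schema"))             # False: the attribute exists, so this check misses the case
print(getattr(tok, "response_schema", None) is None)   # True: correctly treated as "no schema set"
```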
Made a quick demo so reviewers can try this out. Just switch to the

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's 6 * 8?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1])
```

If it works, you should see the output correctly split into

And here's a quick demo for tool calling:

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
        location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris?"
messages = [
    {"role": "user", "content": prompt}
]

out = pipe(messages, tools=[get_current_weather], max_new_tokens=512)
print(out[0]["generated_text"][-1])
```

The tool call should be correctly parsed as a key in the response dict.
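For reference, the parsed final message should look roughly like this (a hypothetical shape; the exact thinking text and argument values will vary between runs):

```python
# Hypothetical example of out[0]["generated_text"][-1] after parsing:
parsed_message = {
    "role": "assistant",
    "thinking": "The user wants the weather in Paris, so I should call get_current_weather...",
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "get_current_weather", "arguments": {"location": "Paris"}},
        }
    ],
}
```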
Hey, I've recently been working on the tool call handling in vLLM and came across this. Lots of really cool stuff going on here! A couple questions I have:
To this point, the Harmony library expects to deal directly with token ids, right? So if we allow integration with model provider libraries for this parsing, we may have to support token ids.
hi there, I help with structured outputs and tool calling on vLLM.
I left some comments here. lmk if there is anything we can help with
There is also a WIP tool call parser on our end that largely depends on the xgrammar structural tag, so I just want to make sure it wouldn't duplicate some of the work here as well.
"properties": { | ||
"role": {"const": "assistant"}, | ||
"content": {"type": "string"}, | ||
"thinking": {"type": "string"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also +1 for reasoning_content
. But there is also a chance here to standardize
Maybe along the line of https://standardcompletions.org/
Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to save and share the schema.
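A quick sketch of that workflow (assuming a schema dict called `schema` and generated text in `response_text`; both names are illustrative, not part of the PR):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint, borrowed from the demo earlier in this thread.
tokenizer = AutoTokenizer.from_pretrained("Rocketknight1/qwen-response-test")

tokenizer.response_schema = schema                  # enable response parsing
message = tokenizer.parse_response(response_text)   # message dict, ready to append to the chat

tokenizer.save_pretrained("my-model")               # the schema is saved alongside the chat template
# or: tokenizer.push_to_hub("my-org/my-model")
```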
my 2 cents: I do think it might be beneficial to keep the parser implementation of `tokenizer.parse_response` in huggingface/tokenizers (i.e. the Rust implementation). For the openai/harmony format, they also seem very performant.
## Developers: Complex schemas

Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
you might also be interested
## Developers: Understanding a simple response schema

Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
How should we handle some `tool_calls` that are in XML format? For example, Qwen3-Coder.
In general, our inputs/outputs are in JSON schema format, even when models render them in a different format. We expect the input to a chat template to be JSON schema, or equivalent Python, and the decoded output from chat parsing should be as well. This was to enable a consistent API across models.
This is true even when the model does something totally different, like rendering tool calls in XML! In that case, the chat template and parser should translate the standard API to XML and back.
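A purely illustrative round trip (not Qwen3-Coder's actual format):

```python
# Standard API form stored in the chat history...
tool_call = {
    "type": "function",
    "function": {"name": "get_current_weather", "arguments": {"location": "Paris"}},
}
# ...which a chat template might render for the model as something like:
#   <tool_call><name>get_current_weather</name><param name="location">Paris</param></tool_call>
# The response parser then maps that XML back into the dict above.
```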
It's likely (assuming we go with this feature and don't replace it with something more like Structural Tag) that we'd add an `xml` parser to the spec as well, like the `json` parser that already exists.
Thanks for the reference @aarnphm! This is a really cool and useful PR! We’ve also had discussions with the vLLM team about how to build a unified tool calling parser interface, and I think this will be very helpful as well.
In XGrammar, we previously implemented a Structural Tag whose goal is to describe various kinds of structures. With guided decoding, it ensures the output strictly follows the defined structure, while also leaving room for potential support of parsing output in the future. This will be merged into the vLLM/SGLang/TensorRT main branches within the next two weeks. Its docs can be found at https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html. It also supports constraining output in XML format.
I think that also aligns very well with the goal of this PR. The difference is that this API focuses more on high-level semantics, whereas Structural Tag emphasizes structure at the raw text level. I believe there’s a lot of space for collaboration here, and I’d love to explore this further.
I would also like to see a push for xgrammar upstream as well here.
@Ubospica yes, Structural Tag looks very similar to this! Constrained generation and output parsing obviously have a lot of overlap, since they both define an output schema of some kind. Do you know how output parsing was intended to be implemented for Structural Tag?
Actually, a better question: If we want to align with XGrammar, should we try to extend the Structural Tag spec to allow output parsing, or should we have an output schema for parsing that's separate from the Structural Tag?
I thought about it for a while and here's what I have:

The main thing response parsing needs that structural tag doesn't have is a way to map segments of the output to paths in the parsed message dict. For example, it is very common that LLMs render tool calls or tool definitions as JSON schema, but not in the standard OpenAI format our API expects for tool calls. For example, they may rename "arguments" to "parameters", or they may leave out the "function" key, etc. The goal of chat parsing is that the model should return a message dictionary that is ready to be appended to the chat without any further parsing from the user, which means it must be in the standard API format. This means we need a way to map "segments" of the structural tag schema to segments of the output dict. If the model emits:
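(a hypothetical illustration of the kind of non-standard tool call described above)

```python
# Hypothetical raw tool call emitted by the model: "arguments" is renamed to
# "parameters" and the "function" wrapper key is missing.
{"name": "get_current_weather", "parameters": {"location": "Paris"}}
```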
The output we want from parsing is:
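(again, a hypothetical sketch of the standard OpenAI-style message format)

```python
# Hypothetical parsed message in the standard API format, ready to append to the chat.
{
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "get_current_weather", "arguments": {"location": "Paris"}},
        }
    ],
}
```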
It's easy to express the model output above as a structural tag schema, but it's trickier to do the mapping from there to the output format we want. It may even be strictly impossible in some cases - some of the structural tag keys like

Maybe it makes more sense to keep response parsing separate, but I can upstream it or something like it to XGrammar as a separate feature, alongside structural tags for constrained generation?
For information, multiple tool calls aren't supported yet:

```python
from transformers import pipeline

model_name = "Rocketknight1/qwen-response-test"

def get_current_weather(location: str):
    """
    Gets the weather at a given location

    Args:
        location: The location to get the weather for
    """
    return 20.

pipe = pipeline("text-generation", model_name, dtype="auto", device_map="auto")

prompt = "Hey, what's the weather like in Paris and London?"
messages = [
    {"role": "user", "content": prompt}
]
tools = [get_current_weather]

out = pipe(messages, tools=tools, max_new_tokens=512)
print(out[0]["generated_text"][-1])
```
Thanks for the great discussion here! Previously xgrammar had a proposal for parsing some output text with the structural tag at mlc-ai/xgrammar#303. The API is like
It's not finished yet, but it should be easy to pick it up again. This maps the output text to the structure of the StructuralTag. To further use it in tool calling (and also parallel tool calling), we still need to map it into the OpenAI format (the

I agree with you that with regex/ebnf the output is harder to parse. It's possible with xgrammar's builtin Earley parser, but it may introduce multiple possible parsing results and we need to determine which one to use. Maybe we can restrict the parser to StructuralTags without any regex/ebnf content.

One possible and suitable solution is that we can have an extra layer of abstraction for tool calling and further lower that into the structural tag, use the structural tag parser to handle it, then convert it to the OpenAI tool calling format. The benefit could be: 1) it's easier to handle the different tool calling formats of different models; 2) we can unify constrained decoding (or guided decoding) and tool calling parsing.
Quick update here: I experimented with dropping this PR and instead adding parsing support to

I think constrained generation schemas are still very useful! But I'm much less confident that it's a good idea to overload them with both tasks; there's actually a lot of friction between them. It probably makes more sense for models to have both a "response schema" and a "generation schema", and this PR will remain focused on the first.
This PR is a replacement for #39609. The idea is that models can include a message schema, allowing model output to be parsed into a structured form. The original plan was to allow parsing of the entire chat history, essentially the inverse operation of `apply_chat_template`, but the schemas involved were too complex and there was no realistic hope that users would be able to write them!

This PR simplifies things - we focus only on parsing the output generated by the model. This is mainly relevant for tool calling and chain-of-thought models, both of which emit structured output that often needs manual handling before it can be appended to the chat.

The output schema is stored as a key on the tokenizer. It consists of a JSON schema, representing the structure of messages emitted by the model, with additional `x-` keys that indicate how parsing should be performed. Parsing is mostly done through regexes, but there is also support for common tool call formats like `json` to be directly parsed without you having to write an entire JSON regex parser 😅

Work to do:

- `Processor` classes too
- `ImageTextToText` pipeline
- Deepseek (tool calling isn't working in their template, will fix after)
- (`xml` support)

Documentation to do:

- `parse_response` explanation and show how it works with `TextGenerationPipeline`
- `x-` fields

Open questions:

- Merge `chat_parsing_utils.py` into `chat_template_utils.py`?
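As a rough sketch of the idea (the property layout follows the diff hunks quoted above, but the `x-` parsing directives are placeholder names, not the PR's actual keys):

```python
# Hypothetical response schema: a JSON schema describing the assistant message,
# with x- annotations (placeholder names here) telling the parser how to extract
# each field from the raw generated text.
response_schema = {
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"},  # an x- regex key would mark how to pull this block out
    },
}
```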