Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639
Hey @ochafik, this is impressive! It's a nice idea to bring a jinja parser into llama.cpp; I'm interested in this direction. But the current PR is quite big to review. Do you think it's possible to split the jinja part into a dedicated PR? Btw, @Vaibhavs10, @Rocketknight1 (Matt) and I can help to further improve the jinja implementation. My suggestions are:
Hey @ngxson, thanks for the enthusiasm! As it turns out, I've just got the approvals today (🎉) from my employer to launch Minja in its own repo → https://github.com/google/minja (this way I'll be able to set up more tests - including fuzzing - and distinct CI, and take some of the complexity away from this PR & llama.cpp in general). I'll resume updates to this PR; I've been experimenting along the lines of some of @ggerganov's comments, but it needs more work.

Here's the list of tested model templates: https://github.com/google/minja/blob/main/tests/CMakeLists.txt#L22 - I'd love suggestions of additional models; feel free to open a bug or PR there with either things that work or things that don't.
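For a quick feel of the engine, the minja repo's README has a hello-world along these lines (a sketch; the exact API may have evolved since):

```cpp
#include <minja/minja.hpp>
#include <iostream>

using json = nlohmann::ordered_json;

int main() {
    // Parse a Jinja template, then render it against a JSON context.
    auto tmpl = minja::Parser::parse("Hello, {{ location }}!", /* options= */ {});
    auto context = minja::Context::make(minja::Value(json {
        {"location", "World"},
    }));
    std::cout << tmpl->render(context) << std::endl; // Hello, World!
}
```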
This supersedes #6389 (now using a fully C++ approach), #5695 (the first attempt at supporting Functionary) and #9592 (a more recent Python wrapper).
Background
It tackles two main problems related to tool calling:
- Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (except if `"tool_choice": "required"` is specified in the request). It's not currently possible to say `.* "<tool_call>" constrained "</tool_call>"`, as the leading `.*` will match eagerly. In #6389 I was avoiding this issue with the `thoughtful_steps` style, but the native tool call styles were still problematic.
  - Solved w/ lazy grammars activated by trigger words (similar to stop words, refactored into the same implementation). Output is completely unconstrained before triggers, and completely constrained after, which allows for `content` vs. `tool_call` outputs, and even mixes of the two (for the few models that support that). See the sketch below. Per-model triggers:
    - Llama 3.x: `<|python_tag|>` and `{"name": "toolN"` (for each `toolN` in the list of `tools` in the request). Also `{"`, which isn't quite right but helps steer 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
    - Hermes 2 Pro: `<tool_call>`.
    - Functionary v3.2: `>>>toolN\n` for each `toolN`.
    - Functionary v3.1: `<function=` and `<|python_tag|>`.
    - Mistral Nemo: should trigger on `[TOOL_CALLS]`, but it doesn't seem to (ever?) be emitted, so we're triggering on `{"` instead for now.
    - Generic: no lazy grammar; the output is fully constrained to JSON (as when `tool_choice` is `required`).
- Jinja chat templates: Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.
  - Solved by implementing a minimal Jinja engine (`minja.hpp`), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
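To make the lazy-grammar idea concrete, here's a minimal sketch (hypothetical names, not the PR's actual API) of the state switch: sampling stays unconstrained until a trigger word shows up in the generated text, and grammar constraints apply from then on:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch of lazy grammar activation: unconstrained until a
// trigger word appears, fully constrained afterwards. The PR matches many
// triggers efficiently with Aho-Corasick; this naive scan is just for clarity.
struct lazy_grammar_state {
    std::vector<std::string> trigger_words; // e.g. {"<tool_call>", "<|python_tag|>"}
    bool triggered = false;

    // Feed the text generated so far; returns true once constrained mode is on.
    bool update(const std::string & output_so_far) {
        if (!triggered) {
            for (const auto & word : trigger_words) {
                if (output_so_far.find(word) != std::string::npos) {
                    triggered = true; // grammar constraints apply from here on
                    break;
                }
            }
        }
        return triggered;
    }
};

int main() {
    lazy_grammar_state state;
    state.trigger_words = {"<tool_call>"};
    std::printf("%d\n", state.update("Let me check the weather."));      // 0 -> unconstrained
    std::printf("%d\n", state.update("Let me check it.<tool_call>{\"")); // 1 -> constrained
}
```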
With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently itemized, to be re-itemized as commits):

- `grammar_trigger_words` + `llama_antiprompts`: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning `cli` & `server` (e.g. single-token stop logic) and handling grammar trigger words.
- `minja.hpp` + `test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}`: minimal Jinja templating engine and its tests against actual templates & a few test contexts (now in its own repo: https://github.com/google/minja).
- Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro (see the parsing sketch after this list).
- Integration in `llama-server` (fenced by `--jinja`) w/ `tools` & `tool_choice` support + updated `response_format` compliance.
- Minimal `examples/agent` with a tool call / action loop, barebones tools, and instructions / support to run them in a siloed docker container (see usage below).
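As a rough illustration of the parsing side (a hypothetical helper, not the PR's actual implementation): a Hermes 2 Pro style call arrives as JSON wrapped in `<tool_call>...</tool_call>` tags, so extraction boils down to locating the tags and parsing what's between them:

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

// Hypothetical sketch: pull a single Hermes 2 Pro style tool call out of the
// model output. The real parser also has to handle surrounding content,
// multiple calls, and malformed JSON.
static bool extract_tool_call(const std::string & output, nlohmann::json & call) {
    const std::string open_tag = "<tool_call>", close_tag = "</tool_call>";
    const auto start = output.find(open_tag);
    if (start == std::string::npos) return false;
    const auto body = start + open_tag.size();
    const auto end = output.find(close_tag, body);
    if (end == std::string::npos) return false;
    call = nlohmann::json::parse(output.substr(body, end - body));
    return true;
}

int main() {
    nlohmann::json call;
    const std::string out =
        "Sure!<tool_call>{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Paris\"}}</tool_call>";
    if (extract_tool_call(out, call)) {
        std::cout << call["name"] << " " << call["arguments"].dump() << std::endl;
    }
}
```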
How to use / test
While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models: Llama 3.x, Functionary v3 (2 variants), Hermes 2 Pro, and Mistral Nemo.
For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the `tool_use` variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting `http://localhost:8080/props`, and inspect the logs for `Tool call style: `.

Here's how to run an agent w/ local tool calls:
- Install prerequisite: uv (used to simplify python deps)
- Run `llama-server` w/ any model (with `--jinja` to enable tool call support).
- Run the tools in `examples/agent/tools` inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you wanna watch over those agents' shoulders for the time being 🧐). Check http://localhost:8088/docs to see the tools exposed.
- Run the agent with some goal, e.g.:
  - `uv run examples/agent/run.py "What is the sum of 2535 squared and 32222000403?"` (see output w/ Hermes-3-Llama-3.1-8B)
  - `uv run examples/agent/run.py "What is the best BBQ joint in Laguna Beach?"` (see output w/ Hermes-3-Llama-3.1-8B)
  - `uv run examples/agent/run.py "Search (with brave), fetch and summarize the homepage of llama.cpp"` (see output w/ Hermes-3-Llama-3.1-8B)
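(For reference, a correct answer to the first goal: 2535² = 6,426,225, so 2535² + 32,222,000,403 = 32,228,426,628.)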
To compare the above results w/ a cloud provider's tool usage behaviour, just set the `--provider` flag (accepts `openai`, `together`, `groq`) and/or use `--endpoint`, `--api-key`, and `--model`.
TODOs before undrafting:

- `--special` for Nemo since last merge
- `"all\n"` in non-tool-call outputs for `llama-cli`
- `(^|\n)?{"` and otherwise not trigger spuriously elsewhere
- (Command R Plus, DeepSeek)
- `[TOOL_CALLS]` token
- `<|python_tag|>` token
- `thoughtful_steps` tool support from #6389 (using JSON structured output even with models not trained for tool calling)
- `--cache-prompt` defaults to true; follow up will be to allow in-slot restoration and saving of cache, see this branch for instance
- `chat_template` should maybe be resolved earlier? (now a `llama_chat_template` class)
- `llama_apply_chat_template` would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API `llama_chat_template::apply`)
- `llama_token_to_piece(ctx, token)` should really take `(model, token)` instead, but that's a breaking API change; added `_llama_token_to_piece` that takes a model, and moved the `llama_chat_template_from_model` helper to `common.cpp`
- (`builtin_tools` and `todays_date` in llama3.1's template)
- `test-chat-templates` & `test-minja` (write each test case in a `.jinja` file)
- `bos_token` in the current chat template logic
- (`examples/tool-call`) from #6389

Possible follow ups: