
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639

Draft · wants to merge 175 commits into base: master

Conversation

ochafik (Collaborator) commented on Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Background

It tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>", as the leading .* will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I avoided this issue in the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before a trigger fires, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that). A conceptual sketch follows this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo / these docs) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right, but it helps steer the 1B & 3B models. We'll try to detect the model size so as to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.llama3, it's >>>toolN\n for each toolN.
      • For Functionary v3-llama3.1, it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies: it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), it comes with decent error reporting and simple tests, and we could always switch to another implementation in the future.
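
To make the lazy grammar idea concrete, here's a minimal conceptual sketch. The names (LazyGrammarState, on_token) are illustrative only, not the API introduced by this PR: generation starts unconstrained, and the first trigger word to appear flips sampling into grammar-constrained mode.

    // Conceptual sketch only; hypothetical names, not the actual llama.cpp API.
    #include <string>
    #include <vector>

    struct LazyGrammarState {
        std::vector<std::string> trigger_words; // e.g. {"<tool_call>", "{\"name\": \"toolN\""}
        bool triggered = false;                 // flips once a trigger word is seen
        std::string output;                     // text decoded so far

        // Call after appending each decoded token piece; returns true when
        // subsequent sampling must be constrained by the tool-call grammar.
        bool on_token(const std::string & piece) {
            output += piece;
            if (!triggered) {
                for (const auto & w : trigger_words) {
                    if (output.find(w) != std::string::npos) {
                        triggered = true; // fully constrained from here on
                        break;
                    }
                }
            }
            return triggered;
        }
    };

A real implementation matches triggers incrementally (this PR refactors stop words and trigger words into a shared Aho–Corasick matcher) rather than rescanning the whole output on every token.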

With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently itemized; to be re-itemized as commits):

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (a barebones Aho–Corasick algorithm to handle multiple stop words efficiently; with grammar trigger words we may have many), aligning the CLI & server (e.g. single-token stop logic) and handling grammar trigger words.

  • minja.hpp + test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}: minimal Jinja templating engine and its tests against actual templates & a few test contexts (now in its own repo: https://github.com/google/minja; a usage sketch follows this list)

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro

  • Integration in llama-server (fenced by --jinja) w/ tools, tool_choice support + updated response_format compliance.

  • Minimal examples/agent with a tool call / action loop, barebones tools and instructions / support to run them in a siloed docker container (see usage below)
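
For a flavor of the templating part, here's roughly what rendering with minja looks like, based on the example in the google/minja README (treat the exact signatures as indicative):

    #include <iostream>
    #include <minja/minja.hpp>
    #include <nlohmann/json.hpp>

    using json = nlohmann::ordered_json;

    int main() {
        // Parse a tiny template; real chat templates come from the GGUF
        // metadata or from scripts/get_hf_chat_template.py.
        auto tmpl = minja::Parser::parse("Hello, {{ location }}!", /* options= */ {});
        auto context = minja::Context::make(minja::Value(json {
            {"location", "World"},
        }));
        std::cout << tmpl->render(context) << std::endl; // Hello, World!
    }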

How to use / test

While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models:

  • Llama 3.x
  • Functionary 3.x
  • Hermes 2/3, Qwen 2.5
  • Mistral Nemo.

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting http://localhost:8080/props, and look for Tool call style: in the logs.

Here's how to run an agent w/ local tool call:

  • Install prerequisite: uv (used to simplify python deps)

  • Run llama-server w/ any model:

    make -j LLAMA_CURL=1 llama-server
    
    # Native support for Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x
    # Note that some of these GGUFs lack the right template, so we override it
    # (otherwise they'd use the generic tool call support, which may be less efficient
    # and consume more tokens)
    
    ./llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr bartowski/Qwen2.5-7B-Instruct-GGUF -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf
    
    ./llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr NousResearch/Hermes-3-Llama-3.1-8B-GGUF -hff Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )
    
    ./llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meetkai/functionary-medium-v3.2 )
    
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meta-llama/Llama-3.2-3B-Instruct )
    
    # Note the --special flag: this is needed b/c of a regression from the last merge, will fix!
    ./llama-server --jinja -fa -ctk q8_0 -ctv q8_0 --verbose --special \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py mistralai/Mistral-Nemo-Instruct-2407 )
    
    # Generic support, e.g. Phi 3.5, Gemma 2b, but really anything goes
    
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/Phi-3.5-mini-instruct-GGUF -hff Phi-3.5-mini-instruct-Q4_K_M.gguf
    
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/gemma-2-2b-it-GGUF -hff gemma-2-2b-it-Q4_K_M.gguf
  • Run the tools in examples/agent/tools inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you wanna watch over those agents' shoulders for the time being 🧐). Check http://localhost:8088/docs to see the tools exposed.

    export BRAVE_SEARCH_API_KEY=... # Get one at https://api.search.brave.com/
    ./examples/agent/serve_tools_inside_docker.sh

    [!WARNING]
    The command above gives tools (and your agent) access to the web (and read-only access to examples/agent/**). You can loosen / restrict web access in examples/agent/squid/conf/squid.conf.

  • Run the agent with some goal:

    uv run examples/agent/run.py "What is the sum of 2535 squared and 32222000403?"
    See output w/ Hermes-3-Llama-3.1-8B
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  python(code="print(2535**2 + 32222000403)")
    → 15 chars
    The sum of 2535 squared and 32222000403 is 32228426628.
    
    uv run examples/agent/run.py "What is the best BBQ joint in Laguna Beach?"
    See output w/ Hermes-3-Llama-3.1-8B
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="best bbq joint in laguna beach")
    → 4283 chars
    Based on the search results, Beach Pit BBQ seems to be a popular and highly-rated BBQ joint in Laguna Beach. They offer a variety of BBQ options, including ribs, pulled pork, brisket, salads, wings, and more. They have dine-in, take-out, and catering options available.
    
    uv run examples/agent/run.py "Search (with brave), fetch and summarize the homepage of llama.cpp"
    See output w/ Hermes-3-Llama-3.1-8B
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="llama.cpp")
    → 3330 chars
    Llama.cpp is an open-source software library written in C++ that performs inference on various Large Language Models (LLMs). Alongside the library, it includes a CLI and web server. It is co-developed alongside the GGML project, a general-purpose tensor library. Llama.cpp is also available with Python bindings, known as llama.cpp-python. It has gained popularity for its ability to run LLMs on local machines, such as Macs with NVIDIA RTX systems. Users can leverage this library to accelerate LLMs and integrate them into various applications. There are numerous resources available, including tutorials and guides, for getting started with Llama.cpp and llama.cpp-python.
    
  • To compare the above results w/ a cloud provider's tool usage behaviour, just set the --provider flag (accepts openai, together, groq) and/or use --endpoint, --api-key, and --model

    export LLAMA_API_KEY=...      # for --provider=llama.cpp https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
    export OPENAI_API_KEY=...     # for --provider=openai    https://platform.openai.com/api-keys
    export TOGETHER_API_KEY=...   # for --provider=together  https://api.together.ai/settings/api-keys
    export GROQ_API_KEY=...       # for --provider=groq      https://console.groq.com/keys
    uv run examples/agent/run.py "Search for, fetch and summarize the homepage of llama.cpp" --provider=openai

TODOs before undrafting:

  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR --> https://github.com/google/minja
  • Port former behave / feature tool call tests to new pytest setup (server : replace behave with pytest #10416)
  • Fix regression requiring --special for Nemo since last merge
  • e2e tests for agent
  • Add a way to require trigger word to be at start of output
  • Fix CI build (tests still failing on windows)
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Add Google search tool as alternative to Brave
  • Functionary v3.2: strip leading "all\n" in non-tool-call outputs
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" and otherwise not trigger spuriously elsewhere.
  • Add support for broken templates (GML3..., Command R Plus, DeepSeek)
  • Nemo: handle special [TOOL_CALLS] token
  • Qwen2.5-72B-Instruct
  • Llama: suspicious early terminations in hello world tests when using the explicit python tool w/ JSON output (could be a failure to escape strings?). Also, need to keep the special <|python_tag|> token
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Support jinja templates that explode on system prompts (replicate current chat template handling that puts system in user)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; follow-up will be to allow in-slot restoration and saving of cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally would pass some kind of ChatHandler between OAI init & final callback, and make it handle streaming / non streaming cases? (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

  • Add tool call loop to the default web chat using Pyodide as a python interpreter?

github-actions bot added the testing (Everything test related), examples, python (python script changes), and server labels on Sep 25, 2024
ochafik changed the title from "Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine" to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine" on Sep 25, 2024
ochafik changed the title from "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine" to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine" on Sep 25, 2024
ngxson (Collaborator) commented on Dec 1, 2024

Hey @ochafik, this is impressive! It's a nice idea to bring a jinja parser into llama.cpp.

I'm interested in this direction. But the current PR is quite big to review. Do you think it's possible to split the jinja part into a dedicated PR?

Btw, @Vaibhavs10, @Rocketknight1 (Matt) and I can help to further improve the jinja implementation. My suggestions are:

  • We can have a first "it just works" version
  • Then, we can run that version on a set of known jinja templates from the Hugging Face hub to see what percentage can be parsed
  • Based on the results, we can decide whether:
    • We should further improve the jinja engine
    • Or have the jinja + old heuristic methods co-exist

ochafik (Collaborator, Author) commented on Dec 4, 2024

Hey @ngxson, thanks for the enthusiasm!

As it turns out, I've just got the approvals today (🎉) from my employer to launch Minja in its own repo → https://github.com/google/minja (this way I'll be able to set up more tests - including fuzzing - and distinct CI, and take some of the complexity away from this PR & llama.cpp in general: we'll just copy minja.hpp as we do for json.hpp and httplib.h)

I'll resume updates to this PR; I've been experimenting along the lines of some of @ggerganov's comments but it needs more work.

Regarding ngxson's suggestion to "run that version on a set of known jinja templates on Hugging Face hub to see what percentage can be parsed":

Here's the list of tested model templates: https://github.com/google/minja/blob/main/tests/CMakeLists.txt#L22

I'd love suggestions for additional models; feel free to open a bug or PR there with either things that work or things that don't.
