τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Paper: https://arxiv.org/abs/2406.12045

Citation:

@misc{yao2024tau,
      title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains}, 
      author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
      year={2024},
      eprint={2406.12045},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2406.12045}, 
}

Leaderboard

Strategy	Pass^1	Pass^2	Pass^3	Pass^4
TC (gpt-4o)	0.380	0.237	0.185	0.160
TC (claude-3-5-sonnet-20240620)	??	??	??	??
TC (mistral-large-2407)	??	??	??	??
TC (gpt-4o-mini)	0.225	0.140	0.110	0.100
Act (gpt-4o)	0.365	0.217	0.160	0.140
ReAct (gpt-4o)	0.325	0.233	0.185	0.160

*TC = tool-calling strategy (the function-calling strategy reported in the paper)
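
The Pass^k columns can be reproduced from per-task success counts. The following is a minimal sketch, assuming the estimator described in the paper as we understand it: for each task with c successful trials out of n, take C(c, k) / C(n, k), then average over tasks. The function name and the sample counts are hypothetical and are not part of this repository.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Average over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials for each task (c).
    num_trials: trials run per task (n), assumed equal for all tasks.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    total = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so such tasks contribute nothing.
    per_task = [comb(c, k) / total for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 3 tasks, 4 trials each, with 4, 2, and 0 successes.
counts = [4, 2, 0]
for k in range(1, 5):
    print(f"pass^{k} = {pass_hat_k(counts, 4, k):.3f}")
```

Because a task only counts toward Pass^k when at least k of its trials succeed, the score cannot increase as k grows, which matches the trend in the table above.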

Setup

Clone this repository:

git clone https://github.com/sierra-research/tau-bench && cd ./tau-bench

Install from source (which also installs required packages):

pip install -e .

Set up your OpenAI / Anthropic / Google / Mistral / AnyScale API keys as environment variables.

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
MISTRAL_API_KEY=...
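
Before starting a long run, it can help to confirm that the keys are visible to Python. This is a small sketch, not part of the repository; it only checks the four variable names listed above, and the helper name `missing_keys` is hypothetical.

```python
import os

# Variable names taken from the list above; extend for other providers.
KEY_NAMES = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "MISTRAL_API_KEY",
]


def missing_keys(env=None):
    """Return the key names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in KEY_NAMES if not env.get(name)]


missing = missing_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All listed API keys are set.")
```

You only need the keys for the providers you actually use, so a non-empty list is not necessarily a problem.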

Run

Run a tool-calling agent on the τ-retail environment:

python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10

Set max concurrency according to your API limit(s).
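
If you launch runs from a script, the command above can be assembled as an argument list so that the concurrency is easy to adjust. This is a sketch only: the helper `build_run_command` is hypothetical, and it uses just the flags shown in the command above.

```python
import shlex


def build_run_command(env="retail", model="gpt-4o", provider="openai",
                      max_concurrency=10):
    """Build the argv list for run.py using the flags shown above."""
    return [
        "python", "run.py",
        "--agent-strategy", "tool-calling",
        "--env", env,
        "--model", model,
        "--model-provider", provider,
        "--user-model", "gpt-4o",
        "--user-model-provider", "openai",
        "--user-strategy", "llm",
        "--max-concurrency", str(max_concurrency),
    ]


# Print the command with a lower concurrency for a rate-limited account.
print(shlex.join(build_run_command(max_concurrency=4)))
```

The list can be passed directly to `subprocess.run` from the repository root.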

User simulators

By default, we use gpt-4o as the user simulator with the llm strategy. You can use other models by setting the --user-model flag (together with --user-model-provider), or other strategies by setting the --user-strategy flag. For example, to run a tool-calling agent with a Claude user simulator:

python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model claude-3-5-sonnet-20240620 --user-model-provider anthropic --user-strategy llm

Other strategies:

To run the react user simulator:

python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy react

Example of a react user response:

Thought:
I should provide my name and zip code as I wasn't given an email address to use.

User Response:
Sure, my name is Yusuf Rossi, and my zip code is 19122.
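
Only the text after "User Response:" is meant for the agent, so a response in this format has to be split before it is passed on. The following is a minimal sketch, assuming the two-section layout shown in the example above; the function `parse_react_response` is hypothetical and is not the repository's own parser.

```python
def parse_react_response(text):
    """Split a react-style reply into (thought, user_response)."""
    response_marker = "User Response:"
    if response_marker not in text:
        # No marker found: treat the whole text as the user response.
        return "", text.strip()
    before, _, after = text.partition(response_marker)
    thought = before.replace("Thought:", "", 1).strip()
    return thought, after.strip()


example = """Thought:
I should provide my name and zip code as I wasn't given an email address to use.

User Response:
Sure, my name is Yusuf Rossi, and my zip code is 19122."""

thought, reply = parse_react_response(example)
print(reply)
```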

To run the verify user simulator:

python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy verify

This strategy uses a subsequent LLM verification step to check if the user simulator's response is satisfactory. If not, the user simulator will be prompted to generate a new response.

To run the reflection user simulator:

python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy reflection

This strategy uses a subsequent LLM verification step to check if the user simulator's response is satisfactory. If not, the user simulator will be prompted to reflect on its response and generate a new response.
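
The difference between the verify and reflection strategies can be sketched as one retry loop with an optional reflection step. This is an illustration under stated assumptions, not the repository's implementation: `generate`, `verify`, and `reflect` are hypothetical callables standing in for LLM calls, and the `max_attempts` cap is an assumption that the description above does not specify.

```python
def simulate_user_turn(generate, verify, reflect=None, max_attempts=3):
    """Generate a user response, retrying while verification fails.

    With reflect=None this mirrors the verify strategy; passing a reflect
    callable mirrors the reflection strategy, where feedback on the
    rejected response is given to the next generation attempt.
    """
    feedback = None
    response = generate(feedback)
    for _ in range(max_attempts - 1):
        if verify(response):
            break
        if reflect is not None:
            feedback = reflect(response)
        response = generate(feedback)
    # The last attempt is returned even if it was never verified.
    return response


# Stub callables standing in for LLM calls (hypothetical).
drafts = iter(["I am an AI.", "My zip code is 19122."])
result = simulate_user_turn(
    generate=lambda feedback: next(drafts),
    verify=lambda response: "AI" not in response,
    reflect=lambda response: "Stay in character as the customer.",
)
print(result)
```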

Auto error identification

It is often difficult and time-consuming to manually identify specific error locations in trajectories, as they can be long and the constraints can be complex. We provide an auto error identification tool that does the following:

Fault assignment: determine the entity that is responsible for the fault (user, agent, environment)
Fault type classification: classify the type of fault (goal_partially_completed, used_wrong_tool, used_wrong_tool_argument, took_unintended_action)

Both labels are accompanied by a description.
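
When post-processing the tool's output, the two labels can be modeled as enumerations so that unexpected values fail loudly. This is a sketch only: the class and field names are hypothetical and may not match the output file's actual schema; only the label values are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class FaultAuthor(str, Enum):
    USER = "user"
    AGENT = "agent"
    ENVIRONMENT = "environment"


class FaultType(str, Enum):
    GOAL_PARTIALLY_COMPLETED = "goal_partially_completed"
    USED_WRONG_TOOL = "used_wrong_tool"
    USED_WRONG_TOOL_ARGUMENT = "used_wrong_tool_argument"
    TOOK_UNINTENDED_ACTION = "took_unintended_action"


@dataclass
class FaultRecord:
    author: FaultAuthor
    author_description: str
    fault_type: FaultType
    type_description: str


# Hypothetical record; constructing an enum from an unknown string raises
# ValueError, which surfaces malformed labels early.
record = FaultRecord(
    author=FaultAuthor("agent"),
    author_description="The agent issued the faulty call.",
    fault_type=FaultType("used_wrong_tool_argument"),
    type_description="A tool was called with an incorrect argument.",
)
print(record.author.value, record.fault_type.value)
```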

To run the auto error identification, run:

python auto_error_identification.py --env <airline/retail> --results-path <the path to your results file here> --max-concurrency 16 --output-path test-auto-error-identification -n 10

Please note that this feature relies on an LLM, so its error identifications may be inaccurate.

*Notice: If an error is raised due to the structure of your results file, you may have to rerun the benchmark to produce a new results file. We have recently rewritten the benchmark to be more type-safe and extensible.

License

See ./LICENSE.

Contact

Please submit issues or pull requests if you find problems with the benchmark.
