BenchFlow

BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI product developers and benchmark developers.

BenchFlow Diagram

Overview

Demo video: BenchFlow.Demo.2.0-720p-50fps.mp4

The dashed box in the diagram contains the interfaces provided by BenchFlow (BaseAgent and BenchClient). As a benchmark user, you extend and implement the BaseAgent interface to interact with a benchmark. Its call_api method receives a step input that supplies the input for each step of a task (a task may consist of one or more steps).
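
Conceptually, for each step the benchmark hands your agent a step input and expects a string response back. The snippet below only illustrates that contract; the keys shown are hypothetical, since the actual structure of a step input depends on the benchmark.

    # Hypothetical shape of a single step's input (real keys are benchmark-specific).
    step_input = {
        "task_id": 0,
        "instruction": "Find the price of the first item on the page.",
        "observation": "<html>...</html>",
    }

    # For every step, BenchFlow calls your agent and records the returned string:
    # response = your_agent.call_api(step_input)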

Quick Start for Benchmark Users

Before you start, please get your BF_TOKEN on BenchFlow.ai.

  1. Install BenchFlow

    git clone https://github.com/benchflow-ai/benchflow.git
    cd benchflow
    pip install -e .
  2. Browse Benchmarks

    Find benchmarks tailored to your needs on our Benchmark Hub.

  3. Implement Your Agent

    Extend the BaseAgent interface:

    class YourAgent(BaseAgent):
        def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
            # Return your agent's output for this task step.
            pass

    Optional: You can include a requirements.txt file to install additional dependencies, such as openai and requests.
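
    For example, a minimal agent can simply forward the step input to an LLM. The sketch below is one possible implementation rather than the packaged WebarenaAgent; it assumes BaseAgent is importable from the top-level benchflow package, that the whole task_step_inputs dict can serve as the prompt (real agents should pick out the benchmark-specific keys), and that openai is listed in your requirements.txt.

    import os
    from typing import Any, Dict

    from openai import OpenAI

    from benchflow import BaseAgent


    class SimpleOpenAIAgent(BaseAgent):
        def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
            # Turn the step input into a prompt. Real agents should extract
            # the benchmark-specific keys instead of dumping the whole dict.
            prompt = str(task_step_inputs)
            client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
            completion = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return completion.choices[0].message.content or ""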

  4. Test Your Agent

    Here is a quick example to run your agent:

    import os
    from benchflow import load_benchmark
    from benchflow.agents.webarena_openai import WebarenaAgent
    
    # The benchmark name follows the schema: org_name/benchmark_name.
    # You can obtain the benchmark name from the Benchmark Hub.
    bench = load_benchmark(benchmark_name="benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))
    
    your_agent = WebarenaAgent()
    
    run_ids = bench.run(
        task_ids=[0],
        agents=your_agent,
        api={"provider": "openai", "model": "gpt-4o-mini", "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")},
        requirements_txt="webarena_requirements.txt",
        args={}
    )
    
    results = bench.get_results(run_ids)
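
    Once the run completes, you can take a quick look at what came back. A minimal follow-up, assuming get_results returns a JSON-serializable structure (the exact schema depends on the benchmark):

    import json

    # Dump the results for inspection; the schema is benchmark-specific.
    print(json.dumps(results, indent=2, default=str))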

Quick Start for Benchmark Developers

  1. Install BenchFlow

    Install BenchFlow via pip:

    pip install benchflow
  2. Embed BenchClient into Your Benchmark Evaluation Scripts

    Refer to this example for how MMLU-Pro integrates BenchClient.
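
    At a high level, the evaluation script wraps each step in a BenchClient subclass that forwards the benchmark's input to the agent under test and interprets the reply. The skeleton below only sketches that flow; the method names used here (prepare_input, parse_response) are placeholders rather than BenchClient's confirmed hooks, so follow the MMLU-Pro example for the actual interface.

    from benchflow import BenchClient


    class MyBenchClient(BenchClient):
        # NOTE: method names are illustrative placeholders; override whatever
        # hooks BenchClient actually defines (see the MMLU-Pro example).
        def prepare_input(self, raw_step_input):
            # Shape the benchmark's raw step data into the agent's input.
            return {"question": raw_step_input["question"]}

        def parse_response(self, raw_response):
            # Extract the answer string from the agent's reply for scoring.
            return raw_response.strip()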

  3. Containerize Your Benchmark and Upload the Image to Docker Hub

    Ensure your benchmark can run in a single container without any additional steps. Below is an example Dockerfile for MMLU-Pro:

    FROM python:3.11-slim
    
    COPY . /app
    WORKDIR /app
    COPY scripts/entrypoint.sh /app/entrypoint.sh
    
    RUN chmod +x /app/entrypoint.sh
    RUN pip install -r requirements.txt
    
    ENTRYPOINT ["/app/entrypoint.sh"]
  4. Extend BaseBench to Run Your Benchmarks

    See this example for how MMLU-Pro extends BaseBench.
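
    In outline, a BaseBench subclass tells BenchFlow which Docker image to run and where the container writes its results and logs. The skeleton below is only a sketch of that idea, modeled loosely on the MMLU-Pro example; treat the method names as placeholders and override whatever abstract methods BaseBench actually declares.

    from benchflow import BaseBench


    class MyBench(BaseBench):
        # NOTE: method names are illustrative placeholders.
        def get_image_name(self) -> str:
            # The Docker Hub image built in the previous step.
            return "yourdockerhubuser/mybench:latest"

        def get_results_dir_in_container(self) -> str:
            return "/app/results"

        def get_log_files_dir_in_container(self) -> str:
            return "/app/logs"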

  5. Upload Your Benchmark into BenchFlow

    Go to the Benchmark Hub and click on +new benchmarks to upload your benchmark Git repository. Make sure you place the benchflow_interface.py file at the root of your project.


License

This project is licensed under the MIT License.
