BenchFlow is an Open-source Benchmark Hub and Eval Infra for AI production and benchmark developers.

Within the dashed box, you will find the interfaces (`BaseAgent`, `BenchClient`) provided by BenchFlow. As a benchmark user, you extend and implement the `BaseAgent` interface to interact with the benchmark: its `call_api` method receives a `step_input` that provides the input for each step of a task (a task may have one or more steps).
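A minimal sketch of such an agent is shown below. It assumes `BaseAgent` can be imported from the `benchflow` package and that implementing `call_api` is enough for a toy agent; the keys inside `task_step_inputs` vary from benchmark to benchmark, so nothing here is benchmark-specific.

```python
from typing import Any, Dict

from benchflow import BaseAgent


class EchoAgent(BaseAgent):
    """Illustrative agent: echoes the step input back as its action."""

    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        # The exact keys in task_step_inputs are benchmark-specific;
        # consult the benchmark's page on the Benchmark Hub. Here we
        # simply serialize whatever we were given.
        prompt = str(task_step_inputs)
        # A real agent would call an LLM (e.g. via the openai package)
        # and return its raw text response for this step.
        return f"echo: {prompt[:200]}"
```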
Before you start, please get your `BF_TOKEN` on BenchFlow.ai.
- **Install BenchFlow**

```bash
git clone https://github.com/benchflow-ai/benchflow.git
cd benchflow
pip install -e .
```
- **Browse Benchmarks**

Find benchmarks tailored to your needs on our Benchmark Hub.
- **Implement Your Agent**

Extend the `BaseAgent` interface:

```python
def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
    pass
```
Optional: You can include a `requirements.txt` file to install additional dependencies, such as `openai` and `requests`.
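For instance, a minimal `requirements.txt` covering the two packages mentioned above would simply list them (pin versions as needed):

```
openai
requests
```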
- **Test Your Agent**
Here is a quick example to run your agent:
```python
import os

from benchflow import load_benchmark
from benchflow.agents.webarena_openai import WebarenaAgent

# The benchmark name follows the schema: org_name/benchmark_name.
# You can obtain the benchmark name from the Benchmark Hub.
bench = load_benchmark(benchmark_name="benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))

your_agents = WebarenaAgent()

run_ids = bench.run(
    task_ids=[0],
    agents=your_agents,
    api={"provider": "openai", "model": "gpt-4o-mini", "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")},
    requirements_txt="webarena_requirements.txt",
    args={}
)

results = bench.get_results(run_ids)
```
The remaining steps are for benchmark developers who want to publish their own benchmarks on BenchFlow.

- **Install BenchFlow**

Install BenchFlow via pip:

```bash
pip install benchflow
```
- **Embed `BenchClient` into Your Benchmark Evaluation Scripts**

Refer to this example for how MMLU-Pro integrates `BenchClient`.
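To make the integration concrete, the self-contained sketch below illustrates the data flow that `BenchClient` mediates between a benchmark's evaluation loop and the agent under test. It deliberately uses a plain callable instead of the real `BenchClient` API, so every name here is illustrative rather than BenchFlow's actual interface; see the MMLU-Pro example for the real integration.

```python
from typing import Any, Callable, Dict

# Stand-in for the agent endpoint that BenchClient talks to in a real
# integration; here it is just a callable for illustration.
AgentFn = Callable[[Dict[str, Any]], str]


def evaluate(tasks: Dict[str, str], agent: AgentFn) -> float:
    """Toy evaluation loop: send each step input to the agent, compare
    its raw response against the expected answer, and return accuracy."""
    correct = 0
    for question, expected in tasks.items():
        step_input = {"prompt": question}          # benchmark -> agent
        raw_response = agent(step_input)           # agent -> benchmark
        correct += int(raw_response.strip() == expected)
    return correct / len(tasks)


if __name__ == "__main__":
    demo_tasks = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    print(evaluate(demo_tasks, lambda s: "4" if "2 + 2" in s["prompt"] else "Paris"))
```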
- **Containerize Your Benchmark and Upload the Image to Dockerhub**
Ensure your benchmark can run in a single container without any additional steps. Below is an example Dockerfile for MMLU-Pro:
```dockerfile
FROM python:3.11-slim

COPY . /app
WORKDIR /app

COPY scripts/entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh

RUN pip install -r requirements.txt

ENTRYPOINT ["/app/entrypoint.sh"]
```
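Once the Dockerfile builds cleanly, publishing the image follows the usual Docker Hub workflow; the image name below is a placeholder for your own account and benchmark:

```bash
# Build the image from the benchmark's root directory.
docker build -t your-dockerhub-user/your-benchmark:latest .

# Authenticate and push so the image is publicly pullable.
docker login
docker push your-dockerhub-user/your-benchmark:latest
```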
- **Extend `BaseBench` to Run Your Benchmarks**

See this example for how MMLU-Pro extends `BaseBench`.

- **Upload Your Benchmark into BenchFlow**
Go to the Benchmark Hub and click on `+new benchmarks` to upload your benchmark Git repository. Make sure you place the `benchflow_interface.py` file at the root of your project.
This project is licensed under the MIT License.