BenchFlow

BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI product developers and benchmark developers.

BenchFlow Diagram

Overview

Demo video: BenchFlow.Demo.2.0-720p-50fps.mp4

The dashed box in the diagram contains the interfaces provided by BenchFlow (BaseAgent and BenchClient). As a benchmark user, you extend and implement the BaseAgent interface to interact with a benchmark. Its call_api method receives a step input that supplies the input for each step of a task (a task may consist of one or more steps).
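
Conceptually, for each step the benchmark hands your agent a step input and expects a string response back. The snippet below only illustrates that contract; the keys shown are hypothetical, since the actual structure of a step input depends on the benchmark.

    # Hypothetical shape of a single step's input (real keys are benchmark-specific).
    step_input = {
        "task_id": 0,
        "instruction": "Find the price of the first item on the page.",
        "observation": "<html>...</html>",
    }

    # For every step, BenchFlow calls your agent and records the returned string:
    # response = your_agent.call_api(step_input)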

Quick Start for Benchmark Users

Before you start, please get your BF_TOKEN on BenchFlow.ai.

  1. Install BenchFlow

    git clone https://github.com/benchflow-ai/benchflow.git
    cd benchflow
    pip install -e .
  2. Browse Benchmarks

    Find benchmarks tailored to your needs on our Benchmark Hub.

  3. Implement Your Agent

    Extend the BaseAgent interface:

    class YourAgent(BaseAgent):
        def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
            # Return your agent's output for this task step.
            pass

    Optional: You can include a requirements.txt file to install additional dependencies, such as openai and requests.
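
    For example, a minimal agent can simply forward the step input to an LLM. The sketch below is one possible implementation rather than the packaged WebarenaAgent; it assumes BaseAgent is importable from the top-level benchflow package, that the whole task_step_inputs dict can serve as the prompt (real agents should pick out the benchmark-specific keys), and that openai is listed in your requirements.txt.

    import os
    from typing import Any, Dict

    from openai import OpenAI

    from benchflow import BaseAgent


    class SimpleOpenAIAgent(BaseAgent):
        def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
            # Turn the step input into a prompt. Real agents should extract
            # the benchmark-specific keys instead of dumping the whole dict.
            prompt = str(task_step_inputs)
            client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
            completion = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return completion.choices[0].message.content or ""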

  4. Test Your Agent

    Here is a quick example to run your agent:

    import os
    from benchflow import load_benchmark
    from benchflow.agents.webarena_openai import WebarenaAgent
    
    # The benchmark name follows the schema: org_name/benchmark_name.
    # You can obtain the benchmark name from the Benchmark Hub.
    bench = load_benchmark(benchmark_name="benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))
    
    your_agent = WebarenaAgent()
    
    run_ids = bench.run(
        task_ids=[0],
        agents=your_agent,
        api={"provider": "openai", "model": "gpt-4o-mini", "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")},
        requirements_txt="webarena_requirements.txt",
        args={}
    )
    
    results = bench.get_results(run_ids)
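
    Once the run completes, you can take a quick look at what came back. A minimal follow-up, assuming get_results returns a JSON-serializable structure (the exact schema depends on the benchmark):

    import json

    # Dump the results for inspection; the schema is benchmark-specific.
    print(json.dumps(results, indent=2, default=str))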

Quick Start for Benchmark Developers

  1. Install BenchFlow

    Install BenchFlow via pip:

    pip install benchflow
  2. Embed BenchClient into Your Benchmark Evaluation Scripts

    Refer to this example for how MMLU-Pro integrates BenchClient.
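
    At a high level, the evaluation script wraps each step in a BenchClient subclass that forwards the benchmark's input to the agent under test and interprets the reply. The skeleton below only sketches that flow; the method names used here (prepare_input, parse_response) are placeholders rather than BenchClient's confirmed hooks, so follow the MMLU-Pro example for the actual interface.

    from benchflow import BenchClient


    class MyBenchClient(BenchClient):
        # NOTE: method names are illustrative placeholders; override whatever
        # hooks BenchClient actually defines (see the MMLU-Pro example).
        def prepare_input(self, raw_step_input):
            # Shape the benchmark's raw step data into the agent's input.
            return {"question": raw_step_input["question"]}

        def parse_response(self, raw_response):
            # Extract the answer string from the agent's reply for scoring.
            return raw_response.strip()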

  3. Containerize Your Benchmark and Upload the Image to Docker Hub

    Ensure your benchmark can run in a single container without any additional steps. Below is an example Dockerfile for MMLU-Pro:

    FROM python:3.11-slim
    
    COPY . /app
    WORKDIR /app
    COPY scripts/entrypoint.sh /app/entrypoint.sh
    
    RUN chmod +x /app/entrypoint.sh
    RUN pip install -r requirements.txt
    
    ENTRYPOINT ["/app/entrypoint.sh"]
  4. Extend BaseBench to Run Your Benchmarks

    See this example for how MMLU-Pro extends BaseBench.
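
    In outline, a BaseBench subclass tells BenchFlow which Docker image to run and where the container writes its results and logs. The skeleton below is only a sketch of that idea, modeled loosely on the MMLU-Pro example; treat the method names as placeholders and override whatever abstract methods BaseBench actually declares.

    from benchflow import BaseBench


    class MyBench(BaseBench):
        # NOTE: method names are illustrative placeholders.
        def get_image_name(self) -> str:
            # The Docker Hub image built in the previous step.
            return "yourdockerhubuser/mybench:latest"

        def get_results_dir_in_container(self) -> str:
            return "/app/results"

        def get_log_files_dir_in_container(self) -> str:
            return "/app/logs"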

  5. Upload Your Benchmark into BenchFlow

    Go to the Benchmark Hub and click on +new benchmarks to upload your benchmark Git repository. Make sure you place the benchflow_interface.py file at the root of your project.


License

This project is licensed under the MIT License.
