# MCP-Universe Repository Management Benchmark Integration

This directory contains the integration of the MCP-Universe repository management benchmark into WAGS.

## Overview

MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evaluates LLMs on realistic tasks using real-world MCP servers. This integration focuses on the **repository management domain** with:

- **28 pure GitHub tasks** (github_task_0001 through github_task_0030, excluding 0013 and 0020)
- Tests realistic GitHub operations, including:
  - Creating repositories and branches
  - Managing files and commits
  - Creating pull requests
  - Copying files between repositories
  - Managing issues and labels

## Quick Start

### Prerequisites

1. **Docker** - REQUIRED to run the GitHub MCP server
   - Install Docker Desktop: https://www.docker.com/products/docker-desktop
   - **Start Docker Desktop** before running tests
   - Verify installation: `docker --version`
   - **Note**: Using pinned version v0.15.0 for research reproducibility (before PR #1091, which added automatic instruction generation)
   - If the `docker` command is not found, ensure Docker Desktop is running and restart your terminal
2. **GitHub Personal Access Token** - For GitHub API access
   - **CRITICAL**: Use a dedicated test GitHub account for safety
   - Create a token at https://github.com/settings/tokens, then verify it with the snippet after this list
3. **OpenAI API Key** (or Anthropic for Claude models) - For running the LLM agent
4. **Python 3.13+** with [uv](https://docs.astral.sh/uv/)
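
To sanity-check the token before running anything, you can query the GitHub API directly (a minimal check, independent of the test suite):

```bash
# Should print your test account's login if the token is valid
curl -s -H "Authorization: Bearer $GITHUB_PERSONAL_ACCESS_TOKEN" \
  https://api.github.com/user | grep '"login"'
```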

### Installation

```bash
# Clone the repository (if not already done)
git clone https://github.com/chughtapan/wags.git
cd wags

# Install dependencies (pulls the forked MCP-Universe package via eval extras)
UV_GIT_LFS=1 uv pip install -e ".[dev,evals]"

# Verify Docker is working
docker run --rm hello-world

# Pre-pull the GitHub MCP server image (recommended for faster test startup)
docker pull ghcr.io/github/github-mcp-server:v0.15.0
```

**Note**: The `.[dev,evals]` extras install:
- `mcpuniverse` from the fork [`vinamra57/MCP-Universe@72389d8`](https://github.com/vinamra57/MCP-Universe/tree/72389d8a04044dceb855f733a938d0344ac58813), which removes heavy 3D dependencies while keeping the repository-management configs
- `bfcl-eval` for Berkeley Function Call Leaderboard evaluation
- Other shared evaluation dependencies

All repository management task JSON files are bundled inside the installed `mcpuniverse` wheel, so no git submodules or manual data checkouts are required.
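
To confirm the fork installed correctly, you can import the package and print its location (a quick sanity check; `mcpuniverse` is the import name implied by the extras above):

```bash
uv run python -c "import mcpuniverse; print(mcpuniverse.__file__)"
```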

### Configuration

Environment variables are automatically loaded from `servers/github/.env`. Create this file with:

```bash
# servers/github/.env
GITHUB_PERSONAL_ACCESS_TOKEN=your_github_token_here
GITHUB_PERSONAL_ACCOUNT_NAME=your_github_username
OPENAI_API_KEY=your_openai_key_here
```

**IMPORTANT**: Use a dedicated test GitHub account. The AI agent will perform real operations on GitHub repositories.

Alternatively, you can manually export the environment variables:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN="your_github_token_here"
export GITHUB_PERSONAL_ACCOUNT_NAME="your_github_username"
export OPENAI_API_KEY="your_openai_key_here"
```
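
If you exported the variables manually, a quick shell loop can confirm all three are visible before a run (a throwaway check, not part of the repo; values loaded only from `servers/github/.env` won't show up here):

```bash
for var in GITHUB_PERSONAL_ACCESS_TOKEN GITHUB_PERSONAL_ACCOUNT_NAME OPENAI_API_KEY; do
  # ${!var} is bash indirect expansion: the value of the variable named by $var
  [ -n "${!var}" ] && echo "$var is set" || echo "$var is MISSING"
done
```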

### Running Tests

Run all 28 repository management tasks:

```bash
uv run pytest tests/benchmarks/mcp_universe/test_mcp_universe.py \
  --model gpt-4o-mini \
  --output-dir outputs/mcp_universe \
  -v
```

Run a single task:

```bash
uv run pytest tests/benchmarks/mcp_universe/test_mcp_universe.py::test_mcp_universe[github_task_0001] \
  --model gpt-4o-mini \
  --output-dir outputs/mcp_universe \
  -v
```

Run with different models:

```bash
# Use GPT-4o
uv run pytest tests/benchmarks/mcp_universe/test_mcp_universe.py \
  --model gpt-4o \
  --output-dir outputs/mcp_universe

# Use Claude (requires ANTHROPIC_API_KEY)
uv run pytest tests/benchmarks/mcp_universe/test_mcp_universe.py \
  --model claude-3-5-sonnet-20241022 \
  --output-dir outputs/mcp_universe
```

### Validate Mode

If you have existing output files, you can validate them without re-running the agent:

```bash
uv run pytest tests/benchmarks/mcp_universe/test_mcp_universe.py \
  --validate-only \
  --log-dir outputs/mcp_universe/raw \
  --output-dir outputs/mcp_universe
```
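
After a run, results are written under the directory given by `--output-dir`; the `--log-dir outputs/mcp_universe/raw` flag above suggests raw per-task logs live in a `raw/` subdirectory (assumed layout, exact filenames may differ):

```bash
ls outputs/mcp_universe/raw   # raw per-task agent logs (assumed location)
ls outputs/mcp_universe       # evaluation outputs
```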