Commit 0a4bfc5

Refactor + cleanup evals (#4)
1 parent d17ed14 commit 0a4bfc5

File tree

217 files changed (+1854, −55191 lines)


.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -52,7 +52,6 @@ htmlcov/
 fastagent.secrets.yaml
 outputs/
 output*/
-fastagent.config.yaml
 fastagent.jsonl
 test_script_*.py
 .claude/
```

.gitmodules

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,3 +1,3 @@
-[submodule "bfcl"]
-	path = submodules/bfcl
-	url = https://github.com/ShishirPatil/gorilla.git
+[submodule "bfcl-data"]
+	path = tests/benchmarks/bfcl/data
+	url = https://github.com/ShishirPatil/gorilla
```
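
The rewritten submodule entry uses git-config (INI-style) syntax. As an illustration only, it can be inspected with Python's stdlib `configparser`; the string below mirrors the lines added in this commit:

```python
import configparser

# INI-style content mirroring the new .gitmodules entry from this commit.
gitmodules = """
[submodule "bfcl-data"]
path = tests/benchmarks/bfcl/data
url = https://github.com/ShishirPatil/gorilla
"""

cfg = configparser.ConfigParser()
cfg.read_string(gitmodules)
print(cfg['submodule "bfcl-data"']["path"])  # → tests/benchmarks/bfcl/data
```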

README.md

Lines changed: 37 additions & 0 deletions
````diff
@@ -182,6 +182,43 @@ mkdocs serve
 .venv/bin/mypy src/ --ignore-missing-imports
 ```
 
+## Running Benchmarks
+
+WAGS includes evaluation support for the Berkeley Function Call Leaderboard (BFCL). To run benchmarks:
+
+### Setup
+
+If you cloned the repository without submodules, initialize them:
+
+```bash
+# One-time setup: Initialize the data submodule
+git submodule update --init --recursive
+```
+
+If you already have the repository set up, just ensure submodules are current:
+
+```bash
+# Update to latest data
+git submodule update --remote
+```
+
+### Run Tests
+
+```bash
+# Run all BFCL tests
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py
+
+# Run specific test
+.venv/bin/pytest 'tests/benchmarks/bfcl/test_bfcl.py::test_bfcl[multi_turn_base_121]'
+
+# Run with specific model
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o
+```
+
+For detailed information about:
+- **Benchmark architecture and test categories**: See [docs/benchmarks.md](docs/benchmarks.md)
+- **Test organization and patterns**: See [tests/README.md](tests/README.md)
+
 ## License
 
 Apache 2.0
````

docs/benchmarks.md

Lines changed: 95 additions & 0 deletions
````diff
@@ -0,0 +1,95 @@
+# Running Benchmarks
+
+<em class="wags-brand">wags</em> includes evaluation support for the [Berkeley Function Call Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html), enabling systematic testing of LLM function calling capabilities across multi-turn conversations.
+
+## Setup
+
+### First Time Setup
+
+If you cloned the repository without submodules:
+
+```bash
+# Initialize the data submodule
+git submodule update --init --recursive
+```
+
+### Updating Data
+
+If you already have the submodule initialized:
+
+```bash
+# Update to latest test data
+git submodule update --remote
+```
+
+## Running Tests
+
+### Basic Usage
+
+```bash
+# Run all BFCL multi-turn tests
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py
+
+# Run specific test
+.venv/bin/pytest 'tests/benchmarks/bfcl/test_bfcl.py::test_bfcl[multi_turn_base_121]'
+
+# Run test category
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -k "multi_turn_miss_func"
+```
+
+### With Different Models
+
+```bash
+# Use GPT-4o (default)
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o
+
+# Use Claude
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model claude-3-5-sonnet-20241022
+
+# Use GPT-4o-mini
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o-mini
+```
+
+### Custom Output Directory
+
+```bash
+# Save results to specific directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --output-dir outputs/experiment1
+```
+
+### Validation Mode
+
+Validate existing logs without running new tests:
+
+```bash
+# Validate logs from default directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only
+
+# Validate logs from specific directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only --log-dir outputs/experiment1/raw
+```
+
+## Test Categories
+
+- **multi_turn_base**: Standard multi-turn function calling (800 tests)
+- **multi_turn_miss_func**: Tests handling of missing function scenarios
+- **multi_turn_miss_param**: Tests handling of missing parameters
+- **multi_turn_long_context**: Context window stress tests with overwhelming information
+- **Memory tests**: Tests with key-value, vector, or recursive summarization backends
+
+
+## Developer Guide
+
+1. **Discovery**: pytest collects tests from `loader.find_all_test_ids()`
+2. **Setup**: Creates MCP servers wrapping BFCL API classes using `uv run python`
+3. **Execution**: Runs multi-turn conversations with FastAgent
+4. **Serialization**: Saves complete message history to `complete.json`
+5. **Extraction**: Extracts tool calls from JSON (preserves what FastAgent drops)
+6. **Validation**: Uses BFCL validators to check correctness
+7. **Result**: Pass/fail based on `validation["valid"]`
+
+## Further Reading
+
+- **Test organization and patterns**: See [tests/README.md](../tests/README.md)
+- **BFCL leaderboard**: Visit [gorilla.cs.berkeley.edu](https://gorilla.cs.berkeley.edu/leaderboard.html)
+- **Official BFCL repository**: [github.com/ShishirPatil/gorilla](https://github.com/ShishirPatil/gorilla)
````
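
The seven-step Developer Guide pipeline added in docs/benchmarks.md can be sketched in miniature. Everything below is illustrative: the discovery list, the message-history shape, and the validator are simplified stand-ins, not the project's actual loader or BFCL's real validator APIs.

```python
import json

def find_all_test_ids():
    # Stand-in for the loader's discovery step (step 1); the real
    # version enumerates test ids from the BFCL data submodule.
    return ["multi_turn_base_121", "multi_turn_miss_func_3"]

def extract_tool_calls(serialized_history):
    # Step 5: recover tool calls from the saved JSON message history,
    # since the serialized log preserves fields the agent API drops.
    history = json.loads(serialized_history)
    return [m["tool_call"] for m in history if "tool_call" in m]

def validate(tool_calls):
    # Step 6: toy validator; BFCL's real validators compare the calls
    # against per-test ground truth.
    return {"valid": len(tool_calls) > 0}

# Steps 3-4: pretend a conversation ran and was saved to complete.json.
complete_json = json.dumps([
    {"role": "user", "content": "What's the weather in SF?"},
    {"role": "assistant", "tool_call": {"name": "get_weather", "args": {"city": "SF"}}},
])

# Step 7: pass/fail hinges on validation["valid"].
validation = validate(extract_tool_calls(complete_json))
print(validation["valid"])  # → True
```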

fastagent.config.yaml

Lines changed: 16 additions & 0 deletions
```diff
@@ -0,0 +1,16 @@
+default_model: "openai.gpt-4o-mini"
+
+mcp:
+  servers:
+    minimal_test:
+      command: python
+      args: ["/Users/tapanc/dev/elicitation_evals/test_minimal_mcp.py"]
+
+logger:
+  level: debug
+  type: file
+  path: /Users/tapanc/dev/elicitation_evals/minimal_test.jsonl
+  show_chat: true
+  show_tools: true
+  truncate_tools: false
+  progress_display: false
```
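
With `type: file` and a `.jsonl` path, the logger writes one JSON object per line. A quick sketch of reading such a log back; the record fields shown here are hypothetical, not FastAgent's actual schema:

```python
import json

# Hypothetical JSONL log lines; real FastAgent log records will differ.
log_text = (
    '{"level": "debug", "event": "tool_call", "server": "minimal_test"}\n'
    '{"level": "debug", "event": "tool_result", "server": "minimal_test"}\n'
)

# Parse one JSON object per non-empty line.
events = [json.loads(line) for line in log_text.splitlines() if line.strip()]
print(len(events))  # → 2
```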

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -43,6 +43,7 @@ nav:
     - Roots: middleware/roots.md
     - Elicitation: middleware/elicitation.md
     - Todo: middleware/todo.md
+    - Benchmarks: benchmarks.md
 
 plugins:
   - search
```

pyproject.toml

Lines changed: 4 additions & 1 deletion
```diff
@@ -16,7 +16,6 @@ dependencies = [
     "rich>=13.0.0",
     "jinja2>=3.0.0",
     "mpmath>=1.3.0",  # TODO: Remove (only used for BFCL evals)
-
 ]
 
 [project.scripts]
@@ -31,8 +30,12 @@ dev = [
     "pytest-asyncio>=0.21",
     "black",
     "ruff",
+    "bfcl-eval",
 ]
 
+[tool.uv.sources]
+bfcl-eval = { git = "https://github.com/chughtapan/gorilla.git", subdirectory = "berkeley-function-call-leaderboard", branch = "wags-dev" }
+
 [tool.black]
 line-length = 100
 target-version = ['py313']
```
Lines changed: 17 additions & 0 deletions
```diff
@@ -0,0 +1,17 @@
+mcp:
+  servers:
+    github:
+      transport: stdio
+      command: wags
+      args:
+        - run
+        - servers/github
+      env:
+        GITHUB_PERSONAL_ACCESS_TOKEN: ${GITHUB_PERSONAL_ACCESS_TOKEN}
+      roots:
+        - uri: file:///Users/tapanc/dev/wags
+          name: "Local Development Folder"
+        - uri: https://github.com/anthropics/courses
+          name: "Anthropics Courses"
+        - uri: https://github.com/modelcontextprotocol/
+          name: "MCP Organization"
```
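
The `${GITHUB_PERSONAL_ACCESS_TOKEN}` placeholder suggests shell-style environment substitution when the config is loaded (an assumption; the expansion mechanism isn't shown in this commit). Python's `os.path.expandvars` demonstrates the idea:

```python
import os

# Hypothetical token value, set only for this demonstration.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "ghp_example_token"

raw = "${GITHUB_PERSONAL_ACCESS_TOKEN}"
token = os.path.expandvars(raw)  # shell-style ${VAR} substitution
print(token)  # → ghp_example_token
```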

src/evals/README.md

Lines changed: 0 additions & 119 deletions
This file was deleted.

src/evals/__init__.py

Lines changed: 0 additions & 12 deletions
This file was deleted.
