Commit 0a4bfc5

Refactor + cleanup evals (#4)
1 parent d17ed14 commit 0a4bfc5

File tree

217 files changed (+1854, −55191 lines)


.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -52,7 +52,6 @@ htmlcov/
 fastagent.secrets.yaml
 outputs/
 output*/
-fastagent.config.yaml
 fastagent.jsonl
 test_script_*.py
 .claude/
```

.gitmodules

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,3 +1,3 @@
-[submodule "bfcl"]
-	path = submodules/bfcl
-	url = https://github.com/ShishirPatil/gorilla.git
+[submodule "bfcl-data"]
+	path = tests/benchmarks/bfcl/data
+	url = https://github.com/ShishirPatil/gorilla
```
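
The rewritten submodule entry uses git-config (INI-style) syntax. As an illustration only, it can be inspected with Python's stdlib `configparser`; the string below mirrors the lines added in this commit:

```python
import configparser

# INI-style content mirroring the new .gitmodules entry from this commit.
gitmodules = """
[submodule "bfcl-data"]
path = tests/benchmarks/bfcl/data
url = https://github.com/ShishirPatil/gorilla
"""

cfg = configparser.ConfigParser()
cfg.read_string(gitmodules)
print(cfg['submodule "bfcl-data"']["path"])  # → tests/benchmarks/bfcl/data
```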

README.md

Lines changed: 37 additions & 0 deletions
````diff
@@ -182,6 +182,43 @@ mkdocs serve
 .venv/bin/mypy src/ --ignore-missing-imports
 ```
 
+## Running Benchmarks
+
+WAGS includes evaluation support for the Berkeley Function Call Leaderboard (BFCL). To run benchmarks:
+
+### Setup
+
+If you cloned the repository without submodules, initialize them:
+
+```bash
+# One-time setup: Initialize the data submodule
+git submodule update --init --recursive
+```
+
+If you already have the repository set up, just ensure submodules are current:
+
+```bash
+# Update to latest data
+git submodule update --remote
+```
+
+### Run Tests
+
+```bash
+# Run all BFCL tests
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py
+
+# Run specific test
+.venv/bin/pytest 'tests/benchmarks/bfcl/test_bfcl.py::test_bfcl[multi_turn_base_121]'
+
+# Run with specific model
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o
+```
+
+For detailed information about:
+- **Benchmark architecture and test categories**: See [docs/benchmarks.md](docs/benchmarks.md)
+- **Test organization and patterns**: See [tests/README.md](tests/README.md)
+
 ## License
 
 Apache 2.0
````

docs/benchmarks.md

Lines changed: 95 additions & 0 deletions
````diff
@@ -0,0 +1,95 @@
+# Running Benchmarks
+
+<em class="wags-brand">wags</em> includes evaluation support for the [Berkeley Function Call Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html), enabling systematic testing of LLM function calling capabilities across multi-turn conversations.
+
+## Setup
+
+### First Time Setup
+
+If you cloned the repository without submodules:
+
+```bash
+# Initialize the data submodule
+git submodule update --init --recursive
+```
+
+### Updating Data
+
+If you already have the submodule initialized:
+
+```bash
+# Update to latest test data
+git submodule update --remote
+```
+
+## Running Tests
+
+### Basic Usage
+
+```bash
+# Run all BFCL multi-turn tests
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py
+
+# Run specific test
+.venv/bin/pytest 'tests/benchmarks/bfcl/test_bfcl.py::test_bfcl[multi_turn_base_121]'
+
+# Run test category
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -k "multi_turn_miss_func"
+```
+
+### With Different Models
+
+```bash
+# Use GPT-4o (default)
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o
+
+# Use Claude
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model claude-3-5-sonnet-20241022
+
+# Use GPT-4o-mini
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o-mini
+```
+
+### Custom Output Directory
+
+```bash
+# Save results to specific directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --output-dir outputs/experiment1
+```
+
+### Validation Mode
+
+Validate existing logs without running new tests:
+
+```bash
+# Validate logs from default directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only
+
+# Validate logs from specific directory
+.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only --log-dir outputs/experiment1/raw
+```
+
+## Test Categories
+
+- **multi_turn_base**: Standard multi-turn function calling (800 tests)
+- **multi_turn_miss_func**: Tests handling of missing function scenarios
+- **multi_turn_miss_param**: Tests handling of missing parameters
+- **multi_turn_long_context**: Context window stress tests with overwhelming information
+- **Memory tests**: Tests with key-value, vector, or recursive summarization backends
+
+
+## Developer Guide
+
+1. **Discovery**: pytest collects tests from `loader.find_all_test_ids()`
+2. **Setup**: Creates MCP servers wrapping BFCL API classes using `uv run python`
+3. **Execution**: Runs multi-turn conversations with FastAgent
+4. **Serialization**: Saves complete message history to `complete.json`
+5. **Extraction**: Extracts tool calls from JSON (preserves what FastAgent drops)
+6. **Validation**: Uses BFCL validators to check correctness
+7. **Result**: Pass/fail based on `validation["valid"]`
+
+## Further Reading
+
+- **Test organization and patterns**: See [tests/README.md](../tests/README.md)
+- **BFCL leaderboard**: Visit [gorilla.cs.berkeley.edu](https://gorilla.cs.berkeley.edu/leaderboard.html)
+- **Official BFCL repository**: [github.com/ShishirPatil/gorilla](https://github.com/ShishirPatil/gorilla)
````
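
The seven-step Developer Guide pipeline added in docs/benchmarks.md can be sketched in miniature. Everything below is illustrative: the discovery list, the message-history shape, and the validator are simplified stand-ins, not the project's actual loader or BFCL's real validator APIs.

```python
import json

def find_all_test_ids():
    # Stand-in for the loader's discovery step (step 1); the real
    # version enumerates test ids from the BFCL data submodule.
    return ["multi_turn_base_121", "multi_turn_miss_func_3"]

def extract_tool_calls(serialized_history):
    # Step 5: recover tool calls from the saved JSON message history,
    # since the serialized log preserves fields the agent API drops.
    history = json.loads(serialized_history)
    return [m["tool_call"] for m in history if "tool_call" in m]

def validate(tool_calls):
    # Step 6: toy validator; BFCL's real validators compare the calls
    # against per-test ground truth.
    return {"valid": len(tool_calls) > 0}

# Steps 3-4: pretend a conversation ran and was saved to complete.json.
complete_json = json.dumps([
    {"role": "user", "content": "What's the weather in SF?"},
    {"role": "assistant", "tool_call": {"name": "get_weather", "args": {"city": "SF"}}},
])

# Step 7: pass/fail hinges on validation["valid"].
validation = validate(extract_tool_calls(complete_json))
print(validation["valid"])  # → True
```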

fastagent.config.yaml

Lines changed: 16 additions & 0 deletions
```diff
@@ -0,0 +1,16 @@
+default_model: "openai.gpt-4o-mini"
+
+mcp:
+  servers:
+    minimal_test:
+      command: python
+      args: ["/Users/tapanc/dev/elicitation_evals/test_minimal_mcp.py"]
+
+logger:
+  level: debug
+  type: file
+  path: /Users/tapanc/dev/elicitation_evals/minimal_test.jsonl
+  show_chat: true
+  show_tools: true
+  truncate_tools: false
+  progress_display: false
```
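
With `type: file` and a `.jsonl` path, the logger writes one JSON object per line. A quick sketch of reading such a log back; the record fields shown here are hypothetical, not FastAgent's actual schema:

```python
import json

# Hypothetical JSONL log lines; real FastAgent log records will differ.
log_text = (
    '{"level": "debug", "event": "tool_call", "server": "minimal_test"}\n'
    '{"level": "debug", "event": "tool_result", "server": "minimal_test"}\n'
)

# Parse one JSON object per non-empty line.
events = [json.loads(line) for line in log_text.splitlines() if line.strip()]
print(len(events))  # → 2
```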

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -43,6 +43,7 @@ nav:
     - Roots: middleware/roots.md
     - Elicitation: middleware/elicitation.md
     - Todo: middleware/todo.md
+    - Benchmarks: benchmarks.md
 
 plugins:
   - search
```

pyproject.toml

Lines changed: 4 additions & 1 deletion
```diff
@@ -16,7 +16,6 @@ dependencies = [
     "rich>=13.0.0",
     "jinja2>=3.0.0",
     "mpmath>=1.3.0",  # TODO: Remove (only used for BFCL evals)
-
 ]
 
 [project.scripts]
@@ -31,8 +30,12 @@ dev = [
     "pytest-asyncio>=0.21",
     "black",
     "ruff",
+    "bfcl-eval",
 ]
 
+[tool.uv.sources]
+bfcl-eval = { git = "https://github.com/chughtapan/gorilla.git", subdirectory = "berkeley-function-call-leaderboard", branch = "wags-dev" }
+
 [tool.black]
 line-length = 100
 target-version = ['py313']
```
Lines changed: 17 additions & 0 deletions
```diff
@@ -0,0 +1,17 @@
+mcp:
+  servers:
+    github:
+      transport: stdio
+      command: wags
+      args:
+        - run
+        - servers/github
+      env:
+        GITHUB_PERSONAL_ACCESS_TOKEN: ${GITHUB_PERSONAL_ACCESS_TOKEN}
+      roots:
+        - uri: file:///Users/tapanc/dev/wags
+          name: "Local Development Folder"
+        - uri: https://github.com/anthropics/courses
+          name: "Anthropics Courses"
+        - uri: https://github.com/modelcontextprotocol/
+          name: "MCP Organization"
```
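
The `${GITHUB_PERSONAL_ACCESS_TOKEN}` placeholder suggests shell-style environment substitution when the config is loaded (an assumption; the expansion mechanism isn't shown in this commit). Python's `os.path.expandvars` demonstrates the idea:

```python
import os

# Hypothetical token value, set only for this demonstration.
os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = "ghp_example_token"

raw = "${GITHUB_PERSONAL_ACCESS_TOKEN}"
token = os.path.expandvars(raw)  # shell-style ${VAR} substitution
print(token)  # → ghp_example_token
```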

src/evals/README.md

Lines changed: 0 additions & 119 deletions
This file was deleted.

src/evals/__init__.py

Lines changed: 0 additions & 12 deletions
This file was deleted.
