Merged
17 changes: 17 additions & 0 deletions .github/actions/setup-environment/action.yml
@@ -0,0 +1,17 @@
name: Setup Environment
description: Setup Python, uv, and install dev dependencies

runs:
  using: composite
  steps:
    - name: Setup uv
      uses: astral-sh/setup-uv@v3

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.13"

    - name: Install dependencies
      shell: bash
      run: uv pip install --system -e ".[dev]"
42 changes: 20 additions & 22 deletions .github/workflows/ci.yml
@@ -6,51 +6,49 @@ on:
branches: [main]

jobs:
  pre-commit:
  lint:
    runs-on: ubuntu-latest
    env:
      UV_NO_SYNC: 1
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup uv
        uses: astral-sh/setup-uv@v3

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"

      - name: Install dependencies
        run: uv pip install --system -e ".[dev]"
      - name: Setup environment
        uses: ./.github/actions/setup-environment

      - name: Install pre-commit
        run: uv pip install --system pre-commit

      - name: Run pre-commit
      - name: Run linting
        run: SKIP=mypy pre-commit run --all-files

  typecheck:
    runs-on: ubuntu-latest
    env:
      UV_NO_SYNC: 1
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup environment
        uses: ./.github/actions/setup-environment

      - name: Run mypy
        run: uv run mypy src/ tests/ --exclude 'tests/benchmarks/appworld/'

  test:
    runs-on: ubuntu-latest
    env:
      CI: 1
      UV_NO_SYNC: 1

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup uv
        uses: astral-sh/setup-uv@v3

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"

      - name: Install dependencies
        run: uv pip install --system -e ".[dev]"
      - name: Setup environment
        uses: ./.github/actions/setup-environment

      - name: Run unit tests
        run: uv run pytest tests/unit/ -v -p no:warnings
15 changes: 15 additions & 0 deletions docs/evals.md
@@ -87,6 +87,21 @@ Validate existing logs without running new tests:
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only --log-dir outputs/experiment1/raw
```

### Parallel Execution

Run tests in parallel using multiple workers:

```bash
# Run with 4 workers
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -n 4

# Run with 8 workers
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --dataset train -n 8

# Auto-detect number of CPUs
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -n auto
```
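
As a rough illustration of what the `-n` value means, the sketch below resolves it to a worker count. `resolve_worker_count` is a hypothetical helper, not part of pytest-xdist. It assumes `auto` maps to the logical CPU count reported by `os.cpu_count()`; pytest-xdist's own detection may differ (for example, it can prefer physical cores when `psutil` is installed).

```python
import os


def resolve_worker_count(n: str) -> int:
    """Hypothetical helper: turn a `-n` value into a number of workers."""
    if n == "auto":
        # Assumption: use the logical CPU count, falling back to 1
        # when the platform cannot report it.
        return os.cpu_count() or 1
    count = int(n)
    if count < 0:
        raise ValueError("worker count must not be negative")
    return count


print(resolve_worker_count("4"))     # 4
print(resolve_worker_count("auto"))  # depends on the machine
```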

## Further Reading

### BFCL
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -27,6 +27,8 @@ where = ["src"]
dev = [
    "pytest>=7.0",
    "pytest-asyncio>=0.21",
    "pytest-xdist>=3.0",
    "pytest-timeout>=2.0",
    "ruff",
    "mypy",
]
@@ -86,6 +88,8 @@ filterwarnings = [
    "ignore::sqlalchemy.exc.SADeprecationWarning:appworld.apps.lib.models.db",
    "ignore::pydantic.warnings.PydanticDeprecatedSince20:appworld.apps.*",
]
timeout = 0
timeout_method = "thread" # Use thread-based timeout for better async compatibility

[tool.mypy]
warn_return_any = true
9 changes: 9 additions & 0 deletions tests/benchmarks/appworld/README.md
@@ -54,6 +54,15 @@ pytest tests/benchmarks/appworld/test_appworld.py --validate-only
--temperature 0.001 # Temperature for sampling (default: 0.001)
```

### Parallel Execution
```bash
-n 4 # Run with 4 workers
-n 8 # Run with 8 workers
-n auto # Auto-detect number of CPUs
```

Example: `pytest tests/benchmarks/appworld/test_appworld.py --dataset train -n 4`

## File Structure

```
1 change: 1 addition & 0 deletions tests/benchmarks/appworld/test_appworld.py
@@ -50,6 +50,7 @@ def pytest_generate_tests(metafunc: pytest.Metafunc) -> None:


@pytest.mark.asyncio
@pytest.mark.timeout(300)
async def test_appworld(
    task_id: str,
    model: str,
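
Taken together, `timeout = 0` in `pyproject.toml` and the `@pytest.mark.timeout(300)` marker on `test_appworld` mean that only marked tests get a time limit. The sketch below illustrates that precedence. It assumes pytest-timeout's documented behavior that a per-test marker overrides the configured value and that 0 disables the timeout. `effective_timeout` is a hypothetical helper for illustration, not a pytest-timeout API.

```python
from typing import Optional


def effective_timeout(
    ini_timeout: float, marker_timeout: Optional[float]
) -> Optional[float]:
    """Return the time limit in seconds, or None when no limit applies."""
    # Assumption: a per-test marker takes precedence over the ini value.
    chosen = marker_timeout if marker_timeout is not None else ini_timeout
    # Assumption: a value of 0 (or less) means the timeout is disabled.
    return chosen if chosen > 0 else None


# test_appworld is marked with timeout(300); unmarked tests fall back to timeout = 0.
print(effective_timeout(0, 300))   # 300
print(effective_timeout(0, None))  # None
```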
1 change: 0 additions & 1 deletion tests/conftest.py
@@ -50,7 +50,6 @@ def pytest_addoption(parser: pytest.Parser) -> None:
    parser.addoption("--output-dir", default="outputs", help="Output directory for results")
    parser.addoption("--validate-only", action="store_true", help="Only validate existing logs")
    parser.addoption("--log-dir", default="outputs/raw", help="Directory with logs (for validate mode)")
    parser.addoption("--max-workers", default=4, type=int, help="Max concurrent tests (default: 4)")


def pytest_configure(config: pytest.Config) -> None: