Add AppWorld evals integration #16

chughtapan · 2025-10-23T01:26:38Z

No description provided.

Copilot

Pull Request Overview

This PR integrates AppWorld benchmark evaluation into the WAGS testing framework, enabling realistic task evaluation across 9 day-to-day apps. The integration includes a custom MCP server wrapper, API prediction helpers, pytest configuration, and comprehensive documentation.

Key changes:

Added AppWorld evaluation test suite with pytest integration for dataset-driven testing
Implemented MCP server wrapper for task-specific database initialization and API filtering
Created helper modules for system instruction rendering and API prediction
Updated project dependencies and CI configuration to support AppWorld

Reviewed Changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.


File	Description
tests/benchmarks/appworld/test_appworld.py	Main test implementation with pytest parametrization and task execution flow
tests/benchmarks/appworld/mcp_server.py	MCP server wrapper managing AppWorld task state and API execution
tests/benchmarks/appworld/appworld_helpers.py	Helper functions for API prediction and instruction template rendering
tests/benchmarks/appworld/conftest.py	Pytest fixtures for dataset, limit, and API mode configuration
tests/benchmarks/appworld/system_instruction.txt	System instruction template for agent behavior
tests/benchmarks/appworld/fastagent.config.yaml	FastAgent configuration for MCP server connection
tests/benchmarks/appworld/__init__.py	Package initialization marker
tests/benchmarks/appworld/.gitignore	Git ignore patterns for AppWorld data and outputs
tests/benchmarks/appworld/README.md	Comprehensive documentation for AppWorld benchmark usage
pyproject.toml	Updated dependencies with AppWorld packages and configuration
tests/README.md	Updated test documentation with AppWorld instructions
docs/evals.md	Added AppWorld setup and usage documentation
README.md	Updated main README with AppWorld benchmark information
.github/workflows/ci.yml	Modified CI workflow to exclude AppWorld from type checking
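To illustrate the dataset-driven flow that the table attributes to conftest.py and test_appworld.py, here is a minimal sketch of limit-aware parametrization. It is not the PR's code: the option names `--dataset` and `--limit`, the helper `select_tasks`, and the placeholder task IDs are all assumptions made for illustration, and the actual conftest.py may use fixtures rather than these hooks. Only the pytest hook names and the `parser.addoption` / `metafunc.parametrize` calls are standard pytest API.

```python
def select_tasks(task_ids, limit=None):
    """Return the first `limit` task IDs, or all of them when limit is None."""
    ids = list(task_ids)
    return ids if limit is None else ids[:limit]


def pytest_addoption(parser):
    # Hypothetical option names; the PR's conftest.py may define different ones.
    parser.addoption("--dataset", default="train", help="dataset split to load")
    parser.addoption("--limit", type=int, default=None, help="max tasks to run")


def pytest_generate_tests(metafunc):
    # Parametrize any test that asks for a `task_id` argument.
    if "task_id" in metafunc.fixturenames:
        limit = metafunc.config.getoption("--limit")
        # Placeholder IDs; a real run would load them from the chosen dataset.
        task_ids = ["task_a", "task_b", "task_c"]
        metafunc.parametrize("task_id", select_tasks(task_ids, limit))
```

With hooks like these in a conftest.py, passing `--limit 2` on the pytest command line would generate two test cases instead of three.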


Copilot · 2025-10-23T01:36:32Z

tests/benchmarks/appworld/test_appworld.py

+    )
+
+    assert test_tracker.success, (
+        f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown'}"


[nitpick] The error message reads test_tracker.failures[0] only when test_tracker.failures is non-empty, so the access is already guarded. The fallback string could be more descriptive, though: 'Unknown reason' reads better than 'Unknown' in the final message.

Suggested change

-        f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown'}"
+        f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown reason'}"
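The guarded access discussed in this comment can also be pulled out of the f-string, which makes the fallback easier to see. The following is a minimal sketch, not code from the PR: `FakeTracker` is a hypothetical stand-in for the real test_tracker object (only its `success` and `failures` attributes are taken from the snippet above), and `failure_message` is an illustrative helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTracker:
    """Hypothetical stand-in exposing the two attributes used in the assert."""
    success: bool = True
    failures: list = field(default_factory=list)


def failure_message(task_id, tracker):
    # Index into failures only when the list is non-empty; otherwise fall back.
    reason = tracker.failures[0] if tracker.failures else "Unknown reason"
    return f"Task {task_id} failed: {reason}"
```

A test could then write `assert tracker.success, failure_message(task_id, tracker)` and get the same message as the suggested change.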

Tapan Chugh added 4 commits October 22, 2025 15:03

update gitignore

004748a

Add AppWorld evals

f68ba26

nit: remove dead options

0b7401c

try to fix ci to work with appworld/

bc97147

chughtapan merged commit eeefb00 into main Oct 23, 2025
2 checks passed

chughtapan deleted the wags-datasets branch October 23, 2025 01:34

chughtapan requested a review from Copilot October 23, 2025 01:36

Copilot AI reviewed Oct 23, 2025
