@chughtapan
Owner

No description provided.

@chughtapan chughtapan merged commit eeefb00 into main Oct 23, 2025
2 checks passed
@chughtapan chughtapan deleted the wags-datasets branch October 23, 2025 01:34
@chughtapan chughtapan requested a review from Copilot October 23, 2025 01:36
Contributor

Copilot AI left a comment

Pull Request Overview

This PR integrates AppWorld benchmark evaluation into the WAGS testing framework, enabling realistic task evaluation across 9 day-to-day apps. The integration includes a custom MCP server wrapper, API prediction helpers, pytest configuration, and comprehensive documentation.

Key changes:

  • Added AppWorld evaluation test suite with pytest integration for dataset-driven testing
  • Implemented MCP server wrapper for task-specific database initialization and API filtering
  • Created helper modules for system instruction rendering and API prediction
  • Updated project dependencies and CI configuration to support AppWorld
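
As a rough illustration of the API-filtering idea in the second bullet (not the code from this PR), a wrapper could expose only the tools whose names appear in a per-task allow-list. Every name below (`ToolSpec`, `filter_tools`, the sample tool names) is hypothetical and chosen only for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for one API exposed as an MCP tool."""

    name: str
    description: str


def filter_tools(all_tools, allowed_names=None):
    """Return only the tools allowed for the current task.

    allowed_names=None means "no filtering"; an empty set hides every tool.
    """
    if allowed_names is None:
        return list(all_tools)
    allowed = set(allowed_names)
    # Preserve the original ordering of the tool list.
    return [tool for tool in all_tools if tool.name in allowed]


# Illustrative tool names only; they are not taken from AppWorld.
ALL_TOOLS = [
    ToolSpec("notes__search", "Search notes"),
    ToolSpec("notes__create", "Create a note"),
    ToolSpec("mail__send", "Send an email"),
]

predicted = {"notes__search", "mail__send"}
visible = filter_tools(ALL_TOOLS, predicted)
print([tool.name for tool in visible])
```

Treating `None` and an empty set differently keeps "no prediction available" distinct from "prediction says no APIs are needed", which may or may not match how the actual wrapper behaves.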

Reviewed Changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/benchmarks/appworld/test_appworld.py | Main test implementation with pytest parametrization and task execution flow |
| tests/benchmarks/appworld/mcp_server.py | MCP server wrapper managing AppWorld task state and API execution |
| tests/benchmarks/appworld/appworld_helpers.py | Helper functions for API prediction and instruction template rendering |
| tests/benchmarks/appworld/conftest.py | Pytest fixtures for dataset, limit, and API mode configuration |
| tests/benchmarks/appworld/system_instruction.txt | System instruction template for agent behavior |
| tests/benchmarks/appworld/fastagent.config.yaml | FastAgent configuration for MCP server connection |
| tests/benchmarks/appworld/__init__.py | Package initialization marker |
| tests/benchmarks/appworld/.gitignore | Git ignore patterns for AppWorld data and outputs |
| tests/benchmarks/appworld/README.md | Comprehensive documentation for AppWorld benchmark usage |
| pyproject.toml | Updated dependencies with AppWorld packages and configuration |
| tests/README.md | Updated test documentation with AppWorld instructions |
| docs/evals.md | Added AppWorld setup and usage documentation |
| README.md | Updated main README with AppWorld benchmark information |
| .github/workflows/ci.yml | Modified CI workflow to exclude AppWorld from type checking |

)

assert test_tracker.success, (
    f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown'}"
Copilot AI Oct 23, 2025

[nitpick] The conditional expression already guards test_tracker.failures[0] against an empty list, so the index access is safe. The fallback text 'Unknown' is terse, though; using 'Unknown reason' makes the failure message read more explicitly when no failure detail was recorded.

Suggested change
- f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown'}"
+ f"Task {task_id} failed: {test_tracker.failures[0] if test_tracker.failures else 'Unknown reason'}"
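
If the inline conditional feels dense, one possible alternative is to pull the message into a small helper. This is only a sketch: `TrackerStub` and `failure_message` are hypothetical names, and the stub models just the `success` and `failures` attributes that the assertion reads.

```python
from dataclasses import dataclass, field


@dataclass
class TrackerStub:
    """Hypothetical stand-in exposing only the attributes the assert uses."""

    success: bool
    failures: list = field(default_factory=list)


def failure_message(task_id, tracker):
    """Build the assertion message, falling back when no failure was recorded."""
    # An empty list is falsy, so failures[0] is only evaluated when it exists.
    reason = tracker.failures[0] if tracker.failures else "Unknown reason"
    return f"Task {task_id} failed: {reason}"


# "task_001" is an illustrative task id, not one from the dataset.
print(failure_message("task_001", TrackerStub(success=False)))
print(failure_message("task_001", TrackerStub(success=False, failures=["timeout"])))
```

The assertion would then read `assert test_tracker.success, failure_message(task_id, test_tracker)`.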

2 participants