Commit 9c99603

chughtapan and Tapan Chugh authored
Fix AppWorld evaluation and improve prompt (#20)
* Fix AppWorld evaluation and improve prompt

  - Fix: Save model_hashes.json to enable proper change detection
    * AppWorld's evaluator relies on model hash counters to detect DB changes
    * Without saving these hashes, evaluation fails even when agent completes
    * Added save_model_hashes=True in mcp_server.py:234

  - Improve: Remove turn limit from agent prompt
    * Removed max_steps parameter and references from system instruction
    * Eliminates artificial pressure on agent to be overly conservative
    * Improves task completion rate by reducing unnecessary batching

  - Increase test timeout from 5 to 15 minutes
    * Accounts for bearer token expiration issues with some tasks
    * Prevents premature timeout on complex tasks

  Verified with successful test runs on tasks 692c77d_2 and 22cc237_3.

* Update uv.lock with dependency markers
* Trigger CI
* Remove analysis scripts from working directory
* nit

---------

Co-authored-by: Tapan Chugh <[email protected]>
1 parent 8685e86 commit 9c99603

5 files changed: +44 -30 lines changed

tests/benchmarks/appworld/mcp_server.py

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ async def call_tool(name: str, arguments: dict[str, Any]) -> Any:
         # Save databases on task completion
         if api_name == "complete_task" or name == "supervisor__complete_task":
             Path(db_paths.output_db_path).mkdir(parents=True, exist_ok=True)
-            collections.model_collection.save(db_home_path=db_paths.output_db_path)
+            collections.model_collection.save(db_home_path=db_paths.output_db_path, save_model_hashes=True)
 
         return format_tool_response(response)
     except Exception as e:
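
The commit message says the evaluator relies on saved model hashes to detect database changes. As a rough illustration of that general idea only (this is not AppWorld's implementation, and `table_hashes` and `changed_tables` are hypothetical names), the sketch below fingerprints each table of a toy in-memory database and compares the result against a saved baseline. Without a persisted baseline there is nothing to compare against, which is the failure the fix addresses.

```python
import hashlib
import json


def table_hashes(db: dict[str, list[dict]]) -> dict[str, str]:
    """Return a stable SHA-256 fingerprint for every table in `db`."""
    return {
        name: hashlib.sha256(
            json.dumps(rows, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for name, rows in db.items()
    }


def changed_tables(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Names of tables whose fingerprint differs between two snapshots."""
    return {
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }


# Hypothetical data, for illustration only.
db = {"orders": [{"id": 1, "status": "open"}], "emails": []}
baseline = table_hashes(db)  # the snapshot that must be saved

db["orders"][0]["status"] = "paid"  # the agent's action mutates one table
print(changed_tables(baseline, table_hashes(db)))  # {'orders'}
```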

tests/benchmarks/appworld/prompts.py

Lines changed: 1 addition & 3 deletions
@@ -13,13 +13,12 @@
 EXPERIMENTS_PATH = Path(appworld_experiments.__file__).parent
 
 
-def load_system_instruction(task: Task, max_steps: int = 40) -> str:
+def load_system_instruction(task: Task) -> str:
     """
     Load and render system instruction from AppWorld's template with demo examples.
 
     Args:
         task: AppWorld Task object
-        max_steps: Maximum number of turns allowed
 
     Returns:
         Rendered system instruction with supervisor info, rules, and demos
@@ -40,7 +39,6 @@ def load_system_instruction(task: Task, max_steps: int = 40) -> str:
         template_content,
         main_user=task.supervisor,
         app_descriptions=app_descriptions_yaml,
-        max_steps=max_steps,
     )
 
     # Load demo messages and format them
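
Dropping `max_steps` from the signature breaks any caller that still passes it. A minimal sketch with stand-in definitions shows the failure mode; the `Task` dataclass and the function body here are hypothetical, not the repository's code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for AppWorld's Task object."""

    supervisor: str


def load_system_instruction(task: Task) -> str:
    """Stand-in with the new signature: no max_steps parameter."""
    return f"My name is: {task.supervisor}."


task = Task(supervisor="Ada Lovelace")
print(load_system_instruction(task))

try:
    # An old-style call site that still passes the removed argument.
    load_system_instruction(task, max_steps=40)
except TypeError as exc:
    print(f"stale call site: {exc}")
```

Searching the code base for `max_steps=` before merging a change like this is a cheap way to catch such call sites.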

tests/benchmarks/appworld/system_instruction.txt

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ My name is: {{ main_user.first_name }} {{ main_user.last_name }}. My personal em
 
 You will be given a task instruction and a list of functions in the standard format. The functions correspond to APIs from various apps you have access to. The function name has two parts, the app name and API name separated by "__", e.g., spotify__login is the login API for the Spotify app.
 
-You will complete the task completely autonomously through multi-turn interaction with the execution environment. In each turn, you will make one or more function calls, and the environment will return its outputs. This will continue either until you call `complete_task` API from the Supervisor app, or until a maximum of {max_steps} turns are reached.
+You will complete the task completely autonomously through multi-turn interaction with the execution environment. In each turn, you will make one or more function calls, and the environment will return its outputs. This will continue until you call `complete_task` API from the Supervisor app.
 
 Here are brief app-wise descriptions.
 
@@ -21,7 +21,7 @@ A. General instructions:
 - Never leave placeholders; don't output things like "your_username". Always fill in the real value by retrieving it via APIs (e.g., Supervisor app for credentials).
 - When I omit details, choose any valid value. For example, if I ask you to buy something but don't specify which payment card to use, you may pick any one of my available cards.
 - Avoid collateral damage. Only perform what I explicitly ask for. Example: if I ask you to buy something, do not delete emails, return the order, or perform unrelated account operations.
-- You only have {max_steps} turns. Avoid unnecessary requests. You can batch unlimited function calls in a single turn - always group them to save steps.
+- Avoid unnecessary requests.
 
 B. App-specific instructions:
 
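
When a template variable is removed, it is easy to leave a stray reference behind in the prompt text. A small check along these lines could guard against that. `find_placeholder` is a hypothetical helper, and the sample strings are abbreviated from the diff above.

```python
import re


def find_placeholder(template: str, name: str) -> list[str]:
    """Return every {name} or {{ name }} style reference found in `template`."""
    pattern = re.compile(r"\{\{?\s*" + re.escape(name) + r"\s*\}?\}")
    return pattern.findall(template)


old_line = "- You only have {max_steps} turns. Avoid unnecessary requests."
new_line = "- Avoid unnecessary requests."

print(find_placeholder(old_line, "max_steps"))  # ['{max_steps}']
print(find_placeholder(new_line, "max_steps"))  # []
```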

tests/benchmarks/appworld/test_appworld.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ def pytest_generate_tests(metafunc: pytest.Metafunc) -> None:
 
 
 @pytest.mark.asyncio
-@pytest.mark.timeout(300)
+@pytest.mark.timeout(900)
 async def test_appworld(
     task_id: str,
     model: str,
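
The `timeout` marker takes seconds, so the change from 300 to 900 is the 5 to 15 minute increase described in the commit message. As a stdlib-only analogue of a per-test time budget (this is not how the pytest plugin is implemented), `asyncio.wait_for` cancels a coroutine that overruns. The function names and the tiny durations below are illustrative assumptions.

```python
import asyncio

TEST_TIMEOUT_SECONDS = 15 * 60  # 900, the new budget in the diff


async def slow_task(duration: float) -> str:
    """Pretend to run a long benchmark task."""
    await asyncio.sleep(duration)
    return "done"


async def run_with_budget(duration: float, budget: float) -> str:
    """Run slow_task, reporting 'timed out' if it exceeds `budget` seconds."""
    try:
        return await asyncio.wait_for(slow_task(duration), timeout=budget)
    except asyncio.TimeoutError:
        return "timed out"


print(TEST_TIMEOUT_SECONDS)                     # 900
print(asyncio.run(run_with_budget(0.01, 1.0)))  # done
print(asyncio.run(run_with_budget(0.5, 0.05)))  # timed out
```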

uv.lock

Lines changed: 39 additions & 23 deletions
Diff not rendered (generated file).