Commit 9c99603
Fix AppWorld evaluation and improve prompt (#20)
* Fix AppWorld evaluation and improve prompt
- Fix: Save model_hashes.json to enable proper change detection
  * AppWorld's evaluator relies on model hash counters to detect DB changes
  * Without saving these hashes, evaluation fails even when the agent completes the task
  * Added save_model_hashes=True in mcp_server.py:234
- Improve: Remove turn limit from agent prompt
  * Removed max_steps parameter and references from system instruction
  * Eliminates artificial pressure on the agent to be overly conservative
  * Improves task completion rate by reducing unnecessary batching
- Increase test timeout from 5 to 15 minutes
  * Accounts for bearer token expiration issues with some tasks
  * Prevents premature timeout on complex tasks
Verified with successful test runs on tasks 692c77d_2 and 22cc237_3.
* Update uv.lock with dependency markers
* Trigger CI
* Remove analysis scripts from working directory
* nit
---------
Co-authored-by: Tapan Chugh <[email protected]>
1 parent 8685e86 commit 9c99603
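The first fix in the commit message depends on saved hashes being available so that an evaluator can tell which parts of the database changed. The sketch below illustrates that general idea only. It is not AppWorld's implementation: the function names `table_hashes`, `save_hashes`, and `changed_tables` are hypothetical, it uses SQLite from the standard library, and it hashes table contents rather than using the hash counters the commit message mentions.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def table_hashes(conn: sqlite3.Connection) -> dict:
    """Return {table_name: sha256 hex digest of its rows} (hypothetical helper)."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    hashes = {}
    for name in names:
        digest = hashlib.sha256()
        # Order by rowid so the digest is stable across calls.
        for row in conn.execute(f'SELECT * FROM "{name}" ORDER BY rowid'):
            digest.update(repr(row).encode("utf-8"))
        hashes[name] = digest.hexdigest()
    return hashes


def save_hashes(conn: sqlite3.Connection, path: Path) -> None:
    """Persist the current hashes, analogous in spirit to writing model_hashes.json."""
    path.write_text(json.dumps(table_hashes(conn), indent=2))


def changed_tables(conn: sqlite3.Connection, path: Path) -> list:
    """Compare the saved hashes with the current state; return tables that differ."""
    before = json.loads(path.read_text())
    after = table_hashes(conn)
    return sorted(name for name in after if before.get(name) != after[name])
```

If the snapshot file is never written, `changed_tables` has nothing to compare against, which mirrors the failure mode the commit describes: the agent's work is done, but the evaluator cannot see it.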
5 files changed in tests/benchmarks/appworld (+44 -30 lines changed)
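The timeout change described in the commit message (5 to 15 minutes) can be pictured with a small sketch. This is an assumption about how such a limit might be enforced, not the code in this commit: `TASK_TIMEOUT_SECONDS` and `run_task` are hypothetical names, and the real test harness may enforce the limit differently.

```python
import subprocess
import sys

# Hypothetical constant: 15 minutes, up from 5, per the commit message.
TASK_TIMEOUT_SECONDS = 15 * 60


def run_task(cmd, timeout=TASK_TIMEOUT_SECONDS):
    """Run one benchmark task as a subprocess.

    Returns (finished, returncode); finished is False when the timeout expired.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False, None
    return True, result.returncode


if __name__ == "__main__":
    finished, code = run_task([sys.executable, "-c", "print('task done')"])
    print(finished, code)
```

Returning a flag instead of raising lets a caller record a timed-out task as a distinct outcome rather than as a crash.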