~20% success (MultiAgent) in our AndroidWorld run; interleaved prompts adopted; requesting the full recipe to match your >70% on AndroidWorld. #15
Description
We’re consistently getting ≈20% success on AndroidWorld (episodes), running MobileUse in MultiAgent mode only.
Please share the exact, reproducible recipe you used to achieve >70% on AndroidWorld so we can match it (agent mode/config, prompts, model/provider params, vision/coords, step budgets, environment, and commit/branch).
We noticed an earlier issue recommending interleaving images at IMAGE_PLACEHOLDER positions (via a generate_message helper) to avoid <|image_pad|> token errors. Although that fix targeted a different report, we implemented the same pattern here to remove token‑related failures and to keep image–text pairing precise.
Note that our ~20% full‑suite results with MultiAgent predate the interleaving change. We have not yet re‑run the full suite with only interleaving applied because we’d like to align to your exact recipe first.
What We Changed (Prompt Construction)
- Interleaved image/text in all sub‑agents (Planner, Operator, Reflector, LongReflector, NoteTaker, Evaluator) using a generate_message helper:
- Splits the prompt on IMAGE_PLACEHOLDER and inserts data‑URI images exactly at each placeholder.
- Removed our wrapper‑side message/output mutations (no token scrubbing; no output normalization), so prompts/outputs flow unchanged.
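For reference, a minimal sketch of the interleaving pattern described above. It assumes OpenAI‑style chat content parts (`text` and `image_url`); the placeholder string, the function signature, and the `to_data_uri` helper are our own illustration and may not match the repository's actual generate_message.

```python
import base64

# Assumed marker string; the real IMAGE_PLACEHOLDER value may differ.
IMAGE_PLACEHOLDER = "<IMAGE_PLACEHOLDER>"


def to_data_uri(png_bytes):
    """Encode raw PNG bytes as a data URI (hypothetical helper)."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded


def generate_message(role, prompt, image_data_uris):
    """Split the prompt on IMAGE_PLACEHOLDER and insert one image per placeholder."""
    segments = prompt.split(IMAGE_PLACEHOLDER)
    if len(segments) - 1 != len(image_data_uris):
        raise ValueError(
            "expected %d images, got %d"
            % (len(segments) - 1, len(image_data_uris))
        )
    content = []
    for index, text in enumerate(segments):
        # Skip empty text segments so no blank text part is sent.
        if text.strip():
            content.append({"type": "text", "text": text})
        # An image follows every segment except the last one.
        if index < len(image_data_uris):
            content.append(
                {"type": "image_url", "image_url": {"url": image_data_uris[index]}}
            )
    return {"role": role, "content": content}
```

Raising on a placeholder/image count mismatch is a design choice in this sketch: it surfaces pairing bugs early instead of silently dropping or misaligning screenshots.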
Our Setup (For Previous ~20% Result)
- Mode: MultiAgent only (we did not run SingleAgent for these results).
- Emulator: macOS, GUI (Pixel 6, API 33), gRPC 8554 enabled; adb via Platform‑Tools.
- Steps: ANDROID_MAX_STEP=70.
- Model/provider: MOBILE_USE_VLM_MODEL="qwen/qwen2.5-vl-72b-instruct" via OpenRouter; sometimes pinned OPENROUTER_PROVIDER_ORDER="nebius/fp8", OPENROUTER_ALLOW_FALLBACKS=false.
- AndroidWorld runs:
- First time: python android_world/run.py --suite_family=android_world --agent_name=mobile_use --console_port=5554 --perform_emulator_setup
- Full run: python android_world/run.py --suite_family=android_world --agent_name=mobile_use --console_port=5554
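To make the provider pinning concrete, here is a rough sketch of how the environment variables above could translate into an OpenRouter chat request body. The `provider.order` and `provider.allow_fallbacks` fields follow OpenRouter's provider routing options as we understand them; the function name and the exact way the variables are read are assumptions for illustration, not MobileUse's actual code.

```python
import os


def build_openrouter_request(messages, env=None):
    """Build a chat request body from the env vars listed above (illustrative only)."""
    env = os.environ if env is None else env
    payload = {
        "model": env.get("MOBILE_USE_VLM_MODEL", "qwen/qwen2.5-vl-72b-instruct"),
        "messages": messages,
    }
    order = env.get("OPENROUTER_PROVIDER_ORDER", "")
    if order:
        # Assumed format: a comma-separated list of provider slugs.
        providers = [name.strip() for name in order.split(",") if name.strip()]
        allow = env.get("OPENROUTER_ALLOW_FALLBACKS", "true").strip().lower()
        payload["provider"] = {
            "order": providers,
            "allow_fallbacks": allow == "true",
        }
    return payload
```

When the provider order is unset, the sketch omits the `provider` field entirely so that default routing applies, which matches our "sometimes pinned" usage.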
Key Questions (Exact Recipe Request)
- Agent mode + modules:
- SingleAgent vs MultiAgent? Which sub‑modules (planner, reflector, long_reflector, note_taker, processor, evaluator, evolutor)? Any special settings (num_histories, num_latest_screenshots, reflect_on_demand/logprobs)?
- Prompts/templates:
- The precise system/user templates you used (post‑interleaving), including any stop sequences or output constraints.
- Model/provider params:
- Which model/version and provider(s), temperature/top_p/max_tokens/tool‑calling and any routing/fallback settings.
- Vision + coordinates:
- Image resize policy and the exact size→raw_size mapping for coordinate conversion.
- Step budgets and termination:
- Global/per‑task step caps and any early termination rules.
- Environment:
- Emulator/ADB flags (gRPC, timeouts), any retry/recovery practices during long runs (e.g., ADB daemon settings).
- Code reference:
- Commit/branch you used when reporting >70% on AndroidWorld.
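To clarify the coordinate question above, this is the kind of size→raw_size conversion we have in mind. It is only a sketch assuming plain linear scaling with clamping; the function name and the example sizes are hypothetical, and your pipeline may resize or round differently, which is exactly what we would like to confirm.

```python
def to_raw_coords(x, y, size, raw_size):
    """Map a point predicted on the resized image back to raw screen pixels."""
    width, height = size
    raw_width, raw_height = raw_size
    if width <= 0 or height <= 0:
        raise ValueError("size must be positive")
    # Linear scaling per axis, rounded to the nearest pixel.
    raw_x = round(x * raw_width / width)
    raw_y = round(y * raw_height / height)
    # Clamp so taps never fall outside the screen.
    raw_x = min(max(raw_x, 0), raw_width - 1)
    raw_y = min(max(raw_y, 0), raw_height - 1)
    return raw_x, raw_y


# Example: a tap at the center of a 540x1200 resized screenshot
# on a 1080x2400 raw screen (illustrative sizes).
center = to_raw_coords(270, 600, (540, 1200), (1080, 2400))
```

If the model instead emits normalized or patch‑aligned coordinates, the mapping would differ, so the exact resize policy matters for tap accuracy.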
Attachments We Can Provide (On Request)
- VLM request/response traces (sanitized) and full run logs.
- Per‑task outputs from ~/android_world/runs/run_.
Thank you for the great work. With your exact run recipe, we’ll re‑run the full suite and
report back to help close the gap.