We see ~20% success (MultiAgent) on our AndroidWorld run; adopted interleaved prompts; requesting the full recipe to match your >70% on AndroidWorld.

- We’re consistently getting ≈20% success on AndroidWorld (episodes), running MobileUse in MultiAgent mode only.
- Please share the exact, reproducible recipe you used to achieve >70% on AndroidWorld so we
  can match it (agent mode/config, prompts, model/provider params, vision/coords, step budgets,
  environment, and commit/branch).

- We noticed an earlier issue recommending interleaving images at IMAGE_PLACEHOLDER positions (via a generate_message helper) to avoid <|image_pad|> token errors. Although that fix targeted a different report, we implemented the same pattern here to remove token‑related failures and keep image–text pairing precise.
- Note that our ~20% full‑suite results with MultiAgent predate this change. We have not yet re‑run the full suite with interleaving applied because we’d like to align with your exact recipe first.

**What We Changed (Prompt Construction)**

  - Interleaved image/text in all sub‑agents (Planner, Operator, Reflector, LongReflector, NoteTaker, Evaluator) using a generate_message helper:
      - Splits the prompt on IMAGE_PLACEHOLDER and inserts data‑URI images exactly at each placeholder.
  - Removed our wrapper‑side message/output mutations (no token scrubbing; no output normalization), so prompts/outputs flow unchanged.
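
For reference, the interleaving step described above can be sketched roughly as follows. This is a minimal illustration, not the upstream helper: the placeholder literal, the function signature, and the OpenAI‑style `image_url` content parts are assumptions on our side.

```python
import base64
from typing import Dict, List

# Assumed literal for illustration only; the real IMAGE_PLACEHOLDER constant may differ.
IMAGE_PLACEHOLDER = "<|image_placeholder|>"


def generate_message(role: str, prompt: str, images: List[bytes],
                     mime: str = "image/png") -> Dict:
    """Split the prompt on IMAGE_PLACEHOLDER and put one data-URI image at each split."""
    parts = prompt.split(IMAGE_PLACEHOLDER)
    if len(parts) - 1 != len(images):
        raise ValueError(
            f"{len(parts) - 1} placeholders but {len(images)} images supplied"
        )
    content = []
    for i, text in enumerate(parts):
        if text:  # skip empty text fragments around adjacent placeholders
            content.append({"type": "text", "text": text})
        if i < len(images):
            b64 = base64.b64encode(images[i]).decode("ascii")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            })
    return {"role": role, "content": content}
```

The count check fails fast when the number of placeholders and screenshots disagree, which is the mismatch we wanted to rule out as a source of token errors.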

**Our Setup (For Previous ~20% Result)**

  - Mode: MultiAgent only (we did not run SingleAgent for these results).
  - Emulator: macOS, GUI (Pixel 6, API 33), gRPC 8554 enabled; adb via Platform‑Tools.
  - Steps: ANDROID_MAX_STEP=70.
  - Model/provider: MOBILE_USE_VLM_MODEL="qwen/qwen2.5-vl-72b-instruct" via OpenRouter; sometimes pinned with OPENROUTER_PROVIDER_ORDER="nebius/fp8" and OPENROUTER_ALLOW_FALLBACKS=false.
  - AndroidWorld runs:
      - First time: python android_world/run.py --suite_family=android_world --agent_name=mobile_use --console_port=5554 --perform_emulator_setup
      - Full run: python android_world/run.py --suite_family=android_world --agent_name=mobile_use --console_port=5554
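
To make the provider pinning concrete, here is a rough sketch of how those environment variables could translate into an OpenRouter chat‑completions request body. The `provider.order` and `provider.allow_fallbacks` fields are OpenRouter's provider‑routing options; the function name and the way the variables are read are hypothetical and only approximate our wrapper.

```python
from typing import Dict, List, Mapping


def build_openrouter_payload(env: Mapping[str, str], messages: List[dict]) -> Dict:
    """Build a request body from env-style settings (hypothetical helper)."""
    payload: Dict = {"model": env["MOBILE_USE_VLM_MODEL"], "messages": messages}

    # Comma-separated provider list; empty or missing means no pinning.
    order = [p.strip() for p in env.get("OPENROUTER_PROVIDER_ORDER", "").split(",")
             if p.strip()]
    if order:
        # Assumption: anything other than the literal "false" leaves fallbacks enabled.
        allow = env.get("OPENROUTER_ALLOW_FALLBACKS", "true").strip().lower() != "false"
        payload["provider"] = {"order": order, "allow_fallbacks": allow}
    return payload
```

Passing `os.environ` as `env` reproduces the pinned configuration; leaving OPENROUTER_PROVIDER_ORDER unset yields an unpinned request, which is the other condition some of our runs used.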

**Key Questions (Exact Recipe Request)**

  - Agent mode + modules:
      - SingleAgent vs MultiAgent? Which sub‑modules (planner, reflector, long_reflector, note_taker, processor, evaluator, evolutor)? Any special settings (num_histories, num_latest_screenshots, reflect_on_demand/logprobs)?
  - Prompts/templates:
      - The precise system/user templates you used (post‑interleaving), including any stop sequences or output constraints.
  - Model/provider params:
      - Which model/version and provider(s); temperature, top_p, max_tokens, and tool‑calling settings; and any routing/fallback settings.
  - Vision + coordinates:
      - Image resize policy and the exact size→raw_size mapping for coordinate conversion.
  - Step budgets and termination:
      - Global/per‑task step caps and any early termination rules.
  - Environment:
      - Emulator/ADB flags (gRPC, timeouts), any retry/recovery practices during long runs (e.g., ADB daemon settings).
  - Code reference:
      - Commit/branch you used when reporting >70% on AndroidWorld.
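
To make the coordinate question concrete, this is a minimal sketch of the kind of size→raw_size mapping we are asking about. It assumes plain proportional scaling with no padding or letterboxing; the function name and the example resolutions are hypothetical, so please correct us if your pipeline differs.

```python
from typing import Tuple


def to_raw_coords(x: float, y: float,
                  size: Tuple[int, int],
                  raw_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point predicted on the resized image back to raw screen pixels."""
    w, h = size          # (width, height) of the image the model saw
    rw, rh = raw_size    # (width, height) of the original screenshot
    rx = round(x * rw / w)
    ry = round(y * rh / h)
    # Clamp so a prediction on the far edge stays inside the screen.
    rx = min(max(rx, 0), rw - 1)
    ry = min(max(ry, 0), rh - 1)
    return rx, ry


# Example: a tap at the centre of a half-resolution image.
print(to_raw_coords(270, 600, size=(540, 1200), raw_size=(1080, 2400)))
```

If the model instead emits coordinates in some normalized range, or the resize preserves aspect ratio with padding, the mapping changes, which is why we would like the exact rule.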

**Attachments We Can Provide (On Request)**

  - VLM request/response traces (sanitized) and full run logs.
  - Per‑task outputs from ~/android_world/runs/run_<timestamp>.

Thank you for the great work. With your exact run recipe, we’ll re‑run the full suite and
report back to help close the gap.

