We see ~20% (MultiAgent) scores in our AndroidWorld run; adopted interleaved prompts; requesting full recipe to match your >70% on AndroidWorld. #15

@rakshithsajjan

  • We’re consistently getting ≈20% success on AndroidWorld (episodes), running MobileUse in MultiAgent mode only.

  • Please share the exact, reproducible recipe you used to achieve >70% on AndroidWorld so we
    can match it (agent mode/config, prompts, model/provider params, vision/coords, step budgets,
    environment, and commit/branch).

  • We noticed an earlier issue recommending interleaving images at IMAGE_PLACEHOLDER positions (via a generate_message helper) to avoid <|image_pad|> token errors. Although that fix targeted a different report, we implemented the same pattern here to remove token‑related failures and to keep image–text pairing precise.

  • Note that our ~20% full‑suite MultiAgent result predates the interleaving change, so we do not yet know how much interleaving alone helps. We have not re‑run the full suite with only interleaving applied because we’d like to align to your exact recipe first.

What We Changed (Prompt Construction)

  • Interleaved image/text in all sub‑agents (Planner, Operator, Reflector, LongReflector, NoteTaker, Evaluator) using a generate_message helper:
    • Splits the prompt on IMAGE_PLACEHOLDER and inserts data‑URI images exactly at each placeholder.
  • Removed our wrapper‑side message/output mutations (no token scrubbing; no output normalization), so prompts/outputs flow unchanged.
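
For concreteness, the interleaving pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual helper: the literal value of IMAGE_PLACEHOLDER, the to_data_uri function, the PNG MIME type, and the OpenAI-style content-part layout are all assumptions.

```python
import base64

# Assumed literal; the real placeholder token in the codebase may differ.
IMAGE_PLACEHOLDER = "<IMAGE_PLACEHOLDER>"


def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def generate_message(prompt: str, images: list[bytes], role: str = "user") -> dict:
    """Split the prompt on the placeholder and put one image at each split point."""
    parts = prompt.split(IMAGE_PLACEHOLDER)
    if len(parts) - 1 != len(images):
        raise ValueError(
            f"prompt has {len(parts) - 1} placeholders but {len(images)} images were given"
        )
    content = []
    for index, text in enumerate(parts):
        if text.strip():  # skip empty text segments between adjacent images
            content.append({"type": "text", "text": text})
        if index < len(images):  # an image follows every segment except the last
            content.append(
                {"type": "image_url", "image_url": {"url": to_data_uri(images[index])}}
            )
    return {"role": role, "content": content}
```

Failing fast on a placeholder/image count mismatch keeps each screenshot paired with the text around it instead of silently shifting images to the wrong position.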

Our Setup (For Previous ~20% Result)

  • Mode: MultiAgent only (we did not run SingleAgent for these results).
  • Emulator: macOS, GUI (Pixel 6, API 33), gRPC 8554 enabled; adb via Platform‑Tools.
  • Steps: ANDROID_MAX_STEP=70.
  • Model/provider: MOBILE_USE_VLM_MODEL="qwen/qwen2.5-vl-72b-instruct" via OpenRouter; sometimes
    pinned OPENROUTER_PROVIDER_ORDER="nebius/fp8", OPENROUTER_ALLOW_FALLBACKS=false.
  • AndroidWorld runs:
    • First time: python android_world/run.py --suite_family=android_world
      --agent_name=mobile_use --console_port=5554 --perform_emulator_setup
    • Full run: python android_world/run.py --suite_family=android_world
      --agent_name=mobile_use --console_port=5554
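
As a reference point for the provider pinning above, here is one way such environment variables could be translated into an OpenRouter chat request body. The function name openrouter_payload and the comma-separated parsing are illustrative assumptions, not code from MobileUse; the provider object with order and allow_fallbacks follows OpenRouter's provider-routing request fields.

```python
import os


def openrouter_payload(messages: list[dict], env=None) -> dict:
    """Build a chat-completions body with optional provider pinning (hypothetical)."""
    env = os.environ if env is None else env
    payload = {
        "model": env.get("MOBILE_USE_VLM_MODEL", "qwen/qwen2.5-vl-72b-instruct"),
        "messages": messages,
    }
    # Assumed format: comma-separated provider slugs, e.g. "nebius/fp8".
    raw_order = env.get("OPENROUTER_PROVIDER_ORDER", "")
    order = [slug.strip() for slug in raw_order.split(",") if slug.strip()]
    if order:
        # Anything other than the literal "false" leaves fallbacks enabled.
        allow = env.get("OPENROUTER_ALLOW_FALLBACKS", "true").strip().lower() != "false"
        payload["provider"] = {"order": order, "allow_fallbacks": allow}
    return payload
```

When no provider order is set, the provider key is omitted, so requests use OpenRouter's default routing. That difference between pinned and unpinned runs is one reason the routing settings are part of the recipe request.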

Key Questions (Exact Recipe Request)

  • Agent mode + modules:
    • SingleAgent vs MultiAgent? Which sub‑modules (planner, reflector, long_reflector, note_taker, processor, evaluator, evolutor)? Any special settings (num_histories, num_latest_screenshots, reflect_on_demand/logprobs)?
  • Prompts/templates:
    • The precise system/user templates you used (post‑interleaving), including any stop sequences or output constraints.
  • Model/provider params:
    • Which model/version and provider(s), temperature/top_p/max_tokens/tool‑calling and any routing/fallback settings.
  • Vision + coordinates:
    • Image resize policy and the exact size→raw_size mapping for coordinate conversion.
  • Step budgets and termination:
    • Global/per‑task step caps and any early termination rules.
  • Environment:
    • Emulator/ADB flags (gRPC, timeouts), any retry/recovery practices during long runs (e.g.,
      ADB daemon settings).
  • Code reference:
    • Commit/branch you used when reporting >70% on AndroidWorld.
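
To make the coordinate question concrete, the simplest mapping one might assume is a proportional rescale from the resized image (size) back to the device screenshot (raw_size). The sketch below is only that assumption written out; the function name to_raw_coords, the rounding, and the clamping are illustrative, and the recipe behind the >70% result may differ.

```python
def to_raw_coords(
    x: float,
    y: float,
    size: tuple[int, int],
    raw_size: tuple[int, int],
) -> tuple[int, int]:
    """Map a point predicted on the resized image back to raw screen pixels.

    size is (width, height) of the image sent to the model; raw_size is
    (width, height) of the original screenshot. Hypothetical helper.
    """
    width, height = size
    raw_width, raw_height = raw_size
    if width <= 0 or height <= 0:
        raise ValueError("size must have positive width and height")
    raw_x = round(x * raw_width / width)
    raw_y = round(y * raw_height / height)
    # Clamp so a prediction on the image border still lands on a valid pixel.
    raw_x = min(max(raw_x, 0), raw_width - 1)
    raw_y = min(max(raw_y, 0), raw_height - 1)
    return raw_x, raw_y
```

For example, with a 1080x2400 screenshot resized to 540x1200, to_raw_coords(270, 600, (540, 1200), (1080, 2400)) maps the image center to (540, 1200). If the reference setup instead has the model emit normalized or absolute raw coordinates, this step would change, which is why the exact mapping matters.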

Attachments We Can Provide (On Request)

  • VLM request/response traces (sanitized) and full run logs.
  • Per‑task outputs from ~/android_world/runs/run_.

Thank you for the great work. With your exact run recipe, we’ll re‑run the full suite and
report back to help close the gap.
