MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong*,1, Xu Zhang*,1, Zhenyu Yang2, Nolan Gao3, Chen Liu1, Panrong Tong1, Chenglin Cai1, Hanzhang Zhou1, Jianan Zhang1, Liangyu Chen1, Zhidan Liu2, Steven Hoi1, Yue Wang1,✉

1Tongyi Lab, Alibaba Group    2HKUST (GZ)    3University of Florida
*Equal contribution    ✉Corresponding author

Leaderboard

Success rate (%) on MobileWorld. We report the overall SR and a breakdown by task category: GUI-Only tasks (116), agent-user interaction tasks (45), and MCP-augmented tasks (40).

| Category | Model | Max Steps | Overall | GUI-Only | User-Int. | MCP |
| --- | --- | --- | --- | --- | --- | --- |

Overview

MobileWorld is a mobile-use benchmark designed to better reflect real-world mobile usage and to be substantially more challenging than existing benchmarks. It comprises 201 tasks across 20 applications, featuring long-horizon cross-app tasks and two novel task categories: agent-user interaction and MCP-augmented tasks.

The difficulty of MobileWorld is twofold:

Long-horizon, cross-application tasks. MobileWorld tasks require 27.8 completion steps on average, nearly twice the 14.3 steps required in AndroidWorld. Moreover, 62.2% of tasks involve cross-application workflows, compared to only 9.5% in AndroidWorld.

Novel task categories. MobileWorld extends beyond standard GUI manipulation by introducing (1) agent-user interaction tasks (22.4%), which evaluate an agent's ability to handle ambiguous instructions through collaborative dialogue, and (2) MCP-augmented tasks (19.9%), which require hybrid use of GUI navigation and external tool invocation via the Model Context Protocol.
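
To make the hybrid action space concrete, here is a minimal, self-contained Python sketch. The GuiAction / McpToolCall names and the dispatch logic are illustrative assumptions, not the benchmark's actual interfaces.

from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """A primitive GUI action executed on the Android emulator (hypothetical)."""
    name: str                  # e.g. "tap", "swipe", "type"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class McpToolCall:
    """An external tool invocation routed through an MCP server (hypothetical)."""
    server: str                # MCP server exposing the tool
    tool: str                  # tool name, e.g. "query_route"
    arguments: dict = field(default_factory=dict)

Action = Union[GuiAction, McpToolCall]

def execute(action: Action) -> str:
    """Dispatch a hybrid action: GUI actions go to the emulator, MCP calls to the tool server."""
    if isinstance(action, GuiAction):
        return f"emulator <- {action.name}(x={action.x}, y={action.y}, text={action.text!r})"
    return f"mcp <- {action.server}.{action.tool}({action.arguments})"

# Example: one GUI step followed by one MCP tool call in the same task.
print(execute(GuiAction("tap", x=540, y=1200)))
print(execute(McpToolCall("maps", "query_route", {"to": "airport"})))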

Benchmark Comparison

Comparison of online mobile GUI agent benchmarks. MobileWorld uniquely enables deterministic evaluation even for applications that require backends (e.g., messaging) by using self-hosted environments. We also introduce two novel task paradigms: agent-user interaction and Model Context Protocol (MCP) augmentation.

| Benchmark | #Apps | #Tasks | Agent-User Int. | MCP-Aug. | Backend-Req. Apps | Deterministic Eval. |
| --- | --- | --- | --- | --- | --- | --- |
| AndroidArena | 16 | 221 | ✗ | ✗ | | |
| A3 | 20 | 201 | ✗ | ✗ | | |
| Pro-Bench | 34 | 200 | ✗ | ✗ | | |
| AndroidDaily | 48 | 235 | ✗ | ✗ | | |
| SPA-Bench | 66 | 340 | ✗ | ✗ | | |
| MobileAgentBench | 10 | 100 | ✗ | ✗ | | |
| AndroidLab | 9 | 138 | ✗ | ✗ | | |
| AndroidWorld | 20 | 116 | ✗ | ✗ | | |
| MobileWorld (Ours) | 20 | 201 | ✓ | ✓ | ✓ | ✓ |

Data Statistics

We summarize key statistics of MobileWorld, highlighting four core dimensions:

  1. Diverse domains: Tasks span communication, productivity, and navigation. Approximately 95% of tasks involve third-party applications, aligning with authentic mobile usage patterns and ensuring real-world relevance.
  2. Cross-app complexity: 62.2% of tasks require coordination between multiple apps, and 12.4% involve three or more.
  3. Novel task categories: Agent-user interaction tasks and MCP-augmented tasks account for 42.3% of the dataset. These tasks go beyond pure GUI navigation by requiring dynamic user engagement and external tool use.
  4. Rigorous verification: Task success is assessed by multiple verification methods; 47.3% of tasks rely on self-hosted database verification and 36.8% use local storage inspection, highlighting the importance of apps with fully controlled backends (see the sketch below).
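
Item 4's database verification can be illustrated with a short, hedged Python sketch (not MobileWorld's actual evaluator code): it checks a self-hosted messaging backend for the message a task asked the agent to send. The SQLite path and the messages table schema are hypothetical.

import sqlite3

def verify_message_sent(db_path: str, chat_id: str, expected_text: str) -> bool:
    """Return True iff the self-hosted backend recorded the message the task
    asked the agent to send. Table and column names are illustrative."""
    with sqlite3.connect(db_path) as conn:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM messages WHERE chat_id = ? AND body = ?",
            (chat_id, expected_text),
        ).fetchone()
    return count > 0
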
Figure: Task distribution across scenarios.

System Architecture

Figure: System architecture of MobileWorld.
The system architecture of MobileWorld consists of two main components. Left: the host machine, where GUI agents receive task instructions, optionally interact with users for clarification, and choose between GUI actions and MCP tool calls to complete tasks. Right: the Docker environment, an isolated Android ecosystem with emulators, self-hosted app backends, and an evaluator that verifies task completion through text matching, backend databases, local storage, and app callbacks.
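
The host-side control flow can be summarized with the Python sketch below. Here agent, user, and env are hypothetical stand-ins for the components in the figure, and the step.kind values are illustrative, not MobileWorld's real API.

def run_episode(agent, user, env, instruction: str, max_steps: int = 50) -> bool:
    """One evaluation episode: the agent may clarify with the (simulated) user,
    act on the GUI, or call MCP tools until it declares completion."""
    observation = env.reset(instruction)
    for _ in range(max_steps):
        step = agent.act(instruction, observation)
        if step.kind == "done":              # agent declares the task finished
            break
        if step.kind == "ask_user":          # clarify an ambiguous instruction
            observation = user.reply(step.question)
        elif step.kind == "mcp_call":        # invoke an external tool via MCP
            observation = env.call_tool(step.tool, step.arguments)
        else:                                # ordinary GUI action on the emulator
            observation = env.execute(step.gui_action)
    # Evaluator checks: text matching, backend DB, local storage, app callbacks.
    return env.evaluate()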

Key Findings

Our evaluation reveals a sharp performance drop compared to AndroidWorld, where the best agents achieve success rates exceeding 90%. On MobileWorld:

The top-performing agentic framework (GPT-5 + UI-Ins-7B) reaches only a 51.7% overall success rate, leaving substantial room for improvement.

End-to-end models show a stark capability collapse on the novel task categories: most baseline models score below 10% on agent-user interaction tasks and near 0% on MCP-augmented tasks.

Our error analysis identifies five open research challenges: (1) user ambiguity detection and clarification engagement, (2) MCP context management, (3) long-term memory and state checking, (4) complex logical reasoning, and (5) spatial-temporal context awareness.

Getting Started

Requirements: Docker with privileged container support, KVM for Android emulator acceleration, Python 3.12+, and a Linux host system (Windows via WSL2 with KVM is also supported).

Installation:

git clone https://github.com/Tongyi-MAI/MobileWorld.git
cd MobileWorld
uv sync

Check environment and pull Docker image:

sudo mw env check

Launch Docker containers:

sudo mw env run --count 5 --launch-interval 20

Run evaluation:

sudo mw eval \
    --agent_type qwen3vl \
    --task ALL \
    --max_round 50 \
    --model_name Qwen3-VL-235B-A22B \
    --llm_base_url [openai_compatible_url] \
    --enable_mcp
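
Before a long evaluation run, it can help to smoke-test the OpenAI-compatible endpoint passed via --llm_base_url. The snippet below is our suggestion rather than part of the mw CLI, and the environment variable names are placeholders.

import os
from openai import OpenAI

# Placeholders: point these at the same endpoint/model you pass to `mw eval`.
client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ.get("LLM_API_KEY", "EMPTY"),
)
reply = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(reply.choices[0].message.content)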

Citation

If you find MobileWorld useful in your research, please cite our paper:

@misc{kong2025mobileworldbenchmarkingautonomousmobile,
      title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments}, 
      author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
      year={2025},
      eprint={2512.19432},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.19432}, 
}

Acknowledgements

We thank AndroidWorld and AndroidLab for their open-source contributions.