MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong*,1, Xu Zhang*,1, Zhenyu Yang2, Nolan Gao3, Chen Liu1, Panrong Tong1, Chenglin Cai1, Hanzhang Zhou1, Jianan Zhang1, Liangyu Chen1, Zhidan Liu2, Steven Hoi1, Yue Wang1,✉
1Tongyi Lab, Alibaba Group 2HKUST (GZ) 3University of Florida
*Equal contribution

Leaderboard

Success rate (%) on MobileWorld. We report overall SR and a breakdown by task category: GUI-only tasks (116), agent-user interaction tasks (45), and MCP-augmented tasks (40). All results are evaluated with max steps = 50. Please contact us to submit your results.

| Rank | Category | Model | Overall | GUI-Only | User Interaction | MCP-Augmented |
|------|----------|-------|---------|----------|------------------|---------------|

Overview

MobileWorld is a substantially more challenging mobile-use benchmark designed to better reflect real-world mobile usage. It comprises 201 tasks across 20 applications, featuring long-horizon cross-app tasks and novel task categories, including agent-user interaction and MCP-augmented tasks.

The difficulty of MobileWorld is twofold:

Long-horizon, cross-application tasks. MobileWorld tasks require 27.8 steps on average to complete, nearly twice the 14.3 steps required in AndroidWorld. Moreover, 62.2% of tasks involve cross-application workflows, compared to only 9.5% in AndroidWorld.

Novel task categories. MobileWorld extends beyond standard GUI manipulation by introducing (1) agent-user interaction tasks (22.4%) that evaluate an agent's ability to handle ambiguous instructions through collaborative dialogue, and (2) MCP-augmented tasks (19.9%) that require hybrid use of GUI navigation and external tool invocation via the Model Context Protocol (MCP).

Figure: Comparison between AndroidWorld and MobileWorld. MobileWorld features harder tasks with more steps, more cross-app workflows, and achieves lower SOTA accuracy, demonstrating its increased difficulty.

Benchmark Comparison

Comparison of online mobile GUI agent benchmarks. MobileWorld uniquely enables deterministic evaluation even for applications requiring backends (e.g., messaging) by utilizing self-hosted environments. We also introduce novel task paradigms: agent-user interaction and Model Context Protocol (MCP) augmentation.

| Benchmark | #Apps | #Tasks | Agent-User Int. | MCP-Aug. | Backend-Req. Apps | Deterministic Eval. |
|---|---|---|---|---|---|---|
| AndroidArena | 16 | 221 | | | | |
| A3 | 20 | 201 | | | | |
| Pro-Bench | 34 | 200 | | | | |
| AndroidDaily | 48 | 235 | | | | |
| SPA-Bench | 66 | 340 | | | | |
| MobileAgentBench | 10 | 100 | | | | |
| AndroidLab | 9 | 138 | | | | |
| AndroidWorld | 20 | 116 | | | | |
| MobileWorld (Ours) | 20 | 201 | ✓ | ✓ | ✓ | ✓ |

Task Examples

Beyond traditional GUI-only tasks, MobileWorld includes agent-user interaction tasks and MCP-augmented tasks, each with distinct deterministic evaluation strategies.

Left: An example of an agent-user interaction task, in which the agent must proactively request clarification from a simulated user when it encounters incomplete information. A GPT-4.1-based simulated user agent is then triggered to supply the requested information, which is embedded in its system prompt. Task completion is verified through the application's callback cache.
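To make this mechanism concrete, here is a minimal sketch of such a simulated user, assuming an OpenAI-compatible client; the hidden profile text, prompt wording, and function names are illustrative placeholders, not MobileWorld's actual implementation.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical example of the task-specific details withheld from the agent;
# in MobileWorld, this information is embedded in the simulated user's
# system prompt.
HIDDEN_PROFILE = "The meeting is on Friday at 3 pm in Room 204."

SYSTEM_PROMPT = (
    "You are simulating a mobile phone user. Answer the agent's questions "
    "using ONLY the information below, and volunteer nothing else.\n\n"
    f"User information:\n{HIDDEN_PROFILE}"
)

def simulated_user_reply(agent_question: str) -> str:
    """Return the simulated user's answer to a clarification request."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": agent_question},
        ],
    )
    return response.choices[0].message.content

# Triggered whenever the agent emits a clarification action:
print(simulated_user_reply("When and where is the meeting I should add?"))
```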

Right: An example of an MCP-augmented task, where the agent is initialized with a list of GitHub MCP tools and must select the appropriate tool to retrieve README content from a GitHub repository before completing the task via GUI operations. Task completion is verified through backend database inspection.
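The MCP side of this flow can be sketched with the reference `mcp` Python SDK, as below; the server command, access-token variable, and the `get_file_contents` tool follow the public GitHub MCP server, but MobileWorld's actual tool list and wiring may differ.

```python
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumption: the public github-mcp-server binary, run in stdio mode.
server = StdioServerParameters(
    command="github-mcp-server",
    args=["stdio"],
    env={"GITHUB_PERSONAL_ACCESS_TOKEN": os.environ["GITHUB_TOKEN"]},
)

async def fetch_readme(owner: str, repo: str) -> str:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The agent is initialized with this tool list and must select
            # the appropriate tool on its own.
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "get_file_contents",
                {"owner": owner, "repo": repo, "path": "README.md"},
            )
            return result.content[0].text  # first text block of the result

print(asyncio.run(fetch_readme("octocat", "Hello-World")))
```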

Figure: Examples of agent-user interaction tasks (left) and MCP-augmented tasks (right), demonstrating the novel task categories in MobileWorld beyond traditional GUI manipulation.

Data Statistics

We summarize key statistics of MobileWorld, highlighting four core dimensions:

  1. Diverse domains: Tasks span communication, productivity, and navigation. Approximately 95% of tasks involve third-party applications, aligning with authentic mobile usage patterns and ensuring real-world relevance.
  2. Cross-app complexity: 62.2% of tasks require coordination between multiple apps, and 12.4% involve three or more apps.
  3. Novel task categories: Agent-user interaction tasks and MCP-augmented tasks account for 42.3% of the dataset. These tasks go beyond pure GUI navigation by requiring dynamic user engagement and external tool use.
  4. Rigorous verification: Multiple verification methods assess task success: 47.3% of tasks rely on self-hosted database verification and 36.8% use local storage inspection, highlighting the importance of apps with fully controlled backends (see the sketch after this list).
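Below is a minimal sketch of the backend-database check, assuming a self-hosted app whose state lands in a SQLite file; the database path, table, and column names are hypothetical, and each MobileWorld application defines its own schema and query.

```python
import sqlite3

def verify_backend_db(db_path: str, expected_title: str) -> bool:
    """Deterministically check task success against a self-hosted backend.

    Hypothetical schema: an `events` table that the target app writes to.
    """
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM events WHERE title = ?",
            (expected_title,),
        ).fetchone()
        return count > 0
    finally:
        conn.close()

if __name__ == "__main__":
    # Example: the task asked the agent to create an event titled "Team sync".
    print(verify_backend_db("calendar.db", "Team sync"))
```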
Figure: Task distribution across scenarios.

System Architecture

The system architecture of MobileWorld consists of two main components. Left: the host machine, where GUI agents receive task instructions, optionally interact with users for clarification, and then choose between GUI actions and MCP tool calls to complete tasks. Right: the Docker environment, which contains an isolated Android ecosystem with emulators, self-hosted app backends, and an evaluator that verifies task completion through text matching, backend database inspection, local storage inspection, and app callbacks.
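The control loop below condenses this architecture into code; the `Action` type, the helper objects, and their methods are hypothetical stand-ins rather than MobileWorld's actual API, but the branching mirrors the description above.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "gui" (tap/swipe/type), "mcp" (tool call), or "ask_user"
    payload: dict

def run_episode(agent, env, mcp_session, user_sim, max_steps: int = 50) -> bool:
    """Drive one task: the agent chooses among GUI operations, MCP tool
    calls, and clarification questions until the evaluator reports success."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.step(obs)
        if action.kind == "gui":
            obs = env.execute(action.payload)              # act on the emulator
        elif action.kind == "mcp":
            obs = mcp_session.call_tool(**action.payload)  # external tool call
        elif action.kind == "ask_user":
            obs = user_sim.reply(action.payload["question"])
        # Evaluator: text matching, backend database, local storage, callbacks.
        if env.evaluate():
            return True
    return False
```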

Figure: System architecture of MobileWorld.

Acknowledgements

We thank AndroidWorld and AndroidLab for their open-source contributions.