- Diverse domains: Tasks span communication, productivity, and navigation. Approximately 95% of tasks involve third-party applications, aligning with authentic mobile usage patterns and ensuring real-world relevance.
- Cross-app complexity: 62.2% of tasks require coordination between multiple apps, and 12.4% involve three or more apps.
- Novel task categories: Agent-user interaction tasks and MCP-augmented tasks account for 42.3% of the dataset. These tasks go beyond pure GUI navigation by requiring dynamic user engagement and external tool use.
- Rigorous verification: Multiple verification methods assess task success: 47.3% of tasks rely on self-hosted database verification, and 36.8% use local storage inspection, highlighting the importance of apps with fully controlled backends.
Leaderboard
Success rate (%) on MobileWorld. We report the overall SR and a breakdown by task category: GUI-only tasks (116), agent-user interaction tasks (45), and MCP-augmented tasks (40). All results are evaluated with max steps = 50. Please contact us to submit your results.
| Rank | Category | Model | Overall | GUI-Only | User Interaction | MCP-Augmented |
|---|---|---|---|---|---|---|
Overview
MobileWorld is a substantially more challenging mobile-use benchmark designed to better reflect real-world mobile usage. It comprises 201 tasks across 20 applications, featuring long-horizon cross-app tasks and novel task categories, including agent-user interaction and MCP-augmented tasks.
The difficulty of MobileWorld is twofold:
Long-horizon, cross-application tasks. MobileWorld tasks require 27.8 completion steps on average, nearly twice the 14.3 steps required in AndroidWorld. Moreover, 62.2% of tasks involve cross-application workflows, compared to only 9.5% in AndroidWorld.
Novel task categories. MobileWorld extends beyond standard GUI manipulation by introducing (1) agent-user interaction tasks (22.4%) that evaluate an agent's ability to handle ambiguous instructions through collaborative dialogue, and (2) MCP-augmented tasks (19.9%) that require combining GUI navigation with external tool invocations via the Model Context Protocol (MCP).
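To make the hybrid action space concrete, here is a minimal sketch of how an agent's output might mix screen-level GUI actions with MCP tool calls. All names and the JSON message shape are hypothetical illustrations, not MobileWorld's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Union

@dataclass
class GuiAction:
    """A screen-level action, e.g. tap/type/scroll."""
    kind: str    # "tap", "type", "scroll", ...
    target: str  # UI element description or coordinates

@dataclass
class McpToolCall:
    """An external tool invocation via the Model Context Protocol."""
    tool: str       # e.g. "github.get_readme" (hypothetical tool name)
    arguments: dict

Action = Union[GuiAction, McpToolCall]

def serialize(action: Action) -> str:
    """Encode either action type as one JSON message for the environment,
    tagged so the executor can route it to the emulator or the MCP client."""
    if isinstance(action, GuiAction):
        return json.dumps({"type": "gui", "kind": action.kind, "target": action.target})
    return json.dumps({"type": "mcp", "tool": action.tool, "arguments": action.arguments})
```

A single tagged message format like this lets one agent policy emit either action type at every step, which is what makes the MCP-augmented tasks a strict superset of GUI-only tasks.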
Benchmark Comparison
Comparison of online Mobile GUI Agent Benchmarks. MobileWorld uniquely enables deterministic evaluation even for applications requiring backends (e.g., messaging) by utilizing self-hosted environments. We also introduce novel task paradigms: agent-user interaction and Model Context Protocol (MCP) augmentation.
| Benchmark | #Apps | #Tasks | Agent-User Int. | MCP-Aug. | Backend-Req. Apps | Deterministic Eval. |
|---|---|---|---|---|---|---|
| AndroidArena | 16 | 221 | ✗ | ✗ | ✓ | ✗ |
| A3 | 20 | 201 | ✗ | ✗ | ✓ | ✗ |
| Pro-Bench | 34 | 200 | ✗ | ✗ | ✓ | ✗ |
| AndroidDaily | 48 | 235 | ✗ | ✗ | ✓ | ✗ |
| SPA-Bench | 66 | 340 | ✗ | ✗ | ✓ | ✗ |
| MobileAgentBench | 10 | 100 | ✗ | ✗ | ✗ | ✓ |
| AndroidLab | 9 | 138 | ✗ | ✗ | ✗ | ✓ |
| AndroidWorld | 20 | 116 | ✗ | ✗ | ✗ | ✓ |
| MobileWorld (Ours) | 20 | 201 | ✓ | ✓ | ✓ | ✓ |
Task Examples
Beyond traditional GUI-only tasks, MobileWorld includes agent-user interaction tasks and MCP-augmented tasks, each with distinct deterministic evaluation strategies.
Left: An example of an agent-user interaction task, in which the agent must proactively request clarification from a simulated user when it encounters incomplete information. A GPT-4.1-based simulated user agent then supplies the requested information, which is embedded in its system prompt. Task completion is verified through the application's callback cache.
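The clarification loop can be illustrated with a toy rule-based stand-in for the simulated user. The benchmark itself uses a GPT-4.1-based agent whose system prompt carries the hidden task details; the function and field names below are purely illustrative.

```python
def simulated_user_reply(question: str, hidden_info: dict) -> str:
    """Toy stand-in for the LLM-based simulated user: answer only when the
    agent's clarification question mentions a field the user knows about."""
    for field, value in hidden_info.items():
        if field.lower() in question.lower():
            return f"My {field} is {value}."
    return "Sorry, I don't have that information."
```

The key property mirrored here is that the information is only released when the agent asks a relevant question, so agents that never seek clarification cannot complete the task.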
Right: An example of an MCP-augmented task, where the agent is initialized with a list of GitHub MCP tools and selects the appropriate tool to retrieve README content from a GitHub repository before completing the task via GUI operations. Task completion is verified through backend database inspection.
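Backend database verification of this kind might look like the following sketch. The table name, columns, and use of SQLite are illustrative assumptions, not MobileWorld's actual backend schema.

```python
import sqlite3

def verify_note_created(db_path: str, expected_title: str, expected_body: str) -> bool:
    """Check the self-hosted app backend for the record the task should have produced."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT body FROM notes WHERE title = ?", (expected_title,)
        ).fetchone()
    finally:
        conn.close()
    return row is not None and row[0] == expected_body
```

Because the app backend is self-hosted, the verifier can query ground-truth state directly instead of parsing screenshots, which is what makes the evaluation deterministic.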
Data Statistics
We summarize key statistics of MobileWorld, highlighting four core dimensions: diverse domains, cross-app complexity, novel task categories, and rigorous verification.
System Architecture
The system architecture of MobileWorld consists of two main components. Left: the host machine is where GUI agents receive task instructions and optionally interact with users for clarification, then choose between GUI actions or MCP tool calls to complete tasks. Right: the Docker environment contains an isolated Android ecosystem with emulators, self-hosted app backends, and an evaluator that verifies task completion through text matching, backend database inspection, local storage inspection, and app callbacks.
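The evaluator's dispatch over the four verification channels named above can be sketched as follows. The channel names, context dictionary, and checker logic are hypothetical simplifications; the real evaluator's interface is not described in this text.

```python
from typing import Callable, Dict

# One checker per verification channel; each takes an evaluation context
# (gathered from the environment after the episode) and returns True on success.
CHECKERS: Dict[str, Callable[[dict], bool]] = {
    "text_match":    lambda ctx: ctx["expected"] in ctx.get("screen_text", ""),
    "database":      lambda ctx: ctx.get("db_row_found", False),
    "local_storage": lambda ctx: ctx.get("file_contents") == ctx.get("expected"),
    "app_callback":  lambda ctx: ctx.get("callback_cache", {}).get(ctx["key"]) == ctx.get("expected"),
}

def evaluate(task: dict, ctx: dict) -> bool:
    """Each task declares which channel verifies it; dispatch accordingly."""
    return CHECKERS[task["verifier"]](ctx)
```

Keeping the checkers behind a common signature means new verification channels (or per-task composite checks) can be added without touching the episode runner.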
Acknowledgements
We thank AndroidWorld and AndroidLab for their open-source contributions.