MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong*,1, Xu Zhang*,1, Zhenyu Yang2, Nolan Gao3, Chen Liu1, Panrong Tong1, Chenglin Cai1, Hanzhang Zhou1, Jianan Zhang1, Liangyu Chen1, Zhidan Liu2, Steven Hoi1, Yue Wang1,✉

1Tongyi Lab, Alibaba Group    2HKUST (GZ)    3University of Florida
*Equal contribution    ✉Corresponding author

Leaderboard

Success rate (%) on MobileWorld. We report the overall SR and a breakdown by task category: GUI-Only tasks (116), agent-user interaction tasks (45), and MCP-augmented tasks (40).

| Category | Model | Max Steps | Overall | GUI-Only | User-Int. | MCP |
| --- | --- | --- | --- | --- | --- | --- |

Overview

MobileWorld is a mobile-use benchmark designed to better reflect real-world mobile usage and to be substantially more challenging than existing benchmarks. It comprises 201 tasks across 20 applications, featuring long-horizon cross-app tasks and two novel task categories: agent-user interaction and MCP-augmented tasks.

The difficulty of MobileWorld is twofold:

Long-horizon, cross-application tasks. MobileWorld tasks require 27.8 completion steps on average, nearly twice the 14.3 steps required in AndroidWorld. Moreover, 62.2% of tasks involve cross-application workflows, compared to only 9.5% in AndroidWorld.

Novel task categories. MobileWorld extends beyond standard GUI manipulation by introducing (1) agent-user interaction tasks (22.4%), which evaluate an agent's ability to handle ambiguous instructions through collaborative dialogue, and (2) MCP-augmented tasks (19.9%), which require hybrid use of GUI navigation and external tool invocation via the Model Context Protocol.
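
To make the hybrid action space concrete, here is a minimal, self-contained Python sketch. The GuiAction / McpToolCall names and the dispatch logic are illustrative assumptions, not the benchmark's actual interfaces.

from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """A primitive GUI action executed on the Android emulator (hypothetical)."""
    name: str                  # e.g. "tap", "swipe", "type"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class McpToolCall:
    """An external tool invocation routed through an MCP server (hypothetical)."""
    server: str                # MCP server exposing the tool
    tool: str                  # tool name, e.g. "query_route"
    arguments: dict = field(default_factory=dict)

Action = Union[GuiAction, McpToolCall]

def execute(action: Action) -> str:
    """Dispatch a hybrid action: GUI actions go to the emulator, MCP calls to the tool server."""
    if isinstance(action, GuiAction):
        return f"emulator <- {action.name}(x={action.x}, y={action.y}, text={action.text!r})"
    return f"mcp <- {action.server}.{action.tool}({action.arguments})"

# Example: one GUI step followed by one MCP tool call in the same task.
print(execute(GuiAction("tap", x=540, y=1200)))
print(execute(McpToolCall("maps", "query_route", {"to": "airport"})))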

Benchmark Comparison

Comparison of online mobile GUI agent benchmarks. MobileWorld uniquely enables deterministic evaluation even for applications that require backends (e.g., messaging) by using self-hosted environments. We also introduce two novel task paradigms: agent-user interaction and Model Context Protocol (MCP) augmentation.

| Benchmark | #Apps | #Tasks | Agent-User Int. | MCP-Aug. | Backend-Req. Apps | Deterministic Eval. |
| --- | --- | --- | --- | --- | --- | --- |
| AndroidArena | 16 | 221 | ✗ | ✗ | | |
| A3 | 20 | 201 | ✗ | ✗ | | |
| Pro-Bench | 34 | 200 | ✗ | ✗ | | |
| AndroidDaily | 48 | 235 | ✗ | ✗ | | |
| SPA-Bench | 66 | 340 | ✗ | ✗ | | |
| MobileAgentBench | 10 | 100 | ✗ | ✗ | | |
| AndroidLab | 9 | 138 | ✗ | ✗ | | |
| AndroidWorld | 20 | 116 | ✗ | ✗ | | |
| MobileWorld (Ours) | 20 | 201 | ✓ | ✓ | ✓ | ✓ |

Data Statistics

We summarize key statistics of MobileWorld, highlighting four core dimensions:

  1. Diverse domains: Tasks span communication, productivity, and navigation. Approximately 95% of tasks involve third-party applications, aligning with authentic mobile usage patterns and ensuring real-world relevance.
  2. Cross-app complexity: 62.2% of tasks require coordination between multiple apps, and 12.4% involve three or more.
  3. Novel task categories: Agent-user interaction tasks and MCP-augmented tasks account for 42.3% of the dataset. These tasks go beyond pure GUI navigation by requiring dynamic user engagement and external tool use.
  4. Rigorous verification: Task success is assessed by multiple verification methods; 47.3% of tasks rely on self-hosted database verification and 36.8% use local storage inspection, highlighting the importance of apps with fully controlled backends (see the sketch below).
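
Item 4's database verification can be illustrated with a short, hedged Python sketch (not MobileWorld's actual evaluator code): it checks a self-hosted messaging backend for the message a task asked the agent to send. The SQLite path and the messages table schema are hypothetical.

import sqlite3

def verify_message_sent(db_path: str, chat_id: str, expected_text: str) -> bool:
    """Return True iff the self-hosted backend recorded the message the task
    asked the agent to send. Table and column names are illustrative."""
    with sqlite3.connect(db_path) as conn:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM messages WHERE chat_id = ? AND body = ?",
            (chat_id, expected_text),
        ).fetchone()
    return count > 0
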
Figure: Task distribution across scenarios.

System Architecture

Figure: System architecture of MobileWorld.
The system architecture of MobileWorld consists of two main components. Left: the host machine, where GUI agents receive task instructions, optionally interact with users for clarification, and choose between GUI actions and MCP tool calls to complete tasks. Right: the Docker environment, an isolated Android ecosystem with emulators, self-hosted app backends, and an evaluator that verifies task completion through text matching, backend databases, local storage, and app callbacks.
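
The host-side control flow can be summarized with the Python sketch below. Here agent, user, and env are hypothetical stand-ins for the components in the figure, and the step.kind values are illustrative, not MobileWorld's real API.

def run_episode(agent, user, env, instruction: str, max_steps: int = 50) -> bool:
    """One evaluation episode: the agent may clarify with the (simulated) user,
    act on the GUI, or call MCP tools until it declares completion."""
    observation = env.reset(instruction)
    for _ in range(max_steps):
        step = agent.act(instruction, observation)
        if step.kind == "done":              # agent declares the task finished
            break
        if step.kind == "ask_user":          # clarify an ambiguous instruction
            observation = user.reply(step.question)
        elif step.kind == "mcp_call":        # invoke an external tool via MCP
            observation = env.call_tool(step.tool, step.arguments)
        else:                                # ordinary GUI action on the emulator
            observation = env.execute(step.gui_action)
    # Evaluator checks: text matching, backend DB, local storage, app callbacks.
    return env.evaluate()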

Key Findings

Our evaluation reveals a sharp performance drop compared to AndroidWorld, where the best agents achieve success rates exceeding 90%. On MobileWorld:

The top-performing agentic framework (GPT-5 + UI-Ins-7B) reaches only a 51.7% overall success rate, leaving substantial room for improvement.

End-to-end models show a stark capability collapse on the novel task categories: most baseline models score below 10% on agent-user interaction tasks and near 0% on MCP-augmented tasks.

Our error analysis identifies five open research challenges: (1) user ambiguity detection and clarification engagement, (2) MCP context management, (3) long-term memory and state checking, (4) complex logical reasoning, and (5) spatial-temporal context awareness.

Getting Started

Requirements: Docker with privileged container support, KVM for Android emulator acceleration, Python 3.12+, and a Linux host system (Windows via WSL2 with KVM is also supported).

Installation:

git clone https://github.com/Tongyi-MAI/MobileWorld.git
cd MobileWorld
uv sync

Check environment and pull Docker image:

sudo mw env check

Launch Docker containers:

sudo mw env run --count 5 --launch-interval 20

Run evaluation:

sudo mw eval \
    --agent_type qwen3vl \
    --task ALL \
    --max_round 50 \
    --model_name Qwen3-VL-235B-A22B \
    --llm_base_url [openai_compatible_url] \
    --enable_mcp
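
Before a long evaluation run, it can help to smoke-test the OpenAI-compatible endpoint passed via --llm_base_url. The snippet below is our suggestion rather than part of the mw CLI, and the environment variable names are placeholders.

import os
from openai import OpenAI

# Placeholders: point these at the same endpoint/model you pass to `mw eval`.
client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ.get("LLM_API_KEY", "EMPTY"),
)
reply = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(reply.choices[0].message.content)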

Citation

If you find MobileWorld useful in your research, please cite our paper:

@misc{kong2025mobileworldbenchmarkingautonomousmobile,
      title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments}, 
      author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
      year={2025},
      eprint={2512.19432},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.19432}, 
}

Acknowledgements

We thank AndroidWorld and AndroidLab for their open-source contributions.