Mobile World: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
1Tongyi Lab, Alibaba Group
2HKUST (GZ)
3University of Florida
*Equal contribution
✉ yue.w@alibaba-inc.com
Leaderboard
Success rate (%) on MobileWorld. We report the overall SR and a breakdown by task category: GUI-only tasks (116), agent-user interaction tasks (45), and MCP-augmented tasks (40).
| Category | Model | Max Steps | Overall | GUI-Only | User-Int. | MCP |
|---|---|---|---|---|---|---|
Overview
MobileWorld is a mobile-use benchmark designed to be substantially more challenging than prior benchmarks and to better reflect real-world mobile usage. It comprises 201 tasks across 20 applications, featuring long-horizon cross-app tasks and novel task categories, including agent-user interaction and MCP-augmented tasks.
The difficulty of MobileWorld is twofold:
Long-horizon, cross-application tasks. MobileWorld tasks require 27.8 completion steps on average, nearly twice the 14.3 steps required in AndroidWorld. Moreover, 62.2% of tasks involve cross-application workflows, compared to only 9.5% in AndroidWorld.
Novel task categories. MobileWorld extends beyond standard GUI manipulation by introducing (1) agent-user interaction tasks (22.4%), which evaluate an agent's ability to resolve ambiguous instructions through collaborative dialogue, and (2) MCP-augmented tasks (19.9%), which require the hybrid use of GUI navigation and external tool invocation via the Model Context Protocol.
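In an MCP-augmented task, each agent step is either a GUI action (tap, type, swipe) or an MCP tool call, and the harness routes the action accordingly. The sketch below illustrates this hybrid action space with a minimal dispatcher; the `Action` type, the `dispatch` helper, and the toy handlers are hypothetical and not the MobileWorld API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Action:
    kind: str               # "gui" or "mcp" (illustrative convention)
    name: str               # e.g. "tap" or an MCP tool name
    args: Dict[str, object]

def dispatch(action: Action,
             gui_actions: Dict[str, Callable],
             mcp_tools: Dict[str, Callable]) -> str:
    """Route an agent action to the GUI driver or an MCP tool."""
    table = gui_actions if action.kind == "gui" else mcp_tools
    if action.name not in table:
        return json.dumps({"error": f"unknown action {action.name}"})
    # Call the handler with the action's keyword arguments.
    return json.dumps({"result": table[action.name](**action.args)})

# Toy handlers standing in for an emulator driver and an MCP server.
gui_actions = {"tap": lambda x, y: f"tapped ({x}, {y})"}
mcp_tools = {"get_weather": lambda city: f"sunny in {city}"}

print(dispatch(Action("gui", "tap", {"x": 120, "y": 340}), gui_actions, mcp_tools))
print(dispatch(Action("mcp", "get_weather", {"city": "Hangzhou"}), gui_actions, mcp_tools))
```

The key design point is that both action kinds share one serialized interface, so the agent's policy can interleave GUI navigation and tool calls within a single trajectory.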
Benchmark Comparison
Comparison of online mobile GUI agent benchmarks. MobileWorld uniquely enables deterministic evaluation even for applications requiring backends (e.g., messaging) by using self-hosted environments. We also introduce novel task paradigms: agent-user interaction and Model Context Protocol (MCP) augmentation.
| Benchmark | #Apps | #Tasks | Agent-User Int. | MCP-Aug. | Backend-Req. Apps | Deterministic Eval. |
|---|---|---|---|---|---|---|
| AndroidArena | 16 | 221 | ✗ | ✗ | ✓ | ✗ |
| A3 | 20 | 201 | ✗ | ✗ | ✓ | ✗ |
| Pro-Bench | 34 | 200 | ✗ | ✗ | ✓ | ✗ |
| AndroidDaily | 48 | 235 | ✗ | ✗ | ✓ | ✗ |
| SPA-Bench | 66 | 340 | ✗ | ✗ | ✓ | ✗ |
| MobileAgentBench | 10 | 100 | ✗ | ✗ | ✗ | ✓ |
| AndroidLab | 9 | 138 | ✗ | ✗ | ✗ | ✓ |
| AndroidWorld | 20 | 116 | ✗ | ✗ | ✗ | ✓ |
| MobileWorld (Ours) | 20 | 201 | ✓ | ✓ | ✓ | ✓ |
Data Statistics
We summarize key statistics of MobileWorld, highlighting four core dimensions:
- Diverse domains: Tasks span communication, productivity, and navigation. Approximately 95% of tasks involve third-party applications, aligning with authentic mobile usage patterns and ensuring real-world relevance.
- Cross-app complexity: 62.2% of tasks require coordination between multiple apps, and 12.4% involve three or more.
- Novel task categories: Agent-user interaction tasks and MCP-augmented tasks account for 42.3% of the dataset. These tasks go beyond pure GUI navigation by requiring dynamic user engagement and external tool use.
- Rigorous verification: Multiple verification methods assess task success: 47.3% of tasks rely on self-hosted database verification, and 36.8% use local storage inspection, highlighting the importance of apps with fully controlled backends.
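Database-backed verification is what makes success checks deterministic: for a self-hosted app (e.g., messaging), the verifier queries the backend database directly rather than scraping the UI. The sketch below illustrates the idea with SQLite; the schema and the `check_message_sent` helper are hypothetical, not MobileWorld's actual verifier.

```python
import sqlite3

def check_message_sent(conn: sqlite3.Connection, recipient: str, text: str) -> bool:
    """Return True if the backend records a message with this recipient and body."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE recipient = ? AND body = ?",
        (recipient, text),
    ).fetchone()
    return row[0] > 0

# Simulate the self-hosted backend state after an agent run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (recipient TEXT, body TEXT)")
conn.execute("INSERT INTO messages VALUES ('alice', 'meeting at 3pm')")

print(check_message_sent(conn, "alice", "meeting at 3pm"))  # True
print(check_message_sent(conn, "bob", "meeting at 3pm"))    # False
```

Because the check reads ground-truth backend state, it is insensitive to UI rendering differences across runs, which is exactly why fully controlled backends matter for deterministic evaluation.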
System Architecture
Key Findings
Our evaluation reveals a sharp performance drop relative to AndroidWorld, where the best agents achieve success rates exceeding 90%. On MobileWorld:
- The top-performing agentic framework (GPT-5 + UI-Ins-7B) reaches only a 51.7% overall success rate, leaving substantial room for improvement.
- End-to-end models show a stark capability collapse on the novel task categories: most baseline models score below 10% on agent-user interaction tasks and near 0% on MCP-augmented tasks.
- Our error analysis identifies five open research challenges: (1) detecting user ambiguity and engaging in clarification, (2) MCP context management, (3) long-term memory and state checking, (4) complex logical reasoning, and (5) spatial-temporal context awareness.
Getting Started
Requirements: Docker with privileged container support, KVM for Android emulator acceleration, Python 3.12+, and a Linux host (Windows via WSL2 with KVM is also supported).
Installation:
```shell
git clone https://github.com/Tongyi-MAI/MobileWorld.git
cd MobileWorld
uv sync
```
Check environment and pull Docker image:
```shell
sudo mw env check
```
Launch Docker containers:
```shell
sudo mw env run --count 5 --launch-interval 20
```
Run evaluation:
```shell
sudo mw eval \
  --agent_type qwen3vl \
  --task ALL \
  --max_round 50 \
  --model_name Qwen3-VL-235B-A22B \
  --llm_base_url [openai_compatible_url] \
  --enable_mcp
```
Citation
If you find MobileWorld useful in your research, please cite our paper:
```bibtex
@misc{kong2025mobileworldbenchmarkingautonomousmobile,
  title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments},
  author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
  year={2025},
  eprint={2512.19432},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.19432},
}
```
Acknowledgements
We thank AndroidWorld and AndroidLab for their open-source contributions.