MAI-UI: A Foundational GUI Agent for Mobile Intelligent Assistance
Tongyi Group, Alibaba Group
*Lead contributors.
All authors are core contributors.
†Corresponding author: yue.w@alibaba-inc.com
Mobile Intelligent Assistance in Real-World Scenarios.
Watch MAI-UI in action across different scenarios and capabilities.
Demo 1: Office Scenario
Demo 2: Daily Life Scenario
Demo 3: Shopping Scenario
Demo 4: Travel Scenario
Overall performance across GUI grounding and navigation.
GUI Grounding Overview
Data Pipeline
Training Paradigm
For GUI grounding, we follow the Instruction-as-Reasoning paradigm of UI-Ins: an SFT stage teaches the model to reason, and an RL stage lets it explore the appropriate reasoning pathway.
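The RL stage needs a scalar signal for each sampled response. The reward used by MAI-UI is not specified on this page; the sketch below assumes a point-in-box check, a common choice for grounding, and a hypothetical output format in which the response ends with a click point written as `(x, y)`. The names `parse_point` and `point_in_box_reward` are illustrative only.

```python
import re
from typing import Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Assumption: the model writes its reasoning first and ends with "(x, y)".
_POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(response: str) -> Optional[Point]:
    """Return the last "(x, y)" pair found in the response, or None."""
    matches = _POINT_RE.findall(response)
    if not matches:
        return None
    x, y = matches[-1]
    return float(x), float(y)


def point_in_box_reward(response: str, target: Box) -> float:
    """1.0 if the predicted point falls inside the target box, else 0.0."""
    point = parse_point(response)
    if point is None:
        return 0.0  # unparseable responses earn no reward
    x, y = point
    x1, y1, x2, y2 = target
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```

Under these assumptions, a response whose reasoning leads to a point inside the target element scores 1.0, while a miss or a malformed answer scores 0.0, regardless of which reasoning pathway produced it.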
Grounding Performance
| Model | Avg |
|---|---|
| Gemini-3-Pro | 72.7 |
| Seed1.8 | 73.1 |
| GTA1-7B | 50.1 |
| UI-Venus-7B | 50.8 |
| GUI-Owl-7B | 54.9 |
| GUI-Owl-32B | 58.0 |
| GTA1-32B | 63.6 |
| UI-Venus-72B | 61.9 |
| UI-MAI-2B | 57.4 |
| + Zoom-In | 62.8 |
| UI-MAI-8B | 65.8 |
| + Zoom-In | 70.9 |
| UI-MAI-32B | 67.9 |
| + Zoom-In | 73.5 |
| Model | Avg |
|---|---|
| InfiGUI-G1-3B | 22.0 |
| OS-Atlas-7B | 9.0 |
| UI-TARS-1.5-7B | 22.3 |
| UI-Venus-7B | 26.5 |
| InfiGUI-G1-7B | 26.1 |
| Phi-Ground | 27.2 |
| UI-TARS-72B | 25.5 |
| UI-Venus-72B | 36.8 |
| UI-MAI-2B | 30.3 |
| + Zoom-In | 31.9 |
| UI-MAI-8B | 40.7 |
| + Zoom-In | 42.4 |
| UI-MAI-32B | 47.1 |
| + Zoom-In | 49.2 |
| Model | Avg |
|---|---|
| InfiGUI-G1-3B | 73.4 |
| OS-Atlas-7B | 41.4 |
| UI-TARS-1.5-7B | 64.3 |
| UGround-V1-7B | 65.7 |
| GTA1-7B | 78.5 |
| GUI-Owl-7B | 80.5 |
| InfiGUI-G1-7B | 80.8 |
| GUI-Owl-32B | 83.0 |
| GTA1-32B | 83.4 |
| UI-TARS-DPO-72B | 74.3 |
| InternVL3-78B | 72.2 |
| UI-MAI-2B | 82.6 |
| UI-MAI-8B | 88.8 |
| UI-MAI-32B | 91.3 |
| Agent Model | Avg |
|---|---|
| UI-TARS-1.5-7B | 52.8 |
| GTA1-7B | 55.1 |
| GUI-Owl-7B | 55.9 |
| UI-Venus-7B | 58.8 |
| OpenCUA-32B | 59.6 |
| GUI-Owl-32B | 58.0 |
| GTA1-32B | 65.2 |
| UI-Venus-72B | 70.4 |
| UI-MAI-2B | 52.0 |
| + Zoom-In | 55.9 |
| UI-MAI-8B | 60.1 |
| + Zoom-In | 64.2 |
| UI-MAI-32B | 67.6 |
| + Zoom-In | 70.9 |
| Agent Model | Avg |
|---|---|
| Operator | 57.8 |
| Jedi-3B | 61.0 |
| Jedi-7B | 63.8 |
| UI-TARS-1.5-7B | 64.2 |
| GTA1-7B | 67.7 |
| Qwen2.5-VL-32B | 59.6 |
| OpenCUA-32B | 70.2 |
| GTA1-32B | 72.2 |
| UI-MAI-2B | 63.5 |
| + Zoom-In | 66.3 |
| UI-MAI-8B | 68.6 |
| + Zoom-In | 72.9 |
| UI-MAI-32B | 73.9 |
| + Zoom-In | 75.0 |
| Model | Avg |
|---|---|
| Phi-Ground | 83.8 |
| OS-Atlas-7B | 85.1 |
| UI-TARS-1.5-7B | 89.0 |
| OpenCUA-7B | 92.3 |
| GTA1-7B | 92.4 |
| GUI-Owl-7B | 92.8 |
| UI-Venus-7B | 94.1 |
| GUI-Owl-32B | 93.2 |
| OpenCUA-32B | 93.4 |
| GTA1-32B | 95.2 |
| UI-Venus-72B | 95.3 |
| UI-MAI-2B | 92.5 |
| UI-MAI-8B | 95.2 |
| UI-MAI-32B | 96.5 |
Device-Cloud Collaboration
System Architecture
Demo
Device-cloud collaboration for simple tasks: no cloud model invocation is needed.
Device-cloud collaboration for complex tasks: the cloud model is invoked when the task is beyond the device model's capabilities.
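The two demo cases above can be read as a routing decision made at each step. The following is a minimal sketch of that idea, not the MAI-UI implementation: the page does not say how the system decides that a task exceeds the device model's capabilities, so the `within_capability` flag, the `DeviceOutput` type, and `route_step` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DeviceOutput:
    action: str
    # Hypothetical signal: the on-device model judges the step to be
    # within its own capabilities.
    within_capability: bool


def route_step(
    observation: str,
    device_model: Callable[[str], DeviceOutput],
    cloud_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (action, source), where source is "device" or "cloud"."""
    local = device_model(observation)
    if local.within_capability:
        # Simple task: the device model's action is used directly.
        return local.action, "device"
    # Complex task: fall back to the cloud model.
    return cloud_model(observation), "cloud"
```

In this sketch the device model always runs first, so simple tasks never trigger a cloud call; the cloud model is consulted only on fallback.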
Performance
Evaluation on a Real-World Benchmark
MobileWorld Benchmark
To evaluate MAI-UI's practical capabilities, we adopt MobileWorld, our comprehensive benchmark designed to bridge the gap between benchmark evaluation and real-world mobile use. MobileWorld features over 200 realistic tasks spanning more than 15 open-source applications across critical domains, including e-commerce (Mall4Uni, mirroring Temu/Amazon), enterprise communication (Mattermost, mirroring Microsoft Teams/Slack), social media (Mastodon, mirroring X/Twitter), and daily productivity tools.
Case Study of MCP Call
Case Study of User Interaction
Citation
If you find MAI-UI useful in your research, please cite our papers:
@misc{zhou2025maiuitechnicalreportrealworld,
title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
year={2025},
eprint={2512.22047},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.22047},
}
@misc{kong2025mobileworldbenchmarkingautonomousmobile,
title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments},
author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
year={2025},
eprint={2512.19432},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.19432},
}
@misc{chen2025uiinsenhancingguigrounding,
title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
author={Liangyu Chen and Hanzhang Zhou and Chenglin Cai and Jianan Zhang and Panrong Tong and Quyu Kong and Xu Zhang and Chen Liu and Yuqi Liu and Wenxuan Wang and Yue Wang and Qin Jin and Steven Hoi},
year={2025},
eprint={2510.20286},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.20286},
}