MAI-UI: A Foundational GUI Agent for Mobile Intelligent Assistance

Hanzhang Zhou*, Xu Zhang*, Panrong Tong, Jianan Zhang, Liangyu Chen,

Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven HOI

Tongyi Group, Alibaba Group

*Lead contributors.
All authors are core contributors.
Correspondence to yue.w@alibaba-inc.com

Mobile Intelligent Assistance in Real-World Scenarios.

Watch MAI-UI in action across different scenarios and capabilities.

Demo 1: Office Scenario

Demo 2: Daily Life Scenario

Demo 3: Shopping Scenario

Demo 4: Travel Scenario

Overall performance across GUI grounding and navigation.

MAI-UI achieves SOTA performance on GUI grounding, outperforming Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro and significantly outperforming existing models on UI-Vision (Left). MAI-UI achieves SOTA performance on the widely used navigation benchmark AndroidWorld (Middle) and sets a new SOTA on the real-world benchmark MobileWorld (Right).

GUI Grounding Overview

Data Pipeline

Training Paradigm

For GUI grounding, we follow the Instruction-as-Reasoning paradigm of UI-Ins: an SFT stage teaches the model to reason, and an RL stage lets the model explore the appropriate reasoning pathway.
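As a rough illustration of what the RL stage could optimize, the sketch below scores sampled rollouts with a binary point-in-box reward, a common choice for grounding tasks. The reward shape, the perspective labels, and all coordinates are assumptions made for illustration; the page does not specify the reward actually used.

```python
def point_in_box_reward(pred_xy, target_box):
    """Return 1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = pred_xy
    left, top, right, bottom = target_box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


# Hypothetical rollouts: each pairs a reasoning perspective with a predicted click.
rollouts = [
    ("appearance", (120, 48)),
    ("function", (410, 300)),
    ("location", (131, 52)),
]
target_box = (100, 40, 160, 60)  # illustrative (left, top, right, bottom) in pixels

# Rollouts whose click lands inside the box are rewarded; the others are not.
rewards = {name: point_in_box_reward(xy, target_box) for name, xy in rollouts}
print(rewards)
```

Under this assumed reward, reasoning pathways that lead to a correct click are reinforced, which is one way a model could learn which pathway suits a given instruction.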

Grounding Performance

ScreenSpot-Pro
Model               Avg
Gemini-3-Pro        72.7
Seed1.8             73.1
GTA1-7B             50.1
UI-Venus-7B         50.8
GUI-Owl-7B          54.9
GUI-Owl-32B         58.0
GTA1-32B            63.6
UI-Venus-72B        61.9
UI-MAI-2B           57.4
  + Zoom-In         62.8
UI-MAI-8B           65.8
  + Zoom-In         70.9
UI-MAI-32B          67.9
  + Zoom-In         73.5

UI-Vision
Model               Avg
InfiGUI-G1-3B       22.0
OS-Atlas-7B          9.0
UI-TARS-1.5-7B      22.3
UI-Venus-7B         26.5
InfiGUI-G1-7B       26.1
Phi-Ground          27.2
UI-TARS-72B         25.5
UI-Venus-72B        36.8
UI-MAI-2B           30.3
  + Zoom-In         31.9
UI-MAI-8B           40.7
  + Zoom-In         42.4
UI-MAI-32B          47.1
  + Zoom-In         49.2

MMBench-GUI L2
Model               Avg
InfiGUI-G1-3B       73.4
OS-Atlas-7B         41.4
UI-TARS-1.5-7B      64.3
UGround-V1-7B       65.7
GTA1-7B             78.5
GUI-Owl-7B          80.5
InfiGUI-G1-7B       80.8
GUI-Owl-32B         83.0
GTA1-32B            83.4
UI-TARS-DPO-72B     74.3
InternVL3-78B       72.2
UI-MAI-2B           82.6
UI-MAI-8B           88.8
UI-MAI-32B          91.3

OSWorld-G
Agent Model         Avg
UI-TARS-1.5-7B      52.8
GTA1-7B             55.1
GUI-Owl-7B          55.9
UI-Venus-7B         58.8
OpenCUA-32B         59.6
GUI-Owl-32B         58.0
GTA1-32B            65.2
UI-Venus-72B        70.4
UI-MAI-2B           52.0
  + Zoom-In         55.9
UI-MAI-8B           60.1
  + Zoom-In         64.2
UI-MAI-32B          67.6
  + Zoom-In         70.9

OSWorld-G-Refine
Agent Model         Avg
Operator            57.8
Jedi-3B             61.0
Jedi-7B             63.8
UI-TARS-1.5-7B      64.2
GTA1-7B             67.7
Qwen2.5-VL-32B      59.6
OpenCUA-32B         70.2
GTA1-32B            72.2
UI-MAI-2B           63.5
  + Zoom-In         66.3
UI-MAI-8B           68.6
  + Zoom-In         72.9
UI-MAI-32B          73.9
  + Zoom-In         75.0

ScreenSpot-V2
Model               Avg
Phi-Ground          83.8
OS-Atlas-7B         85.1
UI-TARS-1.5-7B      89.0
OpenCUA-7B          92.3
GTA1-7B             92.4
GUI-Owl-7B          92.8
UI-Venus-7B         94.1
GUI-Owl-32B         93.2
OpenCUA-32B         93.4
GTA1-32B            95.2
UI-Venus-72B        95.3
UI-MAI-2B           92.5
UI-MAI-8B           95.2
UI-MAI-32B          96.5
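
To make the effect of Zoom-In easier to read off, the short sketch below computes the absolute gain for each model size from the ScreenSpot-Pro averages reported above. The numbers are copied from that table; the function and variable names are illustrative only.

```python
# ScreenSpot-Pro averages copied from the table above: (base, with Zoom-In).
SCREENSPOT_PRO = {
    "UI-MAI-2B": (57.4, 62.8),
    "UI-MAI-8B": (65.8, 70.9),
    "UI-MAI-32B": (67.9, 73.5),
}


def zoom_in_gains(scores):
    """Return the absolute gain (in points) from Zoom-In for each model."""
    return {name: round(zoomed - base, 1) for name, (base, zoomed) in scores.items()}


gains = zoom_in_gains(SCREENSPOT_PRO)
for name, gain in gains.items():
    print(f"{name}: +{gain}")
# prints:
# UI-MAI-2B: +5.4
# UI-MAI-8B: +5.1
# UI-MAI-32B: +5.6
```

On this benchmark the gain is roughly five points at every model size.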

Device-Cloud Collaboration

System Architecture

Demo

Device-cloud collaboration for simple tasks, which require no cloud model invocation.

Device-cloud collaboration for complex tasks, which require cloud model invocation when the task is beyond the device model's capabilities.
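
The two demos above suggest a simple routing pattern: try the on-device model first and escalate to the cloud only when needed. The sketch below is a minimal illustration under stated assumptions; `StepResult`, the `needs_cloud` flag, and the toy models are hypothetical, and the page does not describe how MAI-UI actually decides when to escalate.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StepResult:
    action: str        # the UI action proposed for this step
    needs_cloud: bool  # assumed flag: the model judges the task beyond its ability


def run_step(
    observation: str,
    device_model: Callable[[str], StepResult],
    cloud_model: Callable[[str], StepResult],
) -> Tuple[str, str]:
    """Try the on-device model first; call the cloud model only if it asks to."""
    local = device_model(observation)
    if not local.needs_cloud:
        return "device", local.action
    remote = cloud_model(observation)
    return "cloud", remote.action


# Toy stand-ins for the two models, for illustration only.
def toy_device(observation: str) -> StepResult:
    return StepResult(action="tap(settings)", needs_cloud="complex" in observation)


def toy_cloud(observation: str) -> StepResult:
    return StepResult(action="open(checkout)", needs_cloud=False)


print(run_step("simple task", toy_device, toy_cloud))   # ('device', 'tap(settings)')
print(run_step("complex task", toy_device, toy_cloud))  # ('cloud', 'open(checkout)')
```

With this shape, simple tasks never trigger a cloud call, which matches the first demo's description.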

Performance

Evaluation on a Real-World Benchmark

MobileWorld Benchmark

To evaluate MAI-UI’s practical capabilities, we adopt our MobileWorld benchmark, a comprehensive benchmark designed to bridge the gap between benchmark evaluation and real-world mobile use. MobileWorld features over 200 realistic tasks spanning 15+ open-source applications across critical domains including e-commerce (Mall4Uni, mirroring Temu/Amazon), enterprise communication (Mattermost, mirroring Microsoft Teams/Slack), social media (Mastodon, mirroring X/Twitter), and daily productivity tools.

Case Study of MCP Call

Case studies of MCP tool use by MAI-UI. (a) MCP tools provide shortcuts that compress multiple UI actions into a few API calls; (b) MCP tools bring traditionally desktop-only workflows (e.g., GitHub commit search) to mobile. The user instruction for (a) is: “Compare the two apartment listings sent by the agent and determine which has the shorter driving time to Alibaba Xixi Campus (Zone C; 969 Wenyi West Road, Yuhang District, Hangzhou). Send the address of the nearer apartment to my friend Mia”.
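
Case (a) can be pictured as follows: instead of opening a map app and entering each address through the UI, the agent calls a routing tool once per listing and compares the results. This is a sketch under stated assumptions; `driving_minutes` stands in for a hypothetical MCP map tool, and the addresses and durations are placeholders, not data from the case study.

```python
from typing import Callable, Sequence


def nearer_address(
    addresses: Sequence[str],
    destination: str,
    driving_minutes: Callable[[str, str], float],
) -> str:
    """Return the address with the shortest driving time to the destination."""
    return min(addresses, key=lambda addr: driving_minutes(addr, destination))


# Placeholder durations standing in for tool responses (not real data).
PLACEHOLDER_MINUTES = {"Listing A address": 25.0, "Listing B address": 40.0}


def fake_driving_minutes(origin: str, destination: str) -> float:
    """Hypothetical stand-in for an MCP map tool call."""
    return PLACEHOLDER_MINUTES[origin]


choice = nearer_address(
    list(PLACEHOLDER_MINUTES), "Alibaba Xixi Campus (Zone C)", fake_driving_minutes
)
print(choice)  # Listing A address
```

Two tool calls replace what would otherwise be a long sequence of taps and text entry, which is the compression the caption describes.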

Case Study of User Interaction

A case study of agent-user interaction. The user instruction is: “In the Downloads folder, locate resume file(s) downloaded within one month and send them to my HR colleague with the subject "candidates_cv".”

Citation

If you find MAI-UI useful in your research, please cite our papers:

@misc{zhou2025maiuitechnicalreportrealworld,
  title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
  author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
  year={2025},
  eprint={2512.22047},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22047},
}
@misc{kong2025mobileworldbenchmarkingautonomousmobile,
  title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments},
  author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
  year={2025},
  eprint={2512.19432},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.19432},
}
@misc{chen2025uiinsenhancingguigrounding,
  title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
  author={Liangyu Chen and Hanzhang Zhou and Chenglin Cai and Jianan Zhang and Panrong Tong and Quyu Kong and Xu Zhang and Chen Liu and Yuqi Liu and Wenxuan Wang and Yue Wang and Qin Jin and Steven Hoi},
  year={2025},
  eprint={2510.20286},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.20286},
}