MAI-UI: Real-World Centric Foundation GUI Agents

Hanzhang Zhou*, Xu Zhang*, Panrong Tong, Jianan Zhang, Liangyu Chen,
Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
Tongyi Lab, Alibaba Group
*Lead contributors
All authors are core contributors
Correspondence to: yue.w@alibaba-inc.com

MAI-UI is a family of foundation GUI agent models from Tongyi-MAI Lab, ranging from 2B to 235B parameters.

Overall Performance

MAI-UI achieves state-of-the-art (SOTA) results on 5 GUI grounding benchmarks and on the navigation benchmarks AndroidWorld and MobileWorld.

Technical Highlights

For the first time, MAI-UI natively integrates three core capabilities (user interaction, MCP tool calling, and device-cloud collaboration) into a unified architecture, using autonomously evolving data pipelines and large-scale online reinforcement learning. The 2B and 8B models are currently open-sourced.

MCP Tool Usage

Model Context Protocol tools for enhanced functionality.
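
As a rough illustration of what an MCP tool call looks like on the wire, the sketch below builds a `tools/call` request in the JSON-RPC 2.0 envelope used by the Model Context Protocol. The tool name `search_route` and its arguments are hypothetical, and nothing here describes how MAI-UI itself emits or routes such calls; it only shows the general message shape.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry an id.
_ids = count(1)


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(
    "search_route", {"origin": "Hangzhou", "destination": "Shanghai"}
)
print(json.dumps(request, indent=2))
```

An agent that mixes GUI actions with tool calls would presumably choose, at each step, between emitting a message like this and emitting a screen action such as a tap; that decision logic is outside this sketch.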

User Interaction

Advanced user interaction capabilities in real-world scenarios.

Online Reinforcement Learning

Large-scale online RL for continuous model improvement and adaptation.

Device-Cloud Collaboration

Efficient collaboration between device and cloud for balanced performance.

Real-World Demos

Watch MAI-UI in action across different real-world scenarios.

Demo 1: Office Scenario

Demo 2: Daily Life Scenario

Demo 3: Shopping Scenario

Demo 4: Travel Scenario

Device-Cloud Collaboration

Device-Cloud Collaboration: Simple Tasks

Device-Cloud Collaboration: Complex Tasks

Evaluating in Real-World MobileWorld

We also introduce the MobileWorld benchmark. While maintaining the same level of rigorous, reproducible evaluation as AndroidWorld, MobileWorld is a more challenging online mobile-use benchmark that adds four features to better capture real-world agent behavior.

🎯 Broad Real-World Coverage: 201 carefully curated tasks across 20 mobile applications
🔄 Long-Horizon Tasks: Multi-step reasoning and cross-app workflows
👥 Agent-User Interaction: Novel tasks requiring dynamic human-agent collaboration
🔧 MCP-Augmented Tasks: Supports the Model Context Protocol (MCP) to evaluate hybrid tool usage

Comparison between AndroidWorld and MobileWorld. MobileWorld features harder tasks with more steps and more cross-app workflows, and the best reported (SOTA) accuracy on it is lower, reflecting its increased difficulty.

GUI Grounding Performance

ScreenSpot-Pro
Model           Avg
Gemini-3-Pro    72.7
Seed1.8         73.1
GTA1-7B         50.1
UI-Venus-7B     50.8
GUI-Owl-7B      54.9
GUI-Owl-32B     58.0
GTA1-32B        63.6
UI-Venus-72B    61.9
UI-MAI-2B       57.4
  + Zoom-In     62.8
UI-MAI-8B       65.8
  + Zoom-In     70.9
UI-MAI-32B      67.9
  + Zoom-In     73.5

UI-Vision
Model             Avg
InfiGUI-G1-3B     22.0
OS-Atlas-7B       9.0
UI-TARS-1.5-7B    22.3
UI-Venus-7B       26.5
InfiGUI-G1-7B     26.1
Phi-Ground        27.2
UI-TARS-72B       25.5
UI-Venus-72B      36.8
UI-MAI-2B         30.3
  + Zoom-In       31.9
UI-MAI-8B         40.7
  + Zoom-In       42.4
UI-MAI-32B        47.1
  + Zoom-In       49.2

MMBench-GUI L2
Model              Avg
InfiGUI-G1-3B      73.4
OS-Atlas-7B        41.4
UI-TARS-1.5-7B     64.3
UGround-V1-7B      65.7
GTA1-7B            78.5
GUI-Owl-7B         80.5
InfiGUI-G1-7B      80.8
GUI-Owl-32B        83.0
GTA1-32B           83.4
UI-TARS-DPO-72B    74.3
InternVL3-78B      72.2
UI-MAI-2B          82.6
UI-MAI-8B          88.8
UI-MAI-32B         91.3

OSWorld-G
Agent Model       Avg
UI-TARS-1.5-7B    52.8
GTA1-7B           55.1
GUI-Owl-7B        55.9
UI-Venus-7B       58.8
OpenCUA-32B       59.6
GUI-Owl-32B       58.0
GTA1-32B          65.2
UI-Venus-72B      70.4
UI-MAI-2B         52.0
  + Zoom-In       55.9
UI-MAI-8B         60.1
  + Zoom-In       64.2
UI-MAI-32B        67.6
  + Zoom-In       70.9

OSWorld-G-Refine
Agent Model       Avg
Operator          57.8
Jedi-3B           61.0
Jedi-7B           63.8
UI-TARS-1.5-7B    64.2
GTA1-7B           67.7
Qwen2.5-VL-32B    59.6
OpenCUA-32B       70.2
GTA1-32B          72.2
UI-MAI-2B         63.5
  + Zoom-In       66.3
UI-MAI-8B         68.6
  + Zoom-In       72.9
UI-MAI-32B        73.9
  + Zoom-In       75.0

ScreenSpot-V2
Model             Avg
Phi-Ground        83.8
OS-Atlas-7B       85.1
UI-TARS-1.5-7B    89.0
OpenCUA-7B        92.3
GTA1-7B           92.4
GUI-Owl-7B        92.8
UI-Venus-7B       94.1
GUI-Owl-32B       93.2
OpenCUA-32B       93.4
GTA1-32B          95.2
UI-Venus-72B      95.3
UI-MAI-2B         92.5
UI-MAI-8B         95.2
UI-MAI-32B        96.5
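
To make the effect of the "+ Zoom-In" rows easier to read, the short sketch below computes the point gain from Zoom-In for the 2B, 8B, and 32B models on the four benchmarks that report it. The numbers are copied from the tables above; the dictionary layout and the function name `zoom_in_gains` are illustrative choices, not part of any released tooling.

```python
# (base score, score with Zoom-In), copied from the tables above.
scores = {
    "ScreenSpot-Pro": {"2B": (57.4, 62.8), "8B": (65.8, 70.9), "32B": (67.9, 73.5)},
    "UI-Vision": {"2B": (30.3, 31.9), "8B": (40.7, 42.4), "32B": (47.1, 49.2)},
    "OSWorld-G": {"2B": (52.0, 55.9), "8B": (60.1, 64.2), "32B": (67.6, 70.9)},
    "OSWorld-G-Refine": {"2B": (63.5, 66.3), "8B": (68.6, 72.9), "32B": (73.9, 75.0)},
}


def zoom_in_gains(table):
    """Return the absolute gain (in points) from Zoom-In per benchmark and model size."""
    return {
        bench: {size: round(zoomed - base, 1) for size, (base, zoomed) in rows.items()}
        for bench, rows in table.items()
    }


gains = zoom_in_gains(scores)
for bench, row in gains.items():
    print(f"{bench}: {row}")
```

On these figures the gains are largest on ScreenSpot-Pro, at 5.1 to 5.6 points depending on model size. MMBench-GUI L2 and ScreenSpot-V2 list no Zoom-In rows, so they are left out.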

Citation

If you find MAI-UI useful in your research, please cite our papers:

@misc{zhou2025maiuitechnicalreportrealworld,
  title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
  author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
  year={2025},
  eprint={2512.22047},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22047},
}

@misc{kong2025mobileworldbenchmarkingautonomousmobile,
  title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments},
  author={Quyu Kong and Xu Zhang and Zhenyu Yang and Nolan Gao and Chen Liu and Panrong Tong and Chenglin Cai and Hanzhang Zhou and Jianan Zhang and Liangyu Chen and Zhidan Liu and Steven Hoi and Yue Wang},
  year={2025},
  eprint={2512.19432},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.19432},
}

@misc{chen2025uiinsenhancingguigrounding,
  title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
  author={Liangyu Chen and Hanzhang Zhou and Chenglin Cai and Jianan Zhang and Panrong Tong and Quyu Kong and Xu Zhang and Chen Liu and Yuqi Liu and Wenxuan Wang and Yue Wang and Qin Jin and Steven Hoi},
  year={2025},
  eprint={2510.20286},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.20286},
}