MAI-UI: Real-World Centric Foundation GUI Agents

Hanzhang Zhou*, Xu Zhang*, Panrong Tong, Jianan Zhang, Liangyu Chen,
Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven HOI
Tongyi Lab, Alibaba Group
*Lead contributors
All authors are core contributors
Correspondence: yue.w@alibaba-inc.com

MAI-UI is a family of foundation GUI agent models from Tongyi-MAI Lab, ranging from 2B to 235B parameters.

Overall Performance

MAI-UI achieves state-of-the-art (SOTA) performance on five GUI grounding benchmarks and on navigation benchmarks (AndroidWorld, MobileWorld).

Technical Highlights

For the first time, MAI-UI natively integrates three core capabilities (user interaction, MCP tool calling, and device-cloud collaboration) into a unified architecture, using autonomously evolving data pipelines and large-scale online reinforcement learning. The 2B and 8B models are currently open-sourced.

MCP Tool Usage

Model Context Protocol tools for enhanced functionality.
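
MCP tool calls are JSON-RPC 2.0 messages that use the `tools/call` method. The sketch below shows what such a request might look like when an agent decides to use a tool instead of a GUI action. The tool name `search_route` and its arguments are hypothetical and chosen only for illustration; how MAI-UI formats tool calls internally is not described on this page.

```python
import json
from itertools import count

# JSON-RPC request ids should be unique within a session.
_ids = count(1)


def make_tool_call(name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = make_tool_call(
    "search_route", {"origin": "Hangzhou", "destination": "Shanghai"}
)
print(json.dumps(request, indent=2))
```

An MCP server would answer with a JSON-RPC response carrying the same `id`, which the agent can then fold back into its context before choosing the next action.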

User Interaction

Advanced user interaction capabilities in real-world scenarios.

Online Reinforcement Learning

Large-scale online RL for continuous model improvement and adaptation.

Device-Cloud Collaboration

Efficient collaboration between device and cloud for balanced performance.

Real-World Demos

Watch MAI-UI in action across different real-world scenarios.

Demo 1: Office Scenario

Demo 2: Daily Life Scenario

Demo 3: Shopping Scenario

Demo 4: Travel Scenario

Device-Cloud Collaboration

Device-Cloud Collaboration: Simple Tasks

Device-Cloud Collaboration: Complex Tasks

Evaluating in Real-World MobileWorld

We also introduce the MobileWorld benchmark. While maintaining the same level of rigorous, reproducible evaluation as AndroidWorld, MobileWorld is a more challenging online mobile-use benchmark that adds four features to better capture real-world agent behavior:

Broad Real-World Coverage: 201 carefully curated tasks across 20 mobile applications.
Long-Horizon Tasks: Multi-step reasoning and cross-app workflows.
Agent-User Interaction: Novel tasks requiring dynamic human-agent collaboration.
MCP-Augmented Tasks: Support for the Model Context Protocol (MCP) to evaluate hybrid tool usage.

Comparison between AndroidWorld and MobileWorld. MobileWorld features harder tasks with more steps and more cross-app workflows, and current SOTA agents reach lower accuracy on it, reflecting its increased difficulty.

GUI Grounding Performance

ScreenSpot-Pro
Model               Avg
Gemini-3-Pro        72.7
Seed1.8             73.1
GTA1-7B             50.1
UI-Venus-7B         50.8
GUI-Owl-7B          54.9
GUI-Owl-32B         58.0
GTA1-32B            63.6
UI-Venus-72B        61.9
MAI-UI-2B           57.4
  + Zoom-In         62.8
MAI-UI-8B           65.8
  + Zoom-In         70.9
MAI-UI-32B          67.9
  + Zoom-In         73.5

UI-Vision
Model               Avg
InfiGUI-G1-3B       22.0
OS-Atlas-7B          9.0
UI-TARS-1.5-7B      22.3
UI-Venus-7B         26.5
InfiGUI-G1-7B       26.1
Phi-Ground          27.2
UI-TARS-72B         25.5
UI-Venus-72B        36.8
MAI-UI-2B           30.3
  + Zoom-In         31.9
MAI-UI-8B           40.7
  + Zoom-In         42.4
MAI-UI-32B          47.1
  + Zoom-In         49.2

MMBench-GUI L2
Model               Avg
InfiGUI-G1-3B       73.4
OS-Atlas-7B         41.4
UI-TARS-1.5-7B      64.3
UGround-V1-7B       65.7
GTA1-7B             78.5
GUI-Owl-7B          80.5
InfiGUI-G1-7B       80.8
GUI-Owl-32B         83.0
GTA1-32B            83.4
UI-TARS-DPO-72B     74.3
InternVL3-78B       72.2
MAI-UI-2B           82.6
MAI-UI-8B           88.8
MAI-UI-32B          91.3

OSWorld-G
Agent Model         Avg
UI-TARS-1.5-7B      52.8
GTA1-7B             55.1
GUI-Owl-7B          55.9
UI-Venus-7B         58.8
OpenCUA-32B         59.6
GUI-Owl-32B         58.0
GTA1-32B            65.2
UI-Venus-72B        70.4
MAI-UI-2B           52.0
  + Zoom-In         55.9
MAI-UI-8B           60.1
  + Zoom-In         64.2
MAI-UI-32B          67.6
  + Zoom-In         70.9

OSWorld-G-Refine
Agent Model         Avg
Operator            57.8
Jedi-3B             61.0
Jedi-7B             63.8
UI-TARS-1.5-7B      64.2
GTA1-7B             67.7
Qwen2.5-VL-32B      59.6
OpenCUA-32B         70.2
GTA1-32B            72.2
MAI-UI-2B           63.5
  + Zoom-In         66.3
MAI-UI-8B           68.6
  + Zoom-In         72.9
MAI-UI-32B          73.9
  + Zoom-In         75.0

ScreenSpot-V2
Model               Avg
Phi-Ground          83.8
OS-Atlas-7B         85.1
UI-TARS-1.5-7B      89.0
OpenCUA-7B          92.3
GTA1-7B             92.4
GUI-Owl-7B          92.8
UI-Venus-7B         94.1
GUI-Owl-32B         93.2
OpenCUA-32B         93.4
GTA1-32B            95.2
UI-Venus-72B        95.3
MAI-UI-2B           92.5
MAI-UI-8B           95.2
MAI-UI-32B          96.5
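
Four of the tables above report each MAI-UI model both with and without Zoom-In, so the gain from Zoom-In can be read off by subtraction. The short script below does that arithmetic on the averages copied from those tables. The names `SCORES` and `zoom_in_gains` are illustrative and not part of any MAI-UI release.

```python
# Average scores copied from the tables above: (without Zoom-In, with Zoom-In).
SCORES = {
    "ScreenSpot-Pro": {
        "MAI-UI-2B": (57.4, 62.8),
        "MAI-UI-8B": (65.8, 70.9),
        "MAI-UI-32B": (67.9, 73.5),
    },
    "UI-Vision": {
        "MAI-UI-2B": (30.3, 31.9),
        "MAI-UI-8B": (40.7, 42.4),
        "MAI-UI-32B": (47.1, 49.2),
    },
    "OSWorld-G": {
        "MAI-UI-2B": (52.0, 55.9),
        "MAI-UI-8B": (60.1, 64.2),
        "MAI-UI-32B": (67.6, 70.9),
    },
    "OSWorld-G-Refine": {
        "MAI-UI-2B": (63.5, 66.3),
        "MAI-UI-8B": (68.6, 72.9),
        "MAI-UI-32B": (73.9, 75.0),
    },
}


def zoom_in_gains(scores):
    """Return {benchmark: {model: gain}}, with gains rounded to one decimal."""
    return {
        bench: {
            model: round(zoomed - base, 1)
            for model, (base, zoomed) in rows.items()
        }
        for bench, rows in scores.items()
    }


if __name__ == "__main__":
    for bench, rows in zoom_in_gains(SCORES).items():
        for model, gain in rows.items():
            print(f"{bench:18s} {model:12s} +{gain:.1f}")
```

On these numbers the gain is largest on ScreenSpot-Pro (between 5.1 and 5.6 points) and smallest on UI-Vision (between 1.6 and 2.1 points).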

Citation

If you find MAI-UI useful in your research, please cite our papers:

@article{zhou2025mai,
  title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
  author={Zhou, Hanzhang and Zhang, Xu and Tong, Panrong and Zhang, Jianan and Chen, Liangyu and Kong, Quyu and Cai, Chenglin and Liu, Chen and Wang, Yue and Zhou, Jingren and others},
  journal={arXiv preprint arXiv:2512.22047},
  year={2025}
} 
              
@article{kong2025mobileworld,
  title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments},
  author={Kong, Quyu and Zhang, Xu and Yang, Zhenyu and Gao, Nolan and Liu, Chen and Tong, Panrong and Cai, Chenglin and Zhou, Hanzhang and Zhang, Jianan and Chen, Liangyu and others},
  journal={arXiv preprint arXiv:2512.19432},
  year={2025}
} 
              
@article{chen2025ui,
  title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
  author={Chen, Liangyu and Zhou, Hanzhang and Cai, Chenglin and Zhang, Jianan and Tong, Panrong and Kong, Quyu and Zhang, Xu and Liu, Chen and Liu, Yuqi and Wang, Wenxuan and others},
  journal={arXiv preprint arXiv:2510.20286},
  year={2025}
}