MAI-UI: Real-World Centric Foundation GUI Agents

Hanzhang Zhou^*, Xu Zhang^*, Panrong Tong, Jianan Zhang, Liangyu Chen,
Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang^†, Jingren Zhou, Steven HOI

Tongyi Lab, Alibaba Group

^*Lead contributors

All authors are core contributors

^†Correspondence to yue.w@alibaba-inc.com


MAI-UI is a family of foundation GUI agent models from Tongyi-MAI Lab, ranging in size from 2B to 235B parameters.


Overall Performance.

MAI-UI achieves state-of-the-art (SOTA) performance across 5 GUI grounding benchmarks and two navigation benchmarks (AndroidWorld and MobileWorld).

Technical Highlights

For the first time, MAI-UI natively integrates three core capabilities (user interaction, MCP tool calling, and device-cloud collaboration) into a unified architecture, using self-evolving data pipelines and large-scale online reinforcement learning. (Currently, the 2B and 8B models are open-sourced.)

MCP Tool Usage

Model Context Protocol tools for enhanced functionality.

User Interaction

Advanced user interaction capabilities in real-world scenarios.

Online Reinforcement Learning

Large-scale online RL for continuous model improvement and adaptation.

Device-Cloud Collaboration

Efficient collaboration between device and cloud for balanced performance.

Real-World Demos

Watch MAI-UI in action across different real-world scenarios.

Demo 1: Office Scenario

Demo 2: Daily Life Scenario

Demo 3: Shopping Scenario

Demo 4: Travel Scenario

Device-Cloud Collaboration

Device-Cloud Collaboration: Simple Tasks

Device-Cloud Collaboration: Complex Tasks

Evaluating in Real-World MobileWorld

We also introduce the MobileWorld benchmark. While maintaining the same level of rigorous, reproducible evaluation as AndroidWorld, MobileWorld is a more challenging online mobile-use benchmark that introduces four additional features to better capture real-world agent behavior.

🎯

Broad Real-World Coverage

201 carefully curated tasks across 20 mobile applications

🔄

Long-Horizon Tasks

Multi-step reasoning and cross-app workflows

👥

Agent-User Interaction

Novel tasks requiring dynamic human-agent collaboration

🔧

MCP-Augmented Tasks

Supports the Model Context Protocol (MCP) to evaluate hybrid tool usage

Comparison between AndroidWorld and MobileWorld. MobileWorld features harder tasks with more steps and more cross-app workflows, and the best reported (SOTA) accuracy on it is lower, reflecting its increased difficulty.

GUI Grounding Performance

ScreenSpot-Pro

Model	Avg
Gemini-3-Pro	72.7
Seed1.8	73.1
GTA1-7B	50.1
UI-Venus-7B	50.8
GUI-Owl-7B	54.9
GUI-Owl-32B	58.0
GTA1-32B	63.6
UI-Venus-72B	61.9
MAI-UI-2B	57.4
+ Zoom-In	62.8
MAI-UI-8B	65.8
+ Zoom-In	70.9
MAI-UI-32B	67.9
+ Zoom-In	73.5

UI-Vision

Model	Avg
InfiGUI-G1-3B	22.0
OS-Atlas-7B	9.0
UI-TARS-1.5-7B	22.3
UI-Venus-7B	26.5
InfiGUI-G1-7B	26.1
Phi-Ground	27.2
UI-TARS-72B	25.5
UI-Venus-72B	36.8
MAI-UI-2B	30.3
+ Zoom-In	31.9
MAI-UI-8B	40.7
+ Zoom-In	42.4
MAI-UI-32B	47.1
+ Zoom-In	49.2

MMBench-GUI L2

Model	Avg
InfiGUI-G1-3B	73.4
OS-Atlas-7B	41.4
UI-TARS-1.5-7B	64.3
UGround-V1-7B	65.7
GTA1-7B	78.5
GUI-Owl-7B	80.5
InfiGUI-G1-7B	80.8
GUI-Owl-32B	83.0
GTA1-32B	83.4
UI-TARS-DPO-72B	74.3
InternVL3-78B	72.2
MAI-UI-2B	82.6
MAI-UI-8B	88.8
MAI-UI-32B	91.3

OSWorld-G

Agent Model	Avg
UI-TARS-1.5-7B	52.8
GTA1-7B	55.1
GUI-Owl-7B	55.9
UI-Venus-7B	58.8
OpenCUA-32B	59.6
GUI-Owl-32B	58.0
GTA1-32B	65.2
UI-Venus-72B	70.4
MAI-UI-2B	52.0
+ Zoom-In	55.9
MAI-UI-8B	60.1
+ Zoom-In	64.2
MAI-UI-32B	67.6
+ Zoom-In	70.9

OSWorld-G-Refine

Agent Model	Avg
Operator	57.8
Jedi-3B	61.0
Jedi-7B	63.8
UI-TARS-1.5-7B	64.2
GTA1-7B	67.7
Qwen2.5-VL-32B	59.6
OpenCUA-32B	70.2
GTA1-32B	72.2
MAI-UI-2B	63.5
+ Zoom-In	66.3
MAI-UI-8B	68.6
+ Zoom-In	72.9
MAI-UI-32B	73.9
+ Zoom-In	75.0

ScreenSpot-V2

Model	Avg
Phi-Ground	83.8
OS-Atlas-7B	85.1
UI-TARS-1.5-7B	89.0
OpenCUA-7B	92.3
GTA1-7B	92.4
GUI-Owl-7B	92.8
UI-Venus-7B	94.1
GUI-Owl-32B	93.2
OpenCUA-32B	93.4
GTA1-32B	95.2
UI-Venus-72B	95.3
MAI-UI-2B	92.5
MAI-UI-8B	95.2
MAI-UI-32B	96.5
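
Four of the grounding tables above (ScreenSpot-Pro, UI-Vision, OSWorld-G, and OSWorld-G-Refine) report a "+ Zoom-In" row under each MAI-UI model. The following is a minimal sketch of how the size of that improvement can be read off the tables. The scores are copied from the tables; the variable and function names are hypothetical, and the script only does arithmetic on reported numbers, it does not reproduce any evaluation.

```python
# (score without Zoom-In, score with Zoom-In), copied from the tables above.
# Only the benchmarks that list a "+ Zoom-In" row are included.
ZOOM_IN_SCORES = {
    "ScreenSpot-Pro": {
        "MAI-UI-2B": (57.4, 62.8),
        "MAI-UI-8B": (65.8, 70.9),
        "MAI-UI-32B": (67.9, 73.5),
    },
    "UI-Vision": {
        "MAI-UI-2B": (30.3, 31.9),
        "MAI-UI-8B": (40.7, 42.4),
        "MAI-UI-32B": (47.1, 49.2),
    },
    "OSWorld-G": {
        "MAI-UI-2B": (52.0, 55.9),
        "MAI-UI-8B": (60.1, 64.2),
        "MAI-UI-32B": (67.6, 70.9),
    },
    "OSWorld-G-Refine": {
        "MAI-UI-2B": (63.5, 66.3),
        "MAI-UI-8B": (68.6, 72.9),
        "MAI-UI-32B": (73.9, 75.0),
    },
}


def zoom_in_gains(scores):
    """Return the per-benchmark, per-model gain in points, rounded to 0.1."""
    return {
        bench: {model: round(after - before, 1)
                for model, (before, after) in rows.items()}
        for bench, rows in scores.items()
    }


def mean_gain(scores, model):
    """Average Zoom-In gain for one model across the listed benchmarks."""
    deltas = [rows[model][1] - rows[model][0] for rows in scores.values()]
    return sum(deltas) / len(deltas)


for bench, row in zoom_in_gains(ZOOM_IN_SCORES).items():
    print(bench, row)
for model in ("MAI-UI-2B", "MAI-UI-8B", "MAI-UI-32B"):
    print(model, "mean gain:", round(mean_gain(ZOOM_IN_SCORES, model), 2))
```

On these numbers the gain is positive for every model on every listed benchmark, and it is largest on ScreenSpot-Pro (about 5 points for each model size).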

GUI Navigation Performance

AndroidWorld

Model	Params	Success Rate
Qwen3-VL-2B	2B	36.4
UI-TARS-1.5-7B	7B	30.0
UI-Venus-7B	7B	49.1
GUI-Owl-7B	7B	66.4
Step-GUI-8B	8B	67.7
Qwen3-VL-8B	8B	47.6
Qwen3-VL-32B	32B	57.3
UI-Venus-72B	72B	65.9
Qwen3-VL-235B-A22B	235B	63.7
UI-TARS-1.5	-	64.2
Gemini-2.5-Pro	-	69.7
Seed1.8	-	70.7
UI-TARS-2	230B	73.3
MAI-UI-2B	2B	49.1
MAI-UI-8B	8B	70.7
MAI-UI-32B	32B	73.3
MAI-UI-235B-A22B	235B	76.7

MobileWorld

Model	GUI-Only (116)	User-Int. (45)	MCP (40)	Overall
Agentic Framework
Claude-4.5-Sonnet + UI-Ins-7B	47.8	37.8	50.0	43.8
Gemini-3-Pro + UI-Ins-7B	55.6	24.4	48.6	46.3
GPT-5 + UI-Ins-7B	54.0	62.2	51.6	51.7
End-to-End Model
GUI-Owl-7B	7.7	-	-	4.5
GUI-Owl-32B	8.5	-	-	5.5
UI-Venus-7B	8.5	2.3	-	5.5
UI-Venus-72B	16.4	4.7	-	10.4
Qwen3-VL-8B	9.4	0.0	0.0	5.5
Qwen3-VL-32B	11.9	6.7	2.7	9.0
Qwen3-VL-235B-A22B	12.8	4.4	5.4	9.5
Doubao-1.5-UI-TARS	26.3	32.4	-	20.9
Ours
MAI-UI-8B	27.5	22.2	20.0	24.9
MAI-UI-32B	36.2	46.7	30.0	37.3
MAI-UI-235B-A22B	39.7	51.1	37.5	41.7
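
The column headers give the number of tasks in each category (116 GUI-only, 45 user-interaction, 40 MCP), which sum to the 201 MobileWorld tasks. As a rough consistency check, the sketch below assumes that Overall is the task-count-weighted average of the three category success rates. That assumption is ours, not stated on this page; it reproduces the MAI-UI rows to within about 0.1 points, and it should not be expected to hold for rows with missing entries ("-") or for rows that may have been scored on a different task subset. The function and variable names are hypothetical.

```python
# Task counts per category, taken from the MobileWorld table header.
TASK_COUNTS = {"gui_only": 116, "user_int": 45, "mcp": 40}

# Per-category success rates (%) and the reported Overall for the MAI-UI rows.
MAI_UI_ROWS = {
    "MAI-UI-8B": ({"gui_only": 27.5, "user_int": 22.2, "mcp": 20.0}, 24.9),
    "MAI-UI-32B": ({"gui_only": 36.2, "user_int": 46.7, "mcp": 30.0}, 37.3),
    "MAI-UI-235B-A22B": ({"gui_only": 39.7, "user_int": 51.1, "mcp": 37.5}, 41.7),
}


def weighted_overall(rates, counts=TASK_COUNTS):
    """Task-count-weighted mean of per-category success rates (assumed formula)."""
    total = sum(counts.values())
    return sum(rates[name] * counts[name] for name in counts) / total


for model, (rates, reported) in MAI_UI_ROWS.items():
    estimate = weighted_overall(rates)
    # Small gaps are expected because the table values are already rounded.
    print(f"{model}: estimated {estimate:.2f}, reported {reported}")
```

The residual differences come from recomputing with rates that were already rounded to one decimal place.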

Citation

If you find MAI-UI useful in your research, please cite our papers:

@article{zhou2025mai,
  title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
  author={Zhou, Hanzhang and Zhang, Xu and Tong, Panrong and Zhang, Jianan and Chen, Liangyu and Kong, Quyu and Cai, Chenglin and Liu, Chen and Wang, Yue and Zhou, Jingren and others},
  journal={arXiv preprint arXiv:2512.22047},
  year={2025}
}

@article{kong2025mobileworld,
  title={MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments},
  author={Kong, Quyu and Zhang, Xu and Yang, Zhenyu and Gao, Nolan and Liu, Chen and Tong, Panrong and Cai, Chenglin and Zhou, Hanzhang and Zhang, Jianan and Chen, Liangyu and others},
  journal={arXiv preprint arXiv:2512.19432},
  year={2025}
}

@article{chen2025ui,
  title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
  author={Chen, Liangyu and Zhou, Hanzhang and Cai, Chenglin and Zhang, Jianan and Tong, Panrong and Kong, Quyu and Zhang, Xu and Liu, Chen and Liu, Yuqi and Wang, Wenxuan and others},
  journal={arXiv preprint arXiv:2510.20286},
  year={2025}
}