Blog

Research updates and benchmark analysis from the MobileWorld team.

Evaluating two model upgrades on MobileWorld

We ran two recent model upgrades — Anthropic's Opus 4.6 → 4.7 and Moonshot's Kimi K2.5 → K2.6 — through MobileWorld, our 161-task benchmark for mobile GUI agents. Opus 4.7 outperforms 4.6 by 15.5pp, and Kimi K2.6 outperforms K2.5 by 9.3pp. In both cases, the single largest source of improvement is loop-breaking.

Evaluation Claude Opus Kimi K2

MobileWorld Update: Can Frontier Models Really Control Your Phone?

We adapted Gemini 3 Pro, Claude Sonnet 4.5, Kimi K2.5, Qwen-3.5, and Seed-2.0-Pro for end-to-end mobile-use evaluation on MobileWorld — no separate grounding module. Seed-2.0-Pro tops the leaderboard at 62.7%, surpassing the best agentic framework. We share the configurations, cost breakdown, and a 6-step recipe to run frontier LLMs on a real phone.

End-to-End Frontier Models Real Devices

Why Does Your AI Agent Keep Misclicking? A Practical Guide to GUI Grounding

Reproducing reported GUI grounding numbers takes more than the model — it also takes the right coordinate system, image resizing, and zoom-in tool. We benchmark Gemini-3-Pro, Claude Sonnet 4.5, Seed1.8, Kimi-K2.5, and MAI-UI on OSWorld-G and ScreenSpot-Pro, and document the configurations that close gaps as large as 33 points.

GUI Grounding OSWorld-G ScreenSpot-Pro