Research updates and benchmark analysis from the MobileWorld team.
We ran two recent model upgrades, Anthropic's Opus 4.6 → 4.7 and Moonshot's Kimi K2.5 → K2.6, through MobileWorld, our 161-task benchmark for mobile GUI agents. Opus 4.7 outperforms 4.6 by 15.5pp; Kimi K2.6 outperforms K2.5 by 9.3pp. In both cases, the single largest source of improvement is loop-breaking.
We adapted Gemini 3 Pro, Claude Sonnet 4.5, Kimi K2.5, Qwen-3.5, and Seed-2.0-Pro for end-to-end mobile-use evaluation on MobileWorld — no separate grounding module. Seed-2.0-Pro tops the leaderboard at 62.7%, surpassing the best agentic framework. We share the configurations, cost breakdown, and a 6-step recipe to run frontier LLMs on a real phone.
Reproducing reported GUI grounding numbers takes more than the model — it takes the right coordinate system, image resize, and zoom-in tool. We benchmark Gemini-3-Pro, Claude Sonnet 4.5, Seed1.8, Kimi-K2.5, and MAI-UI on OSWorld-G and ScreenSpot-Pro, and document the configurations that close gaps as large as 33 points.
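To illustrate why the coordinate system and image resize matter for grounding, here is a minimal sketch of mapping a model's predicted click point back to pixels on the original screenshot. It assumes two common output conventions: coordinates on a normalized 0 to 1000 grid, and absolute pixels in a resized copy of the image. The names `Size`, `normalized_to_pixels`, and `resized_to_pixels` are hypothetical, and which convention a given model uses has to be checked per model; this is not the configuration from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def _clamp(px: int, py: int, size: Size) -> tuple[int, int]:
    # Keep the point inside the image bounds (valid indices are 0..dim-1).
    return (min(max(px, 0), size.width - 1), min(max(py, 0), size.height - 1))


def normalized_to_pixels(x: float, y: float, original: Size, scale: int = 1000) -> tuple[int, int]:
    """Map a point on an assumed 0..scale normalized grid to original-image pixels."""
    px = round(x / scale * original.width)
    py = round(y / scale * original.height)
    return _clamp(px, py, original)


def resized_to_pixels(x: float, y: float, resized: Size, original: Size) -> tuple[int, int]:
    """Map a pixel in the resized image (what the model saw) to original-image pixels."""
    px = round(x * original.width / resized.width)
    py = round(y * original.height / resized.height)
    return _clamp(px, py, original)


if __name__ == "__main__":
    screen = Size(1080, 2400)   # hypothetical phone screenshot
    model_input = Size(540, 1200)  # hypothetical resized copy sent to the model

    # The same on-screen target expressed under the two conventions.
    print(normalized_to_pixels(500, 250, screen))
    print(resized_to_pixels(270, 300, model_input, screen))
```

Applying the wrong convention (for example, treating normalized values as raw pixels) sends every click to the wrong place, which is one way a benchmark score can drop without the model itself being worse.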