Arena

Compare two models head-to-head on individual tasks. Select the models, filter tasks by outcome, then inspect the two trajectories side by side to see where their reasoning and navigation differ.
