LLM Models
Which model wins? Real benchmarks and head-to-head comparisons
Last updated: 2026-06-10
SWE-bench Verified
Solving real GitHub issues, judged by unit tests: the key coding-agent metric
GPQA Diamond
PhD-level, Google-proof science questions: a measure of depth
AIME 2025
Olympiad-style competition math: pure multi-step reasoning
Terminal-bench
Real terminal tasks: the closest benchmark to actual agent work
Verdict
Claude Opus 4.5 wins 2 of 2 benchmarks
All models
Anthropic
Claude Opus 4.5
released 2025-11 · 200K ctx
Anthropic's flagship: the highest published SWE-bench Verified score at release, built for long, complex work.
Claude Sonnet 4.5
released 2025-09 · 200K (1M beta) ctx
Anthropic's best price/performance balance: the daily workhorse for most coding agents.
OpenAI
GPT-5
released 2025-08 · 400K ctx
OpenAI's flagship with fast/deep routing: very strong at math and reasoning.
OpenAI (open-weight)
gpt-oss-120b
released 2025-08 · 128K ctx
OpenAI's first open-weight model in years: Apache 2.0 licensed, runs locally or via Ollama Cloud. Surprisingly strong performance for its size and cost.
Google
Gemini 3 Pro
released 2025-11 · 1M ctx
Google's strongest entry: the highest published GPQA Diamond score of its generation and a 1M-token context window.
Moonshot AI
Kimi K2 Thinking
released 2025-11 · 256K ctx
China's open-weight giant: a trillion-parameter MoE delivering top-tier performance at a much lower API price.