⚔️ LLM Models

Which model wins? Real benchmarks and head-to-head comparisons

📅 Last updated: 2026-06-10

SWE-bench Verified

Solving real GitHub issues judged by unit tests — the key coding-agent metric

Claude Opus 4.5: 80.9%

Claude Sonnet 4.5: 77.2%

GPQA Diamond

PhD-level, Google-proof science questions — measures depth

Claude Opus 4.5: 87%

Claude Sonnet 4.5: 83.4%

AIME 2025

Math olympiad — pure multi-step reasoning

Claude Opus 4.5: no published score

Claude Sonnet 4.5: 87%

Terminal-bench

Real terminal tasks — the closest benchmark to actual agent work

Claude Opus 4.5: no published score

Claude Sonnet 4.5: 50%

🍋 Verdict

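Claude Opus 4.5 wins 2 of the 2 benchmarks where both models have a published score (SWE-bench Verified and GPQA Diamond). AIME 2025 and Terminal-bench are not counted because Opus 4.5 has no published score on them.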
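
The tally behind the verdict can be reproduced with a minimal sketch. It assumes that a benchmark is skipped whenever either model lacks a published score; the page does not state its counting rule, so this is inferred from the "2 of 2" result. The function name `head_to_head` and the `SCORES` table are illustrative, with values copied from the comparison above.

```python
from typing import Optional

# Scores copied from the comparison above; None means "no published score".
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-bench Verified": {"Claude Opus 4.5": 80.9, "Claude Sonnet 4.5": 77.2},
    "GPQA Diamond": {"Claude Opus 4.5": 87.0, "Claude Sonnet 4.5": 83.4},
    "AIME 2025": {"Claude Opus 4.5": None, "Claude Sonnet 4.5": 87.0},
    "Terminal-bench": {"Claude Opus 4.5": None, "Claude Sonnet 4.5": 50.0},
}


def head_to_head(model_a: str, model_b: str, scores=SCORES):
    """Count wins per model over benchmarks where both have a published score.

    Assumption: benchmarks with a missing score on either side are skipped,
    and an exact tie counts as a win for neither model.
    """
    wins = {model_a: 0, model_b: 0}
    compared = 0
    for results in scores.values():
        a, b = results.get(model_a), results.get(model_b)
        if a is None or b is None:
            continue  # not comparable, so not part of the tally
        compared += 1
        if a > b:
            wins[model_a] += 1
        elif b > a:
            wins[model_b] += 1
    return wins, compared


wins, compared = head_to_head("Claude Opus 4.5", "Claude Sonnet 4.5")
print(wins, compared)  # {'Claude Opus 4.5': 2, 'Claude Sonnet 4.5': 0} 2
```

Under this rule the result says nothing about AIME 2025 or Terminal-bench, where only Claude Sonnet 4.5 has a published score.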

All models

Anthropic

Claude Opus 4.5

released 2025-11 · 200K ctx

Anthropic's flagship — highest published SWE-bench Verified score at release, built for long complex work.

swe-bench-verified: 80.9% · gpqa-diamond: 87%

Claude Sonnet 4.5

released 2025-09 · 200K (1M beta) ctx

Anthropic's best price/performance balance — the daily workhorse for most coder agents.

swe-bench-verified: 77.2% · gpqa-diamond: 83.4% · aime-2025: 87%

OpenAI

GPT-5

released 2025-08 · 400K ctx

OpenAI's flagship with fast/deep routing — very strong at math and reasoning.

swe-bench-verified: 74.9% · gpqa-diamond: 85.7% · aime-2025: 94.6%

OpenAI (open-weight)

gpt-oss-120b

released 2025-08 · 128K ctx

OpenAI's first open-weight model in years — Apache 2.0, runs locally or via Ollama Cloud. Surprising performance for its size and cost.

swe-bench-verified: 62.4% · gpqa-diamond: 80.1%

Google

Gemini 3 Pro

released 2025-11 · 1M ctx

Google's trump card: the highest published GPQA Diamond score of its generation and a 1M-token context.

swe-bench-verified: 76.2% · gpqa-diamond: 91.9% · aime-2025: 95%

Moonshot AI

Kimi K2 Thinking

released 2025-11 · 256K ctx

The Chinese open giant — trillion-parameter MoE delivering top-tier performance at a much lower API price.

swe-bench-verified: 71.3% · gpqa-diamond: 84.5%