Learn-in-Depth

LLM Models

Which model wins? Real benchmarks and head-to-head comparisons

Last updated: 2026-06-10

Claude Opus 4.5 vs Claude Sonnet 4.5

SWE-bench Verified

Solving real GitHub issues, judged by unit tests: the key coding-agent metric

Claude Opus 4.5: 80.9%
Claude Sonnet 4.5: 77.2%

GPQA Diamond

PhD-level, Google-proof science questions: measures depth

Claude Opus 4.5: 87%
Claude Sonnet 4.5: 83.4%

AIME 2025

Math olympiad: pure multi-step reasoning

Claude Opus 4.5: no published score
Claude Sonnet 4.5: 87%

Terminal-bench

Real terminal tasks: the closest benchmark to actual agent work

Claude Opus 4.5: no published score
Claude Sonnet 4.5: 50%

Verdict

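Claude Opus 4.5 wins 2 of 2 benchmarks where both models have a published score (SWE-bench Verified and GPQA Diamond). AIME 2025 and Terminal-bench cannot be compared because Opus 4.5 has no published score on either.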
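The verdict only counts benchmarks on which both models have a published score. A minimal sketch of that tally follows, assuming the scores listed above; the `SCORES` table and the `head_to_head` helper are illustrative names, not part of any published tool.

```python
from typing import Optional

# Scores copied from the comparison above; None means "no published score".
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-bench Verified": {"Claude Opus 4.5": 80.9, "Claude Sonnet 4.5": 77.2},
    "GPQA Diamond": {"Claude Opus 4.5": 87.0, "Claude Sonnet 4.5": 83.4},
    "AIME 2025": {"Claude Opus 4.5": None, "Claude Sonnet 4.5": 87.0},
    "Terminal-bench": {"Claude Opus 4.5": None, "Claude Sonnet 4.5": 50.0},
}


def head_to_head(scores, a: str, b: str):
    """Count wins per model over benchmarks where both have a score."""
    wins = {a: 0, b: 0}
    compared = 0
    for by_model in scores.values():
        score_a, score_b = by_model.get(a), by_model.get(b)
        if score_a is None or score_b is None:
            continue  # a missing score is skipped, not treated as a loss
        compared += 1
        if score_a > score_b:
            wins[a] += 1
        elif score_b > score_a:
            wins[b] += 1
    return wins, compared


wins, compared = head_to_head(SCORES, "Claude Opus 4.5", "Claude Sonnet 4.5")
leader = max(wins, key=wins.get)
print(f"{leader} wins {wins[leader]} of {compared} comparable benchmarks")
```

Skipping missing scores rather than counting them as losses is a design choice: it keeps the tally honest, but it also means a "2 of 2" result says nothing about the two benchmarks that were left out.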

All models

Anthropic

Claude Opus 4.5

released 2025-11 · 200K ctx

Anthropic's flagship: highest published SWE-bench Verified score at release, built for long, complex work.

swe-bench-verified: 80.9% · gpqa-diamond: 87%

Claude Sonnet 4.5

released 2025-09 · 200K (1M beta) ctx

Anthropic's best price/performance balance: the daily workhorse for most coding agents.

swe-bench-verified: 77.2% · gpqa-diamond: 83.4% · aime-2025: 87%

OpenAI

GPT-5

released 2025-08 · 400K ctx

OpenAI's flagship with fast/deep routing: very strong at math and reasoning.

swe-bench-verified: 74.9% · gpqa-diamond: 85.7% · aime-2025: 94.6%

OpenAI (open-weight)

gpt-oss-120b

released 2025-08 · 128K ctx

OpenAI's first open-weight model in years: Apache 2.0 licensed, runs locally or via Ollama Cloud. Surprisingly strong performance for its size and cost.

swe-bench-verified: 62.4% · gpqa-diamond: 80.1%

Google

Gemini 3 Pro

released 2025-11 · 1M ctx

Google's strongest contender: the highest published GPQA Diamond score of its generation and a 1M-token context.

swe-bench-verified: 76.2% · gpqa-diamond: 91.9% · aime-2025: 95%

Moonshot AI

Kimi K2 Thinking

released 2025-11 · 256K ctx

The Chinese open-weight giant: a trillion-parameter MoE delivering top-tier performance at a much lower API price.

swe-bench-verified: 71.3% · gpqa-diamond: 84.5%
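Because not every model card lists every benchmark, ranking the models takes a little care. The sketch below ranks the models above on a single benchmark and leaves out any model with no listed score; the `MODELS` table and the `rank` helper are hypothetical names used only for illustration.

```python
# Scores copied from the model cards above; a missing key means
# the card lists no score for that benchmark.
MODELS: dict[str, dict[str, float]] = {
    "Claude Opus 4.5": {"swe-bench-verified": 80.9, "gpqa-diamond": 87.0},
    "Claude Sonnet 4.5": {
        "swe-bench-verified": 77.2, "gpqa-diamond": 83.4, "aime-2025": 87.0,
    },
    "GPT-5": {
        "swe-bench-verified": 74.9, "gpqa-diamond": 85.7, "aime-2025": 94.6,
    },
    "gpt-oss-120b": {"swe-bench-verified": 62.4, "gpqa-diamond": 80.1},
    "Gemini 3 Pro": {
        "swe-bench-verified": 76.2, "gpqa-diamond": 91.9, "aime-2025": 95.0,
    },
    "Kimi K2 Thinking": {"swe-bench-verified": 71.3, "gpqa-diamond": 84.5},
}


def rank(models: dict[str, dict[str, float]], benchmark: str):
    """Return (model, score) pairs for one benchmark, best score first."""
    listed = [
        (name, scores[benchmark])
        for name, scores in models.items()
        if benchmark in scores  # skip models with no listed score
    ]
    return sorted(listed, key=lambda pair: pair[1], reverse=True)


for benchmark in ("swe-bench-verified", "gpqa-diamond", "aime-2025"):
    name, score = rank(MODELS, benchmark)[0]
    print(f"{benchmark}: {name} leads with {score}%")
```

Each benchmark has a different leader in this data, so a single overall ranking would hide more than it shows.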