📡Learn-in-Depth
📅 2026-06-10 · 2 min read

The benchmark arms race — numbers that move monthly, and what they really mean

#benchmarks #swe-bench #gpqa #evaluation

Every new model launch leads with a benchmark slide, each number beating the last. The question that should stop you: how are these numbers measured, and who checks whether they are being gamed?

The three benchmarks that matter most in agent-land:

SWE-bench Verified: the king of coding evals. Its genius is its simplicity: take real issues from real GitHub repos, not synthetic puzzles, let the model solve them, and let the project's own unit tests be the judge. The Verified edition is a set of 500 human-reviewed problems, created after broken tasks were found in the original set. Reported scores moved from ~4% in 2023 to above 80% for today's top models, one of the fastest capability jumps the field has recorded.
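The "tests are the judge" rule can be sketched roughly as follows. This is an illustrative simplification, not the official SWE-bench harness: the function names and the shape of `test_results` are assumptions made for this example. The idea is that a task counts as solved only if the tests that failed before the fix now pass, and the tests that already passed are not broken.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True if a patch counts as solving the task.

    test_results: dict mapping test name -> True (passed) / False (failed),
                  collected after the model's patch is applied (assumed shape).
    fail_to_pass: tests that failed before the fix and must pass now.
    pass_to_pass: tests that passed before the fix and must still pass.
    A test missing from test_results is treated as failed.
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    unbroken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


def resolve_rate(outcomes):
    """Percentage of tasks resolved, given a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical results for one task: the bug test now passes,
# and the pre-existing test still passes.
results = {"test_bug_is_fixed": True, "test_existing_behavior": True}
print(is_resolved(results, ["test_bug_is_fixed"], ["test_existing_behavior"]))
```

The headline score is then just the share of tasks for which this check comes back true, which is why a single flaky or mis-specified test can change the result for a task.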

GPQA Diamond: PhD-level science questions deliberately built to be Google-proof. It measures depth of reasoning and inference, not recall.

AIME (American Invitational Mathematics Examination): competition problems from a qualifying round for the US math olympiad. Pure multi-step reasoning: one slip at step 3 kills the answer.

So just pick the highest number? No, and here's the catch. First, contamination: the model may have seen the problems during training, so memorization can pose as understanding. Second, a 2–3 point gap between two models is often within noise. Third, and most important, benchmarks measure the model in a lab, while you run it in real life: legacy codebases, missing requirements, people changing their minds. Note also that SWE-bench scores depend heavily on the harness driving the model, not on the model alone.
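To make the "within noise" point concrete, here is a back-of-the-envelope sketch. It assumes each task is an independent pass/fail trial and treats the two models' results as independent samples, which is a simplification (both models face the same tasks, so a paired comparison would be more precise, and run-to-run variance is ignored). The function names are made up for this example.

```python
import math


def standard_error(score, n_tasks):
    """Binomial standard error of a pass rate (score given as 0..1)."""
    return math.sqrt(score * (1.0 - score) / n_tasks)


def gap_within_noise(score_a, score_b, n_tasks, z=1.96):
    """True if the gap is smaller than the ~95% margin of the difference.

    Assumes both scores come from n_tasks independent pass/fail trials.
    """
    se_diff = math.sqrt(
        standard_error(score_a, n_tasks) ** 2
        + standard_error(score_b, n_tasks) ** 2
    )
    return abs(score_a - score_b) < z * se_diff


# An 80% score on a 500-problem benchmark carries roughly this margin:
margin = 1.96 * standard_error(0.80, 500)
print(f"±{100 * margin:.1f} points")  # ±3.5 points

# A 2-point gap at that level does not clear the margin:
print(gap_within_noise(0.80, 0.78, 500))  # True
```

Under these assumptions, a single 80% score on 500 problems already has a margin of about 3.5 points either way, so a 2–3 point lead on a slide says little by itself.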

Takeaways:

  1. Benchmarks are excellent as a first filter: a model scoring 30% on SWE-bench won't work miracles for you.
  2. Above a certain level, small gaps between top models are decided by harness and price, not points.
  3. The final judge is your own eval: a set of tasks from your real work that you rerun on every new model. One hour of setup saves months of regret.
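A personal eval of the kind described in the third takeaway can start very small. The sketch below is one possible shape, not a prescribed tool: `run_eval`, the task format, and the two stand-in "models" are all hypothetical. In practice each model entry would be a function that calls whatever API or local model you actually use.

```python
from typing import Callable, Dict, List, Tuple

# A task pairs a prompt with a checker that decides whether the output is acceptable.
Task = Tuple[str, Callable[[str], bool]]


def run_eval(
    models: Dict[str, Callable[[str], str]], tasks: List[Task]
) -> Dict[str, float]:
    """Run every task against every model; return the pass rate (0..1) per model."""
    scores = {}
    for name, generate in models.items():
        passed = 0
        for prompt, check in tasks:
            try:
                passed += bool(check(generate(prompt)))
            except Exception:
                pass  # a crash in the model call or the checker counts as a failure
        scores[name] = passed / len(tasks) if tasks else 0.0
    return scores


# Stand-in models for illustration only; replace with real model calls.
def model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else ""


def model_b(prompt: str) -> str:
    return "unknown"


# Replace these toy tasks with prompts and checks drawn from your real work.
tasks: List[Task] = [
    ("What is 2 + 2? Answer with the number only.", lambda out: out.strip() == "4"),
    ("Reply with the word OK.", lambda out: out.strip() == "OK"),
]

print(run_eval({"model_a": model_a, "model_b": model_b}, tasks))
```

Keeping the checkers as plain functions means the same task list can be rerun unchanged whenever a new model comes out, which is the whole point of the exercise.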

To apply that advice, we built the Models comparison page — pick two models, see their numbers head-to-head. Just remember while comparing: the number is the start of the story, not the end. The end is you — and what your work actually needs.
