QED Bench

The destination is just the beginning.

OpenAI
Gemini
Grok
Claude
Qwen
Kimi
Meta AI
StepFun
Moonshot AI
DeepSeek

About

QED Bench evaluates frontier models across competitive math and rigorous proof-based benchmarks. While we feature a traditional leaderboard, we contextualize each score by analyzing the underlying reasoning required to achieve it.

Outcome-based contests like AIME, HMMT, PUMaC, and Putnam provide the foundation. They offer well-designed problems for measuring whether models can move beyond the destination of a correct answer and toward reasoning we can actually trust.

That foundation is what allows the methodology to generalize to proof-based contests like the USAMO and IMO. Only by building it carefully can we set the stage for evaluating graduate-level, PhD-level, and eventually research mathematics.

The proof is in the data.

High score, weak reasoning.

Accuracy can outpace veracity.

Wrong answers hinder true evaluation.

Binary scoring can overshadow valid progress.

The journey tells the story.

Long-form reasoning exposes what exact match can hide.

Math coverage
AIME / HMMT / PUMaC / Putnam / USAMO / IMO
10 boards across 2025 and 2026.
QMI frontier
GPT-5.4 Pro (xhigh)
QMI 158.2
Current math leader
GPT-5.2 (xhigh)
99.72% across 89.75 / 90.
Math models
74
Visible across the tracked competition boards.
Field average
68.04%
Across the current math leaderboard scope.
Most represented provider
OpenAI
By model count in the current snapshot.

Front row

The names people will look for first.
#1 GPT-5.4 (xhigh) · OpenAI · Closed
USAMO 2026: 95.24%
#2 GPT-5.2 (xhigh) · OpenAI · Closed
HMMT Nov 2025: 99.17%
#3 Gemini 3.1 Pro Preview · Google · Closed
USAMO 2026: 74.40%
#4 GLM 5 · Z.ai · Open
USAMO 2026: 35.12%
#5 Step 3.5 Flash · StepFun · Open
USAMO 2026: 44.64%
#6 Claude-Opus-4.6 (High) · Anthropic · Closed
USAMO 2026: 50%
#7 Kimi K2.5 (Think) · Moonshot AI · Open
HMMT Feb 2026: 87.12%
#8 GPT-5.2 (high) · OpenAI · Closed
HMMT Feb 2026: 96.97%

Math coverage

Outcome-based contests build the base. Proof-based contests test whether it holds.
AIME 2025 I: 71.52% average
15 scored problems. Top model: GPT-5.2 (high). 61 models evaluated.
AIME 2025 II: 76.94% average
15 scored problems. Top model: DeepSeek-v3.2-Speciale. 61 models evaluated.
HMMT Feb 2025: 64.90% average
30 scored problems. Top model: GPT-5.2 (xhigh). 60 models evaluated.
HMMT Nov 2025: 88.59% average
30 scored problems. Top model: GPT-5.2 (xhigh). 23 models evaluated.
IMO 2025: 20.11% average
6 scored problems. Top model: GPT-5 (high). 7 models evaluated.
USAMO 2025: 7.83% average
5 scored problems. Top model: DeepSeek-R1-0528. 10 models evaluated.
AIME 2026 I: 91.89% average
15 scored problems. Top model: GPT-5.4 (xhigh). 19 models evaluated.
AIME 2026 II: 94.33% average
15 scored problems. Top model: Claude-Opus-4.6 (High). 19 models evaluated.
HMMT Feb 2026: 83.85% average
33 scored problems. Top model: GPT-5.4 (xhigh). 19 models evaluated.
USAMO 2026: 55.95% average
6 scored problems. Top model: GPT-5.4 (xhigh). 6 models evaluated.

Math leaderboard

Aggregate the selected family, or keep it on all math for the broader picture.
#1 GPT-5.2 (xhigh) · Front row · OpenAI · Closed
openai/gpt-52
2025 I 100% · 2025 II 100% · Feb 2025 100% · Nov 2025 99.17%
99.72% (89.75 / 90 pts)
#2 GPT-5.4 (xhigh) · Front row · OpenAI · Closed
openai/gpt-54
2026 I 98.34% · 2026 II 100% · Feb 2026 97.73% · USAMO 2026 95.24%
98.13% (67.71 / 69 pts)
#3 GPT-5.2 (high) · Front row · OpenAI · Closed
openai/gpt-52-high
2025 I 100% · 2025 II 100% · Feb 2025 98.33% · Nov 2025 95.83% · 2026 I 96.66% · 2026 II 100%
97.88% (149.75 / 153 pts)
#4 DeepSeek-v3.2-Speciale · DeepSeek · Open
deepseek/deepseek_v32_special
2025 I 91.66% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33%
95.56% (86.0 / 90 pts)
#5 Gemini 3 Flash · Google · Closed
gemini/gemini-3-flash
2025 I 95% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33% · 2026 I 94.44% · 2026 II 97.22%
94.61% (144.75 / 153 pts)
#6 Gemini 3.1 Pro Preview · Front row · Google · Closed
gemini/gemini-31-pro
2026 I 96.66% · 2026 II 100% · Feb 2026 94.70% · USAMO 2026 74.40%
94.51% (65.21 / 69 pts)
#7 Claude-Opus-4.6 (High) · Front row · Anthropic · Closed
anthropic/opus_46
2026 I 93.33% · 2026 II 100% · Feb 2026 96.21% · USAMO 2026 50%
93.66% (62.75 / 67 pts)
#8 GPT-5.1 (high) · OpenAI · Closed
openai/gpt-51
2025 I 90.84% · 2025 II 97.50% · Feb 2025 93.33% · Nov 2025 91.67%
93.07% (83.76 / 90 pts)
#9 Step 3.5 Flash · Front row · StepFun · Open
stepfun/3.5-flash
2025 I 96.66% · 2025 II 100% · Feb 2025 98.33% · Nov 2025 94.17% · 2026 I 97.78% · 2026 II 95.56%
93.03% (146.98 / 158 pts)
#10 Gemini 3 Pro (preview) · Google · Closed
gemini/riftrunner
2025 I 90% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33% · 2026 I 90% · 2026 II 93.34%
92.65% (141.75 / 153 pts)
#11 GLM 4.6 · Z.ai · Open
glm/glm-46
2025 I 88.89% · 2025 II 94.45% · Feb 2025 93.33% · Nov 2025 91.67%
92.22% (83.0 / 90 pts)
#12 Kimi K2.5 (Think) · Front row · Moonshot AI · Open
moonshot/k25
2025 I 91.66% · 2025 II 100% · Feb 2025 93.33% · Nov 2025 89.17% · 2026 I 93.33% · 2026 II 98.33%
92.16% (141.0 / 153 pts)
#13 GLM 5 · Front row · Z.ai · Open
glm/glm-5
2025 I 93.34% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 94.17% · 2026 I 91.66% · 2026 II 100%
91.74% (145.86 / 159 pts)
#14 Kimi K2 Thinking · Moonshot AI · Open
moonshot/k2-thinking
2025 I 87.78% · 2025 II 97.22% · Feb 2025 93.33% · Nov 2025 89.17%
91.67% (82.5 / 90 pts)
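QED Bench does not publish its scoring pipeline, so the following is a minimal sketch under one assumption: that the aggregate percentage column is simply total points earned over total points available across a model's boards. The per-board point pairs below are back-derived from the published GPT-5.2 (xhigh) row (89.75 / 90 pts across four 2025 boards), and the function name `aggregate` is ours, not QED Bench's.

```python
def aggregate(boards):
    """Aggregate percentage from (points_earned, points_possible) pairs.

    Assumes the leaderboard column is a plain points ratio, not a
    per-board average of percentages.
    """
    earned = sum(e for e, _ in boards)
    possible = sum(p for _, p in boards)
    return 100.0 * earned / possible

# GPT-5.2 (xhigh): AIME 2025 I, AIME 2025 II, HMMT Feb 2025, HMMT Nov 2025.
boards = [(15.0, 15), (15.0, 15), (30.0, 30), (29.75, 30)]
print(f"{aggregate(boards):.2f}%")  # prints 99.72%
```

The same arithmetic reproduces the other rows we can check (e.g. 67.71 / 69 gives 98.13% for GPT-5.4 (xhigh)), which is why the assumption seems safe; it also explains why a board-heavy model can rank below one with fewer, cleaner boards.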

Board focus

Switch boards and see who clears that exact paper.
Board average: 94.33%
Best model: Claude-Opus-4.6 (High)

Model explorer

Follow one model across every board that currently lands in this snapshot.
OpenAI · Closed model · Families: AIME, HMMT, USAMO · Scope: 98.13% across 67.71 / 69

Coverage

Read each result as one rung in a broader reasoning ladder.

AIME, HMMT, PUMaC, and Putnam sharpen outcome-based evaluation on strong contest problems.

USAMO and IMO ask whether that strength survives into proof.

That progression is what prepares evaluation for graduate and research mathematics.

Frontier performance across benchmarks

Where the frontier moves.

QED Math Index

Release date versus capability. Labels stay on the moves that matter.
342 releases shown from 2023 to 2026
Legend: Frontier · OpenAI · Google · Anthropic · Meta AI · xAI · Other


QED Math Index leaders

The latest visible moves at the top of the QMI curve.
#1 GPT-5.4 Pro (xhigh) · OpenAI · API access
2026-03-05 · QMI 158.2
#2 Gemini 3.1 Pro Preview · Google DeepMind · API access
2026-02-19 · QMI 157.1
#3 Claude Opus 4.6 (120k thinking) · Anthropic · API access
2026-02-05 · QMI 155.1
#4 GPT-5.2 (high) · OpenAI · API access
2025-12-11 · QMI 153.8
#5 Gemini 3 Pro Preview · Google DeepMind · API access
2025-11-18 · QMI 153.4
#6 GPT-5 Pro · OpenAI · API access
2025-10-07 · QMI 150.3