QED Bench

The destination is just the beginning.

OpenAI
Gemini
Grok
Claude
Qwen
Kimi
Meta AI
StepFun
Moonshot AI
DeepSeek

About

QED Bench evaluates frontier models across competitive math and rigorous proof-based benchmarks. While we feature a traditional leaderboard, we contextualize each score by analyzing the underlying reasoning required to achieve it.

Outcome-based contests like AIME, HMMT, PUMaC, and Putnam provide the foundation. They offer well-designed problems for measuring whether models can move beyond the destination of a correct answer and toward reasoning we can actually trust.

That foundation is what allows the methodology to generalize to proof-based contests like the USAMO and IMO. Only by building it carefully can we set the stage for evaluating graduate-level, PhD-level, and eventually research mathematics.

The proof is in the data.

High score, weak reasoning.

Accuracy can outpace veracity.

Wrong answers hinder true evaluation.

Binary scoring can overshadow valid progress.

The journey tells the story.

Long-form reasoning exposes what exact match can hide.

Math coverage
AIME / HMMT / PUMaC / Putnam / USAMO / IMO
10 boards across 2025 and 2026.
QMI frontier
GPT-5.4 Pro (xhigh)
QMI 158.2
Current math leader
GPT-5.2 (xhigh)
99.72% across 89.75 / 90.
Math models
74
Visible across the tracked competition boards.
Field average
68.04%
Across the current math leaderboard scope.
Most represented provider
OpenAI
By model count in the current snapshot.

Front row

The names people will look for first.
#1 GPT-5.4 (xhigh) · OpenAI · Closed
USAMO 2026: 95.24%
#2 GPT-5.2 (xhigh) · OpenAI · Closed
HMMT Nov 2025: 99.17%
#3 Gemini 3.1 Pro Preview · Google · Closed
USAMO 2026: 74.40%
#4 GLM 5 · Z.ai · Open
USAMO 2026: 35.12%
#5 Step 3.5 Flash · StepFun · Open
USAMO 2026: 44.64%
#6 Claude-Opus-4.6 (High) · Anthropic · Closed
USAMO 2026: 50%
#7 Kimi K2.5 (Think) · Moonshot AI · Open
HMMT Feb 2026: 87.12%
#8 GPT-5.2 (high) · OpenAI · Closed
HMMT Feb 2026: 96.97%

Math coverage

Outcome-based contests build the base. Proof-based contests test whether it holds.
AIME 2025 I: 71.52% average
15 scored problems. Top model: GPT-5.2 (high). 61 models evaluated.
AIME 2025 II: 76.94% average
15 scored problems. Top model: DeepSeek-v3.2-Speciale. 61 models evaluated.
HMMT Feb 2025: 64.90% average
30 scored problems. Top model: GPT-5.2 (xhigh). 60 models evaluated.
HMMT Nov 2025: 88.59% average
30 scored problems. Top model: GPT-5.2 (xhigh). 23 models evaluated.
IMO 2025: 20.11% average
6 scored problems. Top model: GPT-5 (high). 7 models evaluated.
USAMO 2025: 7.83% average
5 scored problems. Top model: DeepSeek-R1-0528. 10 models evaluated.
AIME 2026 I: 91.89% average
15 scored problems. Top model: GPT-5.4 (xhigh). 19 models evaluated.
AIME 2026 II: 94.33% average
15 scored problems. Top model: Claude-Opus-4.6 (High). 19 models evaluated.
HMMT Feb 2026: 83.85% average
33 scored problems. Top model: GPT-5.4 (xhigh). 19 models evaluated.
USAMO 2026: 55.95% average
6 scored problems. Top model: GPT-5.4 (xhigh). 6 models evaluated.

Math leaderboard

Aggregate the selected family, or keep it on all math for the broader picture.
#1 GPT-5.2 (xhigh) · Front row · OpenAI · Closed
openai/gpt-52
2025 I 100% · 2025 II 100% · Feb 2025 100% · Nov 2025 99.17%
99.72% (89.75 / 90 pts)
#2 GPT-5.4 (xhigh) · Front row · OpenAI · Closed
openai/gpt-54
2026 I 98.34% · 2026 II 100% · Feb 2026 97.73% · USAMO 2026 95.24%
98.13% (67.71 / 69 pts)
#3 GPT-5.2 (high) · Front row · OpenAI · Closed
openai/gpt-52-high
2025 I 100% · 2025 II 100% · Feb 2025 98.33% · Nov 2025 95.83% · 2026 I 96.66% · 2026 II 100%
97.88% (149.75 / 153 pts)
#4 DeepSeek-v3.2-Speciale · DeepSeek · Open
deepseek/deepseek_v32_special
2025 I 91.66% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33%
95.56% (86.0 / 90 pts)
#5 Gemini 3 Flash · Google · Closed
gemini/gemini-3-flash
2025 I 95% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33% · 2026 I 94.44% · 2026 II 97.22%
94.61% (144.75 / 153 pts)
#6 Gemini 3.1 Pro Preview · Front row · Google · Closed
gemini/gemini-31-pro
2026 I 96.66% · 2026 II 100% · Feb 2026 94.70% · USAMO 2026 74.40%
94.51% (65.21 / 69 pts)
#7 Claude-Opus-4.6 (High) · Front row · Anthropic · Closed
anthropic/opus_46
2026 I 93.33% · 2026 II 100% · Feb 2026 96.21% · USAMO 2026 50%
93.66% (62.75 / 67 pts)
#8 GPT-5.1 (high) · OpenAI · Closed
openai/gpt-51
2025 I 90.84% · 2025 II 97.50% · Feb 2025 93.33% · Nov 2025 91.67%
93.07% (83.76 / 90 pts)
#9 Step 3.5 Flash · Front row · StepFun · Open
stepfun/3.5-flash
2025 I 96.66% · 2025 II 100% · Feb 2025 98.33% · Nov 2025 94.17% · 2026 I 97.78% · 2026 II 95.56%
93.03% (146.98 / 158 pts)
#10 Gemini 3 Pro (preview) · Google · Closed
gemini/riftrunner
2025 I 90% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 93.33% · 2026 I 90% · 2026 II 93.34%
92.65% (141.75 / 153 pts)
#11 GLM 4.6 · Z.ai · Open
glm/glm-46
2025 I 88.89% · 2025 II 94.45% · Feb 2025 93.33% · Nov 2025 91.67%
92.22% (83.0 / 90 pts)
#12 Kimi K2.5 (Think) · Front row · Moonshot AI · Open
moonshot/k25
2025 I 91.66% · 2025 II 100% · Feb 2025 93.33% · Nov 2025 89.17% · 2026 I 93.33% · 2026 II 98.33%
92.16% (141.0 / 153 pts)
#13 GLM 5 · Front row · Z.ai · Open
glm/glm-5
2025 I 93.34% · 2025 II 100% · Feb 2025 97.50% · Nov 2025 94.17% · 2026 I 91.66% · 2026 II 100%
91.74% (145.86 / 159 pts)
#14 Kimi K2 Thinking · Moonshot AI · Open
moonshot/k2-thinking
2025 I 87.78% · 2025 II 97.22% · Feb 2025 93.33% · Nov 2025 89.17%
91.67% (82.5 / 90 pts)
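QED Bench does not publish its scoring pipeline, so the following is a minimal sketch under one assumption: that the aggregate percentage column is simply total points earned over total points available across a model's boards. The per-board point pairs below are back-derived from the published GPT-5.2 (xhigh) row (89.75 / 90 pts across four 2025 boards), and the function name `aggregate` is ours, not QED Bench's.

```python
def aggregate(boards):
    """Aggregate percentage from (points_earned, points_possible) pairs.

    Assumes the leaderboard column is a plain points ratio, not a
    per-board average of percentages.
    """
    earned = sum(e for e, _ in boards)
    possible = sum(p for _, p in boards)
    return 100.0 * earned / possible

# GPT-5.2 (xhigh): AIME 2025 I, AIME 2025 II, HMMT Feb 2025, HMMT Nov 2025.
boards = [(15.0, 15), (15.0, 15), (30.0, 30), (29.75, 30)]
print(f"{aggregate(boards):.2f}%")  # prints 99.72%
```

The same arithmetic reproduces the other rows we can check (e.g. 67.71 / 69 gives 98.13% for GPT-5.4 (xhigh)), which is why the assumption seems safe; it also explains why a board-heavy model can rank below one with fewer, cleaner boards.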

Board focus

Switch boards and see who clears that exact paper.
Board average: 94.33%
Best model: Claude-Opus-4.6 (High)

Model explorer

Follow one model across every board that currently lands in this snapshot.
OpenAI · Closed model · Families: AIME, HMMT, USAMO · Scope: 98.13% across 67.71 / 69

Coverage

Read each result as one rung in a broader reasoning ladder.

AIME, HMMT, PUMaC, and Putnam sharpen outcome-based evaluation on strong contest problems.

USAMO and IMO ask whether that strength survives into proof.

That progression is what prepares evaluation for graduate and research mathematics.

Frontier performance across benchmarks

Where the frontier moves.

QED Math Index

Release date versus capability. Labels stay on the moves that matter.
342 releases shown from 2023 to 2026
Legend: Frontier · OpenAI · Google · Anthropic · Meta AI · xAI · Other


QED Math Index leaders

The latest visible moves at the top of the QMI curve.
#1 GPT-5.4 Pro (xhigh) · OpenAI · API access
2026-03-05 · QMI 158.2
#2 Gemini 3.1 Pro Preview · Google DeepMind · API access
2026-02-19 · QMI 157.1
#3 Claude Opus 4.6 (120k thinking) · Anthropic · API access
2026-02-05 · QMI 155.1
#4 GPT-5.2 (high) · OpenAI · API access
2025-12-11 · QMI 153.8
#5 Gemini 3 Pro Preview · Google DeepMind · API access
2025-11-18 · QMI 153.4
#6 GPT-5 Pro · OpenAI · API access
2025-10-07 · QMI 150.3