# Use case: Complex Reasoning

## About this leaderboard
The Benchmark Hub complex reasoning leaderboard rigorously evaluates frontier models across math proofs, coding puzzles, temporal and spatial reasoning, multi-step logic, and abstract problem-solving. Scores come from curated suites, ensemble judges, and consistency checks to highlight systems that stay reliable under varied reasoning workloads.
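Benchmark Hub does not publish its aggregation code, but the description above maps onto a simple pattern: have several judges score each item, discard items where the judges disagree too much (the consistency check), and average what remains. Below is a minimal Python sketch of that idea; the function name, the 0.15 disagreement threshold, and the equal judge weighting are illustrative assumptions, not the leaderboard's actual method.

```python
from statistics import mean, pstdev

def aggregate_score(judge_scores_per_item: list[list[float]],
                    max_disagreement: float = 0.15) -> float:
    """Hypothetical sketch: fold per-item ensemble-judge scores
    (each in [0, 1]) into one overall score. Items whose judges
    disagree by more than `max_disagreement` (population std dev)
    fail the consistency check and are dropped. The threshold and
    equal judge weighting are illustrative guesses.
    """
    kept = [mean(scores) for scores in judge_scores_per_item
            if pstdev(scores) <= max_disagreement]
    if not kept:
        raise ValueError("no items passed the consistency check")
    return mean(kept)

# Three items, each scored by three judges; the last item is dropped
# because its judges disagree too strongly, so the overall score is
# mean(0.90, 0.65) = 77.50%.
print(f"{aggregate_score([[0.9, 0.85, 0.95], [0.6, 0.7, 0.65], [0.1, 0.9, 0.5]]):.2%}")
```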
We stress-test models with curated simulations that blend structured benchmarks with open-ended prompts. Each table captures a distinct slice of the use case, and models are scored on the same metrics in every table, surfacing leaders, trade-offs, and surprising strengths.
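Because every table shares the same rank/model/overall-score schema, one reasonable way to compare across slices is mechanical: collect a model's score from each table and rank by the mean. The sketch below demonstrates this with just the GPT-5 and Grok 4 rows from the first two tables; the `"table_1"`/`"table_2"` labels are placeholders, and extending it to all six tables and fourteen models is a matter of filling in the dictionary.

```python
from collections import defaultdict

# Overall scores (percent) copied from the first two tables below.
slices = {
    "table_1": {"OpenAI GPT-5": 92.20, "xAI Grok 4": 81.80},
    "table_2": {"OpenAI GPT-5": 93.00, "xAI Grok 4": 84.80},
}

# Gather each model's scores across every slice it appears in.
scores_by_model: defaultdict[str, list[float]] = defaultdict(list)
for table in slices.values():
    for model, score in table.items():
        scores_by_model[model].append(score)

# Rank by mean score, highest first.
ranking = sorted(scores_by_model.items(),
                 key=lambda kv: sum(kv[1]) / len(kv[1]), reverse=True)
for rank, (model, scores) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {sum(scores) / len(scores):.2f}%")
# 1. OpenAI GPT-5: 92.60%
# 2. xAI Grok 4: 83.30%
```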

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 92.20% |
| 2 | xAI Grok 4 | 81.80% |
| 3 | OpenAI o3 | 75.30% |
| 4 | Google Gemini 2.5 Pro | 69.20% |
| 5 | Qwen 3.0 | 66.00% |
| 6 | OpenAI o4-mini-high | 63.90% |
| 7 | DeepSeek R1 | 62.03% |
| 8 | Claude 4.0 Opus | 60.77% |
| 9 | xAI Grok 3 | 59.23% |
| 10 | OpenAI GPT-4.1 | 53.10% |
| 11 | OpenAI o3-mini-high | 51.60% |
| 12 | Amazon Nova Premier | 48.14% |
| 13 | Claude 3.7 Sonnet (Thinking) | 46.80% |
| 14 | OpenAI o1 Full | 44.40% |

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 93.00% |
| 2 | OpenAI o3 | 91.10% |
| 3 | xAI Grok 4 | 84.80% |
| 4 | OpenAI o4-mini-high | 84.60% |
| 5 | OpenAI GPT-4.1 | 71.60% |
| 6 | Claude 4.0 Opus | 71.08% |
| 7 | Google Gemini 2.5 Pro | 70.50% |
| 8 | OpenAI o3-mini-high | 68.40% |
| 9 | DeepSeek R1 | 65.90% |
| 10 | Qwen 3.0 | 62.68% |
| 11 | xAI Grok 3 | 59.23% |
| 12 | OpenAI o1 Full | 57.80% |
| 13 | Amazon Nova Premier | 41.33% |
| 14 | Claude 3.7 Sonnet (Thinking) | 30.40% |

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 86.30% |
| 2 | xAI Grok 4 | 79.90% |
| 3 | OpenAI o3 | 77.80% |
| 4 | OpenAI o4-mini-high | 70.70% |
| 5 | Qwen 3.0 | 70.10% |
| 6 | xAI Grok 3 | 65.71% |
| 7 | Google Gemini 2.5 Pro | 63.40% |
| 8 | Claude 4.0 Opus | 61.51% |
| 9 | OpenAI GPT-4.1 | 59.80% |
| 10 | DeepSeek R1 | 58.60% |
| 11 | OpenAI o3-mini-high | 57.20% |
| 12 | OpenAI o1 Full | 55.60% |
| 13 | Amazon Nova Premier | 52.83% |
| 14 | Claude 3.7 Sonnet (Thinking) | 40.00% |

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 88.70% |
| 2 | OpenAI o3 | 82.60% |
| 3 | xAI Grok 4 | 75.40% |
| 4 | Claude 4.0 Opus | 63.14% |
| 5 | OpenAI o4-mini-high | 61.50% |
| 6 | Qwen 3.0 | 60.70% |
| 6 | Google Gemini 2.5 Pro | 60.70% |
| 8 | xAI Grok 3 | 58.46% |
| 9 | Amazon Nova Premier | 55.65% |
| 10 | OpenAI GPT-4.1 | 52.00% |
| 11 | OpenAI o3-mini-high | 49.70% |
| 12 | DeepSeek R1 | 49.30% |
| 13 | OpenAI o1 Full | 45.60% |
| 14 | Claude 3.7 Sonnet (Thinking) | 36.10% |

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 87.40% |
| 2 | OpenAI o3 | 80.10% |
| 3 | OpenAI o4-mini-high | 78.20% |
| 4 | xAI Grok 4 | 77.80% |
| 5 | Qwen 3.0 | 76.50% |
| 6 | Google Gemini 2.5 Pro | 70.60% |
| 7 | Claude 4.0 Opus | 68.30% |
| 8 | OpenAI GPT-4.1 | 66.20% |
| 9 | OpenAI o3-mini-high | 63.20% |
| 10 | xAI Grok 3 | 62.26% |
| 11 | DeepSeek R1 | 58.80% |
| 12 | OpenAI o1 Full | 55.90% |
| 13 | Amazon Nova Premier | 46.81% |
| 14 | Claude 3.7 Sonnet (Thinking) | 37.30% |

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 89.60% |
| 2 | OpenAI o3 | 86.40% |
| 3 | xAI Grok 4 | 80.70% |
| 4 | OpenAI o4-mini-high | 78.50% |
| 5 | Qwen 3.0 | 72.40% |
| 6 | Google Gemini 2.5 Pro | 70.20% |
| 7 | Claude 4.0 Opus | 68.22% |
| 8 | OpenAI GPT-4.1 | 66.42% |
| 9 | OpenAI o3-mini-high | 63.43% |
| 10 | xAI Grok 3 | 62.96% |
| 11 | DeepSeek R1 | 62.13% |
| 12 | OpenAI o1 Full | 56.26% |
| 13 | Amazon Nova Premier | 50.60% |
| 14 | Claude 3.7 Sonnet (Thinking) | 50.50% |