Use case

Complex Reasoning

About this leaderboard

The Benchmark Hub complex reasoning leaderboard rigorously evaluates frontier models across math proofs, coding puzzles, temporal and spatial reasoning, multi-step logic, and abstract problem-solving. Scores come from curated suites, ensemble judges, and consistency checks to highlight systems that stay reliable under varied reasoning workloads.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table below covers one slice of the use case, and every table reports the same overall_score metric, so leaders, trade-offs, and surprising strengths are easy to spot.

Math

rank model overall_score
1 OpenAI GPT-5 92.20%
2 xAI Grok 4 81.80%
3 OpenAI o3 75.30%
4 Google Gemini 2.5 Pro 69.20%
5 Qwen 3.0 66.00%
6 OpenAI o4-mini-high 63.90%
7 DeepSeek R1 62.03%
8 Claude 4.0 Opus 60.77%
9 xAI Grok 3 59.23%
10 OpenAI GPT-4.1 53.10%
11 OpenAI o3-mini-high 51.60%
12 Amazon Nova Premier 48.14%
13 Claude 3.7 Sonnet (Thinking) 46.80%
14 OpenAI o1 Full 44.40%

Pure math

rank model overall_score
1 OpenAI GPT-5 93.00%
2 OpenAI o3 91.10%
3 xAI Grok 4 84.80%
4 OpenAI o4-mini-high 84.60%
5 OpenAI GPT-4.1 71.60%
6 Claude 4.0 Opus 71.08%
7 Google Gemini 2.5 Pro 70.50%
8 OpenAI o3-mini-high 68.40%
9 DeepSeek R1 65.90%
10 Qwen 3.0 62.68%
11 xAI Grok 3 59.23%
12 OpenAI o1 Full 57.80%
13 Amazon Nova Premier 41.33%
14 Claude 3.7 Sonnet (Thinking) 30.40%

Applied math

rank model overall_score
1 OpenAI GPT-5 86.30%
2 xAI Grok 4 79.90%
3 OpenAI o3 77.80%
4 OpenAI o4-mini-high 70.70%
5 Qwen 3.0 70.10%
6 xAI Grok 3 65.71%
7 Google Gemini 2.5 Pro 63.40%
8 Claude 4.0 Opus 61.51%
9 OpenAI GPT-4.1 59.80%
10 DeepSeek R1 58.60%
11 OpenAI o3-mini-high 57.20%
12 OpenAI o1 Full 55.60%
13 Amazon Nova Premier 52.83%
14 Claude 3.7 Sonnet (Thinking) 40.00%

Computer science

rank model overall_score
1 OpenAI GPT-5 88.70%
2 OpenAI o3 82.60%
3 xAI Grok 4 75.40%
4 Claude 4.0 Opus 63.14%
5 OpenAI o4-mini-high 61.50%
6 Qwen 3.0 60.70%
6 Google Gemini 2.5 Pro 60.70%
8 xAI Grok 3 58.46%
9 Amazon Nova Premier 55.65%
10 OpenAI GPT-4.1 52.00%
11 OpenAI o3-mini-high 49.70%
12 DeepSeek R1 49.30%
13 OpenAI o1 Full 45.60%
14 Claude 3.7 Sonnet (Thinking) 36.10%
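
The tie in the table above (two models at rank 6, followed by rank 8) is consistent with standard competition ranking, where equal scores share a rank and the next rank skips ahead. The leaderboard does not document its ranking code, so the following is only a minimal sketch of that convention. The function name competition_rank and its first_rank parameter are hypothetical, chosen for illustration.

```python
def competition_rank(scores, first_rank=1):
    """Rank models by descending score; tied scores share a rank.

    Illustrative helper, not the leaderboard's own implementation.
    `scores` maps model name -> score; `first_rank` is the rank given
    to the highest score in `scores`.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = None
    for position, (model, score) in enumerate(ordered, start=first_rank):
        # A tie reuses the previous rank; otherwise the rank is the position,
        # which is what makes the rank after a tie skip ahead.
        rank = prev_rank if score == prev_score else position
        ranks[model] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Scores copied from rows 5 through 8 of the Computer science table.
cs_scores = {
    "OpenAI o4-mini-high": 61.50,
    "Qwen 3.0": 60.70,
    "Google Gemini 2.5 Pro": 60.70,
    "xAI Grok 3": 58.46,
}

ranks = competition_rank(cs_scores, first_rank=5)
for model, rank in ranks.items():
    print(rank, model)
```

Under this convention the four rows receive ranks 5, 6, 6, and 8, matching the table.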

General reasoning

rank model overall_score
1 OpenAI GPT-5 87.40%
2 OpenAI o3 80.10%
3 OpenAI o4-mini-high 78.20%
4 xAI Grok 4 77.80%
5 Qwen 3.0 76.50%
6 Google Gemini 2.5 Pro 70.60%
7 Claude 4.0 Opus 68.30%
8 OpenAI GPT-4.1 66.20%
9 OpenAI o3-mini-high 63.20%
10 xAI Grok 3 62.26%
11 DeepSeek R1 58.80%
12 OpenAI o1 Full 55.90%
13 Amazon Nova Premier 46.81%
14 Claude 3.7 Sonnet (Thinking) 37.30%

Aggregate performance

rank model overall_score
1 OpenAI GPT-5 89.60%
2 OpenAI o3 86.40%
3 xAI Grok 4 80.70%
4 OpenAI o4-mini-high 78.50%
5 Qwen 3.0 72.40%
6 Google Gemini 2.5 Pro 70.20%
7 Claude 4.0 Opus 68.22%
8 OpenAI GPT-4.1 66.42%
9 OpenAI o3-mini-high 63.43%
10 xAI Grok 3 62.96%
11 DeepSeek R1 62.13%
12 OpenAI o1 Full 56.26%
13 Amazon Nova Premier 50.60%
14 Claude 3.7 Sonnet (Thinking) 50.50%