Use case: Deep Research Agent

About this leaderboard
Evaluates long-horizon research agents on multi-hop reasoning, source aggregation, citation faithfulness, synthesis clarity, and domain breadth (STEM, humanities, policy).
We stress-test models with curated simulations that blend structured benchmarks and
open-ended prompts. Each table captures a distinct slice of the use case, and models are
compared consistently across shared metrics to surface leaders, trade-offs, and surprising
strengths.
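As a concrete illustration of comparing models "across shared metrics," the sketch below shows one way an overall score could be aggregated from per-metric results. The metric names mirror the evaluation axes listed above, but the equal-weighting scheme, the `overall_score` helper, and the example scores are hypothetical assumptions, not the leaderboard's published methodology.

```python
# Hypothetical sketch: aggregate per-metric results into one overall score.
# Metric names follow the evaluation axes above; the equal weighting and the
# example values are assumptions, not the leaderboard's actual method.

METRICS = [
    "multi_hop_reasoning",
    "source_aggregation",
    "citation_faithfulness",
    "synthesis_clarity",
    "domain_breadth",
]

def overall_score(per_metric: dict[str, float],
                  weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-metric scores, each on a 0-100 scale."""
    weights = weights or {m: 1.0 for m in METRICS}  # assume equal weights
    total = sum(weights[m] for m in METRICS)
    return sum(per_metric[m] * weights[m] for m in METRICS) / total

# Example with made-up per-metric scores for a single model.
example = {
    "multi_hop_reasoning": 81.0,
    "source_aggregation": 76.5,
    "citation_faithfulness": 74.0,
    "synthesis_clarity": 79.0,
    "domain_breadth": 77.0,
}
print(f"overall: {overall_score(example):.2f}%")
```

With equal weights this reduces to a simple mean, which keeps the ranking from being dominated by any single evaluation axis.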
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 77.55% |
| 2 | OpenAI GPT-4 | 71.46% |
| 3 | Anthropic Claude 4.0 Opus | 69.05% |
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 79.40% |
| 2 | OpenAI GPT-4 | 72.70% |
| 3 | Anthropic Claude 4.0 Opus | 69.80% |
Economics, politics and global affairs
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 75.90% |
| 2 | OpenAI GPT-4 | 70.60% |
| 3 | Anthropic Claude 4.0 Opus | 67.40% |
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 74.10% |
| 2 | OpenAI GPT-4 | 71.80% |
| 3 | Anthropic Claude 4.0 Opus | 68.80% |