Use case: Deep Research Agent

About this leaderboard

Evaluates long-horizon research agents on multi-hop reasoning, source aggregation, citation faithfulness, synthesis clarity, and domain breadth (STEM, humanities, policy).

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.

Composite

Rank  Model                      Overall score
1     Google Gemini 2.5 Pro      77.55%
2     OpenAI GPT-4               71.46%
3     Anthropic Claude 4.0 Opus  69.05%

STEM

Rank  Model                      Overall score
1     Google Gemini 2.5 Pro      79.40%
2     OpenAI GPT-4               72.70%
3     Anthropic Claude 4.0 Opus  69.80%

Economics, policy, and global affairs

Rank  Model                      Overall score
1     Google Gemini 2.5 Pro      75.90%
2     OpenAI GPT-4               70.60%
3     Anthropic Claude 4.0 Opus  67.40%

Humanities & arts

Rank  Model                      Overall score
1     Google Gemini 2.5 Pro      74.10%
2     OpenAI GPT-4               71.80%
3     Anthropic Claude 4.0 Opus  68.80%
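Note that the composite scores are not a simple average of the three domain tables: an equal-weight mean of Gemini 2.5 Pro's domain scores gives about 76.47%, not the 77.55% shown, so the leaderboard's actual weighting (or additional hidden metrics) is unspecified. As a sketch only, a weighted-mean aggregator over per-domain scores could look like this; the weights and domain labels here are assumptions, not the leaderboard's methodology:

```python
# Hypothetical sketch: rolling per-domain leaderboard scores into one composite.
# The leaderboard does not publish its weighting; equal weights are the default
# assumption here and almost certainly differ from the real methodology.

def composite_score(domain_scores, weights=None):
    """Weighted mean of per-domain scores (percentages)."""
    if weights is None:
        weights = {d: 1.0 for d in domain_scores}  # equal weighting by default
    total_weight = sum(weights[d] for d in domain_scores)
    return sum(domain_scores[d] * weights[d] for d in domain_scores) / total_weight

# Gemini 2.5 Pro's per-domain scores from the tables above:
gemini = {"STEM": 79.40, "Economics & policy": 75.90, "Humanities & arts": 74.10}
print(round(composite_score(gemini), 2))  # equal-weight mean; reported composite is 77.55
```

Passing an explicit `weights` dict (e.g. favoring STEM) shifts the composite accordingly, which is one way the 77.55% figure could arise from these domain scores.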