Use case: Deep Research Agent

About this leaderboard
Evaluates long-horizon research agents on multi-hop reasoning, source aggregation, citation faithfulness, synthesis clarity, and domain breadth (STEM, humanities, policy).
We stress-test models with curated simulations that blend structured benchmarks and
open-ended prompts. Each table captures a distinct slice of the use case, and models are
compared consistently across shared metrics to surface leaders, trade-offs, and surprising
strengths.
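As a concrete illustration of comparing models "across shared metrics," the sketch below shows one way an overall score could be aggregated from per-metric results. The metric names mirror the evaluation axes listed above, but the equal-weighting scheme, the `overall_score` helper, and the example scores are hypothetical assumptions, not the leaderboard's published methodology.

```python
# Hypothetical sketch: aggregate per-metric results into one overall score.
# Metric names follow the evaluation axes above; the equal weighting and the
# example values are assumptions, not the leaderboard's actual method.

METRICS = [
    "multi_hop_reasoning",
    "source_aggregation",
    "citation_faithfulness",
    "synthesis_clarity",
    "domain_breadth",
]

def overall_score(per_metric: dict[str, float],
                  weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-metric scores, each on a 0-100 scale."""
    weights = weights or {m: 1.0 for m in METRICS}  # assume equal weights
    total = sum(weights[m] for m in METRICS)
    return sum(per_metric[m] * weights[m] for m in METRICS) / total

# Example with made-up per-metric scores for a single model.
example = {
    "multi_hop_reasoning": 81.0,
    "source_aggregation": 76.5,
    "citation_faithfulness": 74.0,
    "synthesis_clarity": 79.0,
    "domain_breadth": 77.0,
}
print(f"overall: {overall_score(example):.2f}%")
```

With equal weights this reduces to a simple mean, which keeps the ranking from being dominated by any single evaluation axis.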
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 77.55% |
| 2 | OpenAI GPT-4 | 71.46% |
| 3 | Anthropic Claude 4.0 Opus | 69.05% |
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 79.40% |
| 2 | OpenAI GPT-4 | 72.70% |
| 3 | Anthropic Claude 4.0 Opus | 69.80% |
Economics, politics and global affairs
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 75.90% |
| 2 | OpenAI GPT-4 | 70.60% |
| 3 | Anthropic Claude 4.0 Opus | 67.40% |
| Rank | Model | Overall score |
|---|---|---|
| 1 | Google Gemini 2.5 Pro | 74.10% |
| 2 | OpenAI GPT-4 | 71.80% |
| 3 | Anthropic Claude 4.0 Opus | 68.80% |