About this leaderboard
This leaderboard measures retrieval-augmented agents on query planning, evidence recall, grounded answers, citation quality, latency, and recency across web and news domains. We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared on the same shared metrics to surface leaders, trade-offs, and surprising strengths.
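As a rough illustration of how per-metric results could roll up into a single overall_score, here is a minimal Python sketch. The metric names mirror the dimensions listed above, but the equal weighting and the aggregation formula are assumptions for illustration only, not the leaderboard's actual scoring method.

```python
# Hypothetical aggregation of per-metric scores into an overall_score.
# Metric names follow the dimensions described above; the equal weights
# are an assumption, not the leaderboard's actual formula.

METRICS = [
    "query_planning",
    "evidence_recall",
    "grounded_answers",
    "citation_quality",
    "latency",
    "recency",
]


def overall_score(scores: dict, weights: dict = None) -> float:
    """Weighted mean of per-metric scores (each in [0, 1]), returned as a percentage."""
    if weights is None:
        weights = {m: 1.0 for m in METRICS}  # assumed equal weighting
    total_weight = sum(weights[m] for m in METRICS)
    weighted_sum = sum(scores[m] * weights[m] for m in METRICS)
    return 100.0 * weighted_sum / total_weight


# Example: a model scoring 0.8 on every metric gets an overall_score of 80.00%.
example = {m: 0.8 for m in METRICS}
print(f"{overall_score(example):.2f}%")
```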
| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | OpenAI GPT-4.1 (Search) | 82.00% |
| 2 | Google Gemini 2.5 Pro (Search) | 78.00% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 72.00% |

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 77.00% |
| 2 | OpenAI GPT-4.1 (Search) | 74.00% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 71.00% |

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 80.00% |
| 2 | OpenAI GPT-4.1 (Search) | 72.50% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 70.00% |

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 82.00% |
| 2 | Anthropic Claude 4.0 Opus (Search) | 69.00% |
| 3 | OpenAI GPT-4.1 (Search) | 68.00% |

Faulty adversarial prompts

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 83.00% |
| 2 | OpenAI GPT-4.1 (Search) | 62.00% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 58.00% |

Recent news & current events

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 89.00% |
| 2 | OpenAI GPT-4.1 (Search) | 68.00% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 64.00% |

Specialized domain knowledge

| Rank | Model | Overall score |
|------|-------|---------------|
| 1 | Google Gemini 2.5 Pro (Search) | 74.00% |
| 2 | OpenAI GPT-4.1 (Search) | 71.00% |
| 3 | Anthropic Claude 4.0 Opus (Search) | 68.00% |