Use case: Agentic Search

About this leaderboard

Measures retrieval-augmented search agents on query planning, evidence recall, groundedness of answers, citation quality, latency, and recency across web and news domains.
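
For intuition, the per-table score can be modeled as a weighted average over these dimensions. The sketch below is a minimal illustration only, assuming hypothetical per-dimension scores in [0, 1] and equal weights; the dimension names mirror the metrics listed above, and nothing here reflects the leaderboard's actual scoring code.

```python
# Minimal sketch of a composite agentic-search score.
# Dimension names mirror the metrics listed above; the weights and the
# example values are illustrative assumptions, not the leaderboard's
# published methodology.

DIMENSIONS = [
    "query_planning",
    "evidence_recall",
    "grounded_answers",
    "citation_quality",
    "latency",
    "recency",
]

def composite_score(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # assumed equal weighting
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total

# Hypothetical per-dimension scores for a single model run.
example = {
    "query_planning": 0.84,
    "evidence_recall": 0.79,
    "grounded_answers": 0.88,
    "citation_quality": 0.81,
    "latency": 0.73,   # assumed normalized so that higher = faster
    "recency": 0.77,
}
print(f"composite: {composite_score(example):.2%}")  # composite: 80.33%
```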

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.

STEM questions

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | OpenAI GPT-4.1 (Search)            | 82.00%        |
| 2    | Google Gemini 2.5 Pro (Search)     | 78.00%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 72.00%        |

Historical and archival questions

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 77.00%        |
| 2    | OpenAI GPT-4.1 (Search)            | 74.00%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 71.00%        |

Aggregate performance

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 80.00%        |
| 2    | OpenAI GPT-4.1 (Search)            | 72.50%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 70.00%        |
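
Note that the published aggregate is not the unweighted mean of the six category tables on this page: averaging Gemini's six category scores gives 80.50% against the 80.00% shown here, so some non-uniform weighting or additional data presumably applies. The sketch below, with placeholder weights, shows how such an aggregate could be computed; it is an assumption, not the leaderboard's actual formula.

```python
# Hedged sketch: one plausible way to roll per-category scores into an
# aggregate. Category keys and weights are placeholders; the values shown
# are Gemini 2.5 Pro (Search) scores taken from the tables on this page.

category_scores = {
    "stem_questions": 0.78,
    "historical_archival": 0.77,
    "multilingual_context": 0.82,
    "adversarial_prompts": 0.83,
    "recent_news": 0.89,
    "specialized_domains": 0.74,
}

weights = {name: 1.0 for name in category_scores}  # placeholder equal weights

aggregate = (sum(category_scores[c] * weights[c] for c in category_scores)
             / sum(weights.values()))
print(f"aggregate: {aggregate:.2%}")  # 80.50% here vs. 80.00% published
```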

Multilingual context

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 82.00%        |
| 2    | Anthropic Claude 4.0 Opus (Search) | 69.00%        |
| 3    | OpenAI GPT-4.1 (Search)            | 68.00%        |

Adversarial and faulty prompts

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 83.00%        |
| 2    | OpenAI GPT-4.1 (Search)            | 62.00%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 58.00%        |

Recent news and current events

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 89.00%        |
| 2    | OpenAI GPT-4.1 (Search)            | 68.00%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 64.00%        |

Specialized domain knowledge

| Rank | Model                              | Overall score |
|------|------------------------------------|---------------|
| 1    | Google Gemini 2.5 Pro (Search)     | 74.00%        |
| 2    | OpenAI GPT-4.1 (Search)            | 71.00%        |
| 3    | Anthropic Claude 4.0 Opus (Search) | 68.00%        |