FIND THE BEST MODEL FOR... LEGAL

Comprehensive benchmarking results for the latest Large Language Models across multiple performance metrics



Latest Benchmark Results

Top 5 LLMs with comprehensive performance metrics

Gemini 3 Pro
Google
⚡ Speed ~110 TPS
🎯 Accuracy 91.9% 🏆
🧠 Reasoning 100/100
💰 Cost $2/$8
⏱️ Latency 0.40s
📊 Throughput High
💾 Context 2M
🧩 Thinking Native
💻 Coding 76.2%

Claude Sonnet 4.5
Anthropic
⚡ Speed ~85 TPS
🎯 Accuracy 88.5%
🧠 Reasoning 92/100
💰 Cost $3/$15
⏱️ Latency 0.45s
📊 Throughput Medium
💾 Context 200K
🧩 Thinking 85/100
💻 Coding 82.0% 💻

Llama 4 Scout
Meta
⚡ Speed 2,600 TPS ⚡
🎯 Accuracy 82.0%
🧠 Reasoning 75/100
💰 Cost $0.05/$0.10 💰
⏱️ Latency 0.10s 🚀
📊 Throughput Extreme
💾 Context 128K
🧩 Thinking N/A
💻 Coding ~50%

GPT-5.1
OpenAI
⚡ Speed ~95 TPS
🎯 Accuracy 88.1%
🧠 Reasoning 100/100
💰 Cost $1.25/$10
⏱️ Latency 0.50s
📊 Throughput High
💾 Context 400K
🧩 Thinking 90/100
💻 Coding 76.3%

Grok 4.1
xAI
⚡ Speed ~140 TPS
🎯 Accuracy 89.0%
🧠 Reasoning 94/100
💰 Cost $0.20/$0.50
⏱️ Latency 0.35s
📊 Throughput Very High
💾 Context 1M
🧩 Thinking 94/100
💻 Coding 75.0%
🏆 = Best Overall
⚡ = Fastest
💰 = Most Cost-Effective
🚀 = Lowest Latency
💻 = Best Coding
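
The Cost column pairs two prices per model; assuming these are USD per one million input and output tokens respectively (an assumption, since the cards do not state the unit), a minimal sketch of estimating spend for a single request:

```python
# Minimal sketch: per-request cost, assuming the "Cost" column lists
# USD per 1M input / output tokens (an assumption, not stated on the cards).
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Estimated USD cost of one request at the given per-million-token rates."""
    return (input_tokens / 1_000_000) * usd_per_m_in \
         + (output_tokens / 1_000_000) * usd_per_m_out

# Example: a 20K-token prompt and a 2K-token completion on GPT-5.1 ($1.25/$10)
print(f"${estimate_cost(20_000, 2_000, 1.25, 10.0):.4f}")  # -> $0.0450
```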

Latest News Highlights

The top five AI news stories, presented in the same card format as the benchmark results above.


Leaderboard

Three featured boards with the leading two models and their standout metrics. Stay on top of the fastest movers.

Use Case

Complex Reasoning

The Benchmark Hub complex reasoning leaderboard rigorously evaluates frontier models across math proofs, coding puzzles, temporal and spatial reasoning, multi-step logic, and abstract problem-solving. Scores come from curated suites, ensemble judges, and consistency checks to highlight systems that stay reliable under varied reasoning workloads; an illustrative sketch of this kind of aggregation follows the board below.

#1 Model: OpenAI GPT-5
Overall Score: 92.20%

#2 Model: xAI Grok 4
Overall Score: 81.80%

Top 2 models • refreshed regularly
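
As referenced above, the complex reasoning description mentions ensemble judges and consistency checks. The snippet below is a purely hypothetical sketch of how per-task judge scores could be combined into an overall score; the penalty weight and scoring scale are illustrative assumptions, not the hub's actual pipeline.

```python
# Hypothetical aggregation: average several judge scores per task, then apply a
# consistency penalty when judges disagree. Illustrative only; not the hub's real method.
from statistics import mean, pstdev

def task_score(judge_scores: list[float], disagreement_weight: float = 0.5) -> float:
    """Combine per-judge scores (0-100) for one task; wider judge spread lowers the score."""
    penalty = disagreement_weight * pstdev(judge_scores)
    return max(0.0, mean(judge_scores) - penalty)

def overall_score(tasks: list[list[float]]) -> float:
    """Average the per-task scores across the whole suite."""
    return mean(task_score(scores) for scores in tasks)

# Example: three judges scoring two reasoning tasks
print(round(overall_score([[92, 95, 90], [88, 70, 85]]), 2))
```
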
Use Case

Constraint Bench

Evaluates safety-first reasoning under strict constraints and policy guardrails. Models face red-team-style prompts and boundary conditions, and are scored on instruction adherence and refusal accuracy to expose brittleness.

#1 Model: OpenAI GPT-5
Overall Score: 79.20%

#2 Model: OpenAI o3
Overall Score: 57.20%

Top 2 models • refreshed regularly

Use Case

Video Generation

Scores story-driven video models on motion consistency, frame coherence, lip sync, text rendering, temporal alignment, and safety filters across varied shot types and prompts.

#1 Model: Veo 3 (w/o audio)
Elo Rating: 1267.99
Trust Skill Rating: 1178.14

#2 Model: Veo 2
Elo Rating: 1135.46
Trust Skill Rating: 1065.50

Top 2 models • refreshed regularly
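
The video board reports an Elo Rating, which is typically derived from head-to-head preference votes. Below is a minimal sketch of a standard Elo update; the K-factor, and the assumption that the hub uses vanilla Elo at all, are illustrative rather than confirmed.

```python
# Standard Elo update after one pairwise comparison (illustrative; the hub's
# exact rating formula and K-factor are assumptions).
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after a single A-vs-B preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Example: the #1 video model (1267.99) is preferred over the #2 model (1135.46)
print(elo_update(1267.99, 1135.46, a_won=True))
```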

Latest Reports

Research briefs and deep dives from our team.


About The Benchmark Hub

Benchmark Hub provides comprehensive, objective performance analysis of Large Language Models. Our testing methodology ensures fair comparisons across multiple dimensions including speed, accuracy, reasoning capabilities, and cost-effectiveness.

We continuously update our benchmarks to reflect the latest model releases and improvements, helping developers, researchers, and organizations make informed decisions about which LLM best suits their needs.