Use case

Constraint Bench

About this leaderboard

Evaluates safety-first reasoning under strict constraints and policy guardrails. Models are challenged on red-team-style prompts, instruction adherence, refusal accuracy, and boundary conditions to expose brittleness.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.

Overall

Rank  Model                      Overall Score
1     OpenAI GPT-5               79.20%
2     OpenAI o3                  57.20%
3     Grok 4                     33.30%
4     Google Gemini 2.5 Pro      28.60%
5     OpenAI o4-mini             14.40%
6     DeepSeek R1                2.00%
7     Anthropic Claude 4 Sonnet  1.90%
8     Qwen 3                     0.40%
9     OpenAI GPT-4o              0.30%