# Constraint Bench

## About this leaderboard
Evaluates safety-first reasoning under strict constraints and policy guardrails. Models are challenged with red-team-style prompts and scored on instruction adherence, refusal accuracy, and handling of boundary conditions to expose brittleness.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.
| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 79.20% |
| 2 | OpenAI o3 | 57.20% |
| 3 | Grok 4 | 33.30% |
| 4 | Google Gemini 2.5 Pro | 28.60% |
| 5 | OpenAI o4-mini | 14.40% |
| 6 | DeepSeek R1 | 2.00% |
| 7 | Anthropic Claude 4 Sonnet | 1.90% |
| 8 | Qwen 3 | 0.40% |
| 9 | OpenAI GPT-4o | 0.30% |