# Constraint Bench

## About this leaderboard
Evaluates safety-first reasoning under strict constraints and policy guardrails. Models are challenged with red-team-style prompts and scored on instruction adherence, refusal accuracy, and handling of boundary conditions to expose brittleness.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.
| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | OpenAI GPT-5 | 79.20% |
| 2 | OpenAI o3 | 57.20% |
| 3 | Grok 4 | 33.30% |
| 4 | Google Gemini 2.5 Pro | 28.60% |
| 5 | OpenAI o4-mini | 14.40% |
| 6 | DeepSeek R1 | 2.00% |
| 7 | Anthropic Claude 4 Sonnet | 1.90% |
| 8 | Qwen 3 | 0.40% |
| 9 | OpenAI GPT-4o | 0.30% |