Use case

Image Generation

About this leaderboard

Ranks still-image generation models on fidelity, composition, text rendering, fine-grained detail, and safety/NSFW defenses, as judged by aesthetic raters and pairwise preference (Elo) comparisons.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared consistently across shared metrics to surface leaders, trade-offs, and surprising strengths.
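For readers unfamiliar with Elo, the sketch below shows how a rating can be updated from a single pairwise preference. It uses the standard Elo formula; the K-factor of 32, the starting rating of 1000, and the names `expected_score`, `update_elo`, `model_a`, and `model_b` are illustrative assumptions, since this page does not state the parameters the leaderboard uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    k is an assumed K-factor; the leaderboard's actual value is not documented here.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical models and outcomes, both starting from an assumed rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_wins in [True, True, False]:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_wins
    )
```

Because each update is zero-sum, the total rating across the two models stays fixed; only the gap between them moves with each preference.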

Overall

Rank  Model               Elo Rating  Trust Skill Rating  Average Rank  Rank 1 Percentage
1     GPT Image 1         1069.17     982.86              1.41          59.31
2     GPT 4.1             1039.62     979.89              1.4           60.14
3     Recraft v3          1039.37     959.04              1.63          36.52
4     Imagen 3            1024.05     963.71              1.46          53.78
5     flux_image          1008.33     967.36              1.54          45.82
6     DALL·E 3            976.28      913.58              1.45          55.25
7     Ideogram 2.0        939.63      910.42              1.55          44.7
8     Stable Diffusion 3  903.55      895.02              1.55          45.21
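
One way to read the last two columns together: if every comparison is pairwise, a model finishes either first or second, so its average rank should equal 2 minus its Rank 1 fraction. The sketch below checks that reading against the rows above. It is an inference from the published numbers, not a documented definition of the metrics, and the names `rows` and `implied_average_rank` are illustrative.

```python
# (Average Rank, Rank 1 Percentage) copied from the Overall table.
rows = {
    "GPT Image 1": (1.41, 59.31),
    "GPT 4.1": (1.4, 60.14),
    "Recraft v3": (1.63, 36.52),
    "Imagen 3": (1.46, 53.78),
    "flux_image": (1.54, 45.82),
    "DALL·E 3": (1.45, 55.25),
    "Ideogram 2.0": (1.55, 44.7),
    "Stable Diffusion 3": (1.55, 45.21),
}


def implied_average_rank(rank1_pct: float) -> float:
    """Average rank if each comparison is pairwise (rank 1 or rank 2 only)."""
    p = rank1_pct / 100.0
    return 1 * p + 2 * (1 - p)


# Largest disagreement between the implied and the published average rank.
max_gap = max(
    abs(implied_average_rank(pct) - avg) for avg, pct in rows.values()
)
```

For these eight rows the gap stays within the rounding of a two-decimal column (at most 0.005), which is consistent with the pairwise reading but does not prove it.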