Use case: Image Generation
About this leaderboard
Ranks still-image models on fidelity, composition, text rendering, fine-grained detail, and safety/NSFW defenses, using aesthetic raters and pairwise preference (Elo) comparisons.
We stress-test models with curated simulations that blend structured benchmarks and
open-ended prompts. Each table captures a distinct slice of the use case, and models are
compared consistently across shared metrics to surface leaders, trade-offs, and surprising
strengths.
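For readers unfamiliar with Elo, the sketch below shows how a rating gap translates into an expected preference rate and how one pairwise vote would shift two ratings. It uses the textbook Elo formulas; the scale of 400 and the K-factor of 32 are conventional defaults assumed here for illustration, not parameters this leaderboard is known to use, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    The scale of 400 is the conventional default (an assumption here).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after a single pairwise comparison.

    k=32 is an assumed K-factor; the leaderboard does not state its own.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Elo ratings as listed for GPT Image 1 and Stable Diffusion 3.
p = expected_score(1069.17, 903.55)

# An upset (the lower-rated model is preferred) moves both ratings the most.
new_a, new_b = update(1069.17, 903.55, a_won=False)
```

Under these assumptions, the gap of about 166 points between the highest and lowest listed ratings corresponds to the higher-rated model being preferred in roughly 72% of head-to-head comparisons.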
| Rank | Model | Elo Rating | Trust Skill Rating | Average Rank | Rank 1 Percentage |
|---|---|---|---|---|---|
| 1 | GPT Image 1 | 1069.17 | 982.86 | 1.41 | 59.31 |
| 2 | GPT 4.1 | 1039.62 | 979.89 | 1.4 | 60.14 |
| 3 | Recraft v3 | 1039.37 | 959.04 | 1.63 | 36.52 |
| 4 | Imagen 3 | 1024.05 | 963.71 | 1.46 | 53.78 |
| 5 | flux_image | 1008.33 | 967.36 | 1.54 | 45.82 |
| 6 | DALL·E 3 | 976.28 | 913.58 | 1.45 | 55.25 |
| 7 | Ideogram 2.0 | 939.63 | 910.42 | 1.55 | 44.7 |
| 8 | Stable Diffusion 3 | 903.55 | 895.02 | 1.55 | 45.21 |
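The Average Rank and Rank 1 Percentage columns appear to be two views of the same quantity. If every comparison ranks exactly two images (first or second), then a model's average rank equals 2 minus its rank-1 share. That two-way setup is an inference from the numbers, not something the leaderboard states; the sketch below simply checks the published figures against it.

```python
# (model, Average Rank, Rank 1 Percentage) copied from the table.
rows = [
    ("GPT Image 1", 1.41, 59.31),
    ("GPT 4.1", 1.4, 60.14),
    ("Recraft v3", 1.63, 36.52),
    ("Imagen 3", 1.46, 53.78),
    ("flux_image", 1.54, 45.82),
    ("DALL·E 3", 1.45, 55.25),
    ("Ideogram 2.0", 1.55, 44.7),
    ("Stable Diffusion 3", 1.55, 45.21),
]

gaps = {}
for name, avg_rank, rank1_pct in rows:
    # Assumed model: rank is 1 with probability p and 2 otherwise,
    # so the mean rank is 1*p + 2*(1 - p) = 2 - p.
    implied = 2.0 - rank1_pct / 100.0
    gaps[name] = abs(avg_rank - implied)
    print(f"{name:20s} listed={avg_rank:.2f} implied={implied:.4f}")

# Largest disagreement between the listed and implied average rank.
max_gap = max(gaps.values())
```

Every listed average rank matches the implied value to within rounding. Note also that the table is ordered by Elo Rating rather than by these two columns, which is why GPT 4.1 sits second despite a slightly higher Rank 1 Percentage than GPT Image 1.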