Use case: Multimodal Reasoning

About this leaderboard
Tests combined visual and textual reasoning across spatial layouts, charts, documents, OCR, temporal sequences, and grounded captions to reveal models that truly fuse modalities.
We stress-test models with curated simulations that blend structured benchmarks and
open-ended prompts. Each table captures a distinct slice of the use case, and models are
compared consistently across shared metrics to surface leaders, trade-offs, and surprising
strengths.
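The rankings below pair each model's Elo rating with its observed win rate. As a rough guide to how an Elo gap translates into head-to-head odds, here is a minimal sketch using the standard 400-point logistic Elo formula; the function name and the assumption that this leaderboard uses the conventional Elo scale are illustrative, not taken from the site's methodology.

```python
# Minimal sketch: convert an Elo gap into an expected head-to-head win
# probability, assuming the standard 400-point logistic Elo formula.
# The leaderboard's exact rating procedure is not documented here, so treat
# this as a reading aid for the Elo column, not the site's implementation.

def expected_win_probability(elo_a: float, elo_b: float) -> float:
    """Probability that model A beats model B under standard Elo assumptions."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# Example: Claude 3.5 Sonnet (1220.00) vs. GPT-4o (990.77) in the first table.
print(f"{expected_win_probability(1220.00, 990.77):.2%}")  # roughly 79%
```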
| Rank | Model | Elo Rating | Win Rate |
|------|-------|------------|----------|
| 1 | Claude 3.5 Sonnet | 1220.00 | 75.56% |
| 2 | Gemini 2.0 Flash | 1119.62 | 61.98% |
| 3 | O1 | 1105.56 | 58.89% |
| 4 | Pixtral Large | 1074.21 | 60.99% |
| 5 | Gemini 1.5 Pro | 1007.70 | 50.88% |
| 6 | GPT-4o | 990.77 | 52.97% |
| 7 | AWS Nova Pro | 747.61 | 15.94% |
| 8 | Llama 3.2 90B | 734.52 | 25.00% |
| Rank | Model | Elo Rating | Win Rate |
|------|-------|------------|----------|
| 1 | Pixtral Large | 1144.53 | 68.05% |
| 2 | Claude 3.5 Sonnet | 1114.53 | 68.40% |
| 3 | Gemini 1.5 Pro | 1111.40 | 62.78% |
| 4 | AWS Nova Pro | 1028.58 | 67.39% |
| 5 | Llama 3.2 90B | 986.60 | 35.97% |
| 6 | Gemini 2.0 Flash | 950.50 | 41.13% |
| 7 | GPT-4o | 928.24 | 37.45% |
| 8 | O1 | 735.62 | 15.45% |
| Rank | Model | Elo Rating | Win Rate |
|------|-------|------------|----------|
| 1 | Pixtral Large | 1172.83 | 73.08% |
| 2 | O1 | 1172.80 | 76.49% |
| 3 | Gemini 2.0 Flash | 1132.23 | 60.48% |
| 4 | Claude 3.5 Sonnet | 1070.33 | 49.78% |
| 5 | GPT-4o | 1056.78 | 44.00% |
| 6 | AWS Nova Pro | 880.61 | 42.26% |
| 7 | Gemini 1.5 Pro | 870.03 | 37.93% |
| 8 | Llama 3.2 90B | 644.39 | 11.26% |
| Rank | Model | Elo Rating | Win Rate |
|------|-------|------------|----------|
| 1 | O1 | 1303.60 | 89.74% |
| 2 | Gemini 2.0 Flash | 1188.20 | 70.61% |
| 3 | Pixtral Large | 1006.68 | 52.24% |
| 4 | Gemini 1.5 Pro | 995.75 | 45.49% |
| 5 | Claude 3.5 Sonnet | 989.71 | 43.15% |
| 6 | AWS Nova Pro | 903.02 | 42.13% |
| 7 | GPT-4o | 895.53 | 42.86% |
| 8 | Llama 3.2 90B | 717.51 | 12.86% |
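For a single cross-table view, one option is to average each model's Elo over the four tables. The sketch below is a hypothetical, unweighted aggregation for illustration only, not a ranking the leaderboard publishes; the numbers are copied from the tables above.

```python
# Hypothetical cross-table summary: unweighted mean Elo per model over the
# four tables above, listed in table order. Illustrative only.

elo_by_table = {
    "Claude 3.5 Sonnet": [1220.00, 1114.53, 1070.33, 989.71],
    "Gemini 2.0 Flash":  [1119.62, 950.50, 1132.23, 1188.20],
    "O1":                [1105.56, 735.62, 1172.80, 1303.60],
    "Pixtral Large":     [1074.21, 1144.53, 1172.83, 1006.68],
    "Gemini 1.5 Pro":    [1007.70, 1111.40, 870.03, 995.75],
    "GPT-4o":            [990.77, 928.24, 1056.78, 895.53],
    "AWS Nova Pro":      [747.61, 1028.58, 880.61, 903.02],
    "Llama 3.2 90B":     [734.52, 986.60, 644.39, 717.51],
}

mean_elo = {model: sum(elos) / len(elos) for model, elos in elo_by_table.items()}
for model, elo in sorted(mean_elo.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<20} {elo:7.2f}")
```

An unweighted mean treats every table as equally important; if some slices matter more for your workload, a weighted average over the same dictionary is a one-line change.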