Use case: Multimodal Reasoning

About this leaderboard

This leaderboard tests combined visual and textual reasoning on spatial layouts, charts, documents, OCR, temporal sequences, and grounded captions to reveal which models genuinely fuse the two modalities.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared on shared metrics (Elo rating and head-to-head win rate) to surface leaders, trade-offs, and surprising strengths.
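
The page does not document its rating procedure, so the following is a rough sketch only: assuming the Elo rating column is derived from standard Elo updates over pairwise head-to-head comparisons, the update rule looks like this in Python (the function names, the K-factor of 32, and the 400-point logistic scale are conventional Elo defaults, not values taken from this leaderboard):

    # Illustrative Elo update over pairwise model comparisons.
    # ASSUMPTIONS: K = 32 and the 400-point logistic scale are standard
    # Elo defaults, not parameters documented by this leaderboard.
    def expected_score(r_a, r_b):
        # Probability that model A beats model B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, a_won, k=32.0):
        # Return updated (r_a, r_b) after one head-to-head comparison.
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        delta = k * (s_a - e_a)
        return r_a + delta, r_b - delta

    # Example: expected_score(1220.00, 1119.62) is roughly 0.64.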

Spatial

Rank  Model              Elo rating  Win rate
1     Claude 3.5 Sonnet  1220.00     75.56%
2     Gemini 2.0 Flash   1119.62     61.98%
3     O1                 1105.56     58.89%
4     Pixtral Large      1074.21     60.99%
5     Gemini 1.5 Pro     1007.70     50.88%
6     GPT-4o             990.77      52.97%
7     AWS Nova Pro       747.61      15.94%
8     Llama 3.2 90B      734.52      25.00%
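
Reading the table: under the logistic Elo model sketched above, the roughly 100-point gap between Claude 3.5 Sonnet (1220.00) and Gemini 2.0 Flash (1119.62) corresponds to an expected head-to-head win probability of about 1 / (1 + 10^(-100.38/400)) ≈ 0.64 in Claude's favor. Note that rank follows Elo rating rather than raw win rate against the whole pool, which is why O1 (58.89%) can rank above Pixtral Large (60.99%) here.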

Captioning

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1144.53     68.05%
2     Claude 3.5 Sonnet  1114.53     68.40%
3     Gemini 1.5 Pro     1111.40     62.78%
4     AWS Nova Pro       1028.58     67.39%
5     Llama 3.2 90B      986.60      35.97%
6     Gemini 2.0 Flash   950.50      41.13%
7     GPT-4o             928.24      37.45%
8     O1                 735.62      15.45%

Differences

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1172.83     73.08%
2     O1                 1172.80     76.49%
3     Gemini 2.0 Flash   1132.23     60.48%
4     Claude 3.5 Sonnet  1070.33     49.78%
5     GPT-4o             1056.78     44.00%
6     AWS Nova Pro       880.61      42.26%
7     Gemini 1.5 Pro     870.03      37.93%
8     Llama 3.2 90B      644.39      11.26%

Storytelling

Rank  Model              Elo rating  Win rate
1     O1                 1303.60     89.74%
2     Gemini 2.0 Flash   1188.20     70.61%
3     Pixtral Large      1006.68     52.24%
4     Gemini 1.5 Pro     995.75      45.49%
5     Claude 3.5 Sonnet  989.71      43.15%
6     AWS Nova Pro       903.02      42.13%
7     GPT-4o             895.53      42.86%
8     Llama 3.2 90B      717.51      12.86%