Use case: Multimodal Reasoning

About this leaderboard

This leaderboard tests combined visual and textual reasoning on spatial layouts, charts, documents, OCR, temporal sequences, and grounded captions to reveal which models genuinely fuse the two modalities.

We stress-test models with curated simulations that blend structured benchmarks and open-ended prompts. Each table captures a distinct slice of the use case, and models are compared on shared metrics (Elo rating and head-to-head win rate) to surface leaders, trade-offs, and surprising strengths.
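
The page does not document its rating procedure, so the following is a rough sketch only: assuming the Elo rating column is derived from standard Elo updates over pairwise head-to-head comparisons, the update rule looks like this in Python (the function names, the K-factor of 32, and the 400-point logistic scale are conventional Elo defaults, not values taken from this leaderboard):

    # Illustrative Elo update over pairwise model comparisons.
    # ASSUMPTIONS: K = 32 and the 400-point logistic scale are standard
    # Elo defaults, not parameters documented by this leaderboard.
    def expected_score(r_a, r_b):
        # Probability that model A beats model B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, a_won, k=32.0):
        # Return updated (r_a, r_b) after one head-to-head comparison.
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        delta = k * (s_a - e_a)
        return r_a + delta, r_b - delta

    # Example: expected_score(1220.00, 1119.62) is roughly 0.64.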

Spatial

Rank  Model              Elo rating  Win rate
1     Claude 3.5 Sonnet  1220.00     75.56%
2     Gemini 2.0 Flash   1119.62     61.98%
3     O1                 1105.56     58.89%
4     Pixtral Large      1074.21     60.99%
5     Gemini 1.5 Pro     1007.70     50.88%
6     GPT-4o             990.77      52.97%
7     AWS Nova Pro       747.61      15.94%
8     Llama 3.2 90B      734.52      25.00%
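
Reading the table: under the logistic Elo model sketched above, the roughly 100-point gap between Claude 3.5 Sonnet (1220.00) and Gemini 2.0 Flash (1119.62) corresponds to an expected head-to-head win probability of about 1 / (1 + 10^(-100.38/400)) ≈ 0.64 in Claude's favor. Note that rank follows Elo rating rather than raw win rate against the whole pool, which is why O1 (58.89%) can rank above Pixtral Large (60.99%) here.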

Captioning

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1144.53     68.05%
2     Claude 3.5 Sonnet  1114.53     68.40%
3     Gemini 1.5 Pro     1111.40     62.78%
4     AWS Nova Pro       1028.58     67.39%
5     Llama 3.2 90B      986.60      35.97%
6     Gemini 2.0 Flash   950.50      41.13%
7     GPT-4o             928.24      37.45%
8     O1                 735.62      15.45%

Differences

Rank  Model              Elo rating  Win rate
1     Pixtral Large      1172.83     73.08%
2     O1                 1172.80     76.49%
3     Gemini 2.0 Flash   1132.23     60.48%
4     Claude 3.5 Sonnet  1070.33     49.78%
5     GPT-4o             1056.78     44.00%
6     AWS Nova Pro       880.61      42.26%
7     Gemini 1.5 Pro     870.03      37.93%
8     Llama 3.2 90B      644.39      11.26%

Storytelling

Rank  Model              Elo rating  Win rate
1     O1                 1303.60     89.74%
2     Gemini 2.0 Flash   1188.20     70.61%
3     Pixtral Large      1006.68     52.24%
4     Gemini 1.5 Pro     995.75      45.49%
5     Claude 3.5 Sonnet  989.71      43.15%
6     AWS Nova Pro       903.02      42.13%
7     GPT-4o             895.53      42.86%
8     Llama 3.2 90B      717.51      12.86%