BenchLM Benchmarks
165 benchmarks · 240 model scores · Data from Jun 2, 2026
Multimodal · 35 benchmarks
mmmu
9 models
1 · Qwen3.6 Plus · Alibaba · 86.0%
2 · Qwen3.5-122B-A10B · Alibaba · 83.9%
3 · Qwen3.6-27B · Alibaba · 82.9%
4 · Qwen3.5-27B · Alibaba · 82.3%
5 · Qwen3.6-35B-A3B · Alibaba · 81.7%
+4 more
mmmu Pro
28 models
1 · GPT-5.4 Pro · OpenAI · 94%
2 · Claude Mythos Preview · Anthropic · 92.7%
3 · Gemini 3.1 Pro · Google · 83.9%
4 · Gemini 3.5 Flash · Google · 83.6%
5 · GPT-5.5 · OpenAI · 81.2%
+23 more
aa Mmmu Pro
68 models
1 · Gemini 3.5 Flash · Google · 84.3%
2 · Gemini 3.1 Pro · Google · 82.4%
3 · Muse Spark · Meta · 80.5%
4 · Gemini 3 Pro · Google · 80.2%
5 · GPT-5.5 · OpenAI · 79.9%
+63 more
office Qa Pro
5 models
1 · Claude Opus 4.8 · Anthropic · 66.2%
2 · GPT-5.5 · OpenAI · 54.1%
3 · GPT-5.4 · OpenAI · 53.2%
4 · MiniMax M3 · MiniMax · 45.1%
5 · Claude Opus 4.7 (Adaptive) · Anthropic · 43.6%
mmmu Pro Python
5 models
1 · GPT-5.5 · OpenAI · 83.2%
2 · GPT-5.4 · OpenAI · 82.1%
3 · Kimi K2.6 · Moonshot AI · 80.1%
4 · GPT-5.4 mini · OpenAI · 78%
5 · GPT-5.4 nano · OpenAI · 69.5%
real World Qa
3 models
1 · Qwen3.6-35B-A3B · Alibaba · 85.3%
2 · Qwen3.6-27B · Alibaba · 84.1%
3 · LFM2.5-VL-450M · LiquidAI · 58.4%
video Mme With Sub
4 models
1 · Qwen3.6-27B · Alibaba · 87.7%
2 · MiMo-V2.5 · Xiaomi · 87.7%
3 · Qwen3.6-35B-A3B · Alibaba · 86.6%
4 · MiniMax M3 · MiniMax · 85.4%
math Vision
9 models
1 · Qwen3.5 397B · Alibaba · 88.6%
2 · Qwen3.6 Plus · Alibaba · 88.0%
3 · Kimi K2.6 · Moonshot AI · 87.4%
4 · Gemini 3 Pro · Google · 86.6%
5 · Qwen3.5-122B-A10B · Alibaba · 86.2%
+4 more
refcoco Avg
4 models
1 · Qwen3.6-27B · Alibaba · 92.5%
2 · Qwen3.6-35B-A3B · Alibaba · 92.0%
3 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · 90.5%
4 · Interfaze Beta · Interfaze · 82.1%
erqa
6 models
1 · Gemini 3.1 Pro · Google · 69.4%
2 · GPT-5.4 · OpenAI · 65.4%
3 · Muse Spark · Meta · 64.7%
4 · Qwen3.6-27B · Alibaba · 62.5%
5 · Grok 4.20 · xAI · 54.1%
+1 more
video Mmmu
8 models
1 · Gemini 3 Pro · Google · 87.6%
2 · Kimi K2.5 · Moonshot AI · 86.6%
3 · Qwen3.5 397B · Alibaba · 84.7%
4 · MiniMax M3 · MiniMax · 84.6%
5 · Claude Opus 4.5 · Anthropic · 84.4%
+3 more
mmvu
4 models
1 · Kimi K2.5 · Moonshot AI · 80.4%
2 · Qwen3.5-122B-A10B · Alibaba · 74.7%
3 · Qwen3.5-27B · Alibaba · 73.3%
4 · Qwen3.5-35B-A3B · Alibaba · 72.3%
screen Spot Pro
14 models
1 · Claude Opus 4.8 · Anthropic · 87.9%
2 · GPT-5.4 · OpenAI · 85.4%
3 · Gemini 3.1 Pro · Google · 84.4%
4 · Muse Spark · Meta · 84.1%
5 · Claude Opus 4.6 · Anthropic · 83.1%
+9 more
med Xpert Qa Mm
5 models
1 · Gemini 3.1 Pro · Google · 81.3%
2 · Muse Spark · Meta · 78.4%
3 · GPT-5.4 · OpenAI · 77.1%
4 · Grok 4.20 · xAI · 65.8%
5 · Claude Opus 4.6 · Anthropic · 64.8%
simple Vqa
7 models
1 · Step 3.7 Flash · StepFun · 79.2%
2 · Gemini 3.1 Pro · Google · 72.4%
3 · Muse Spark · Meta · 71.3%
4 · GPT-5.4 · OpenAI · 61.1%
5 · Qwen3.6-35B-A3B · Alibaba · 58.9%
+2 more
v Star
11 models
1 · Kimi K2.6 · Moonshot AI · 96.9%
2 · Qwen3.6 Plus · Alibaba · 96.9%
3 · Qwen3.5 397B · Alibaba · 95.8%
4 · Step 3.7 Flash · StepFun · 95.3%
5 · Qwen3.6-27B · Alibaba · 94.7%
+6 more
charxiv
22 models
1 · Claude Mythos Preview · Anthropic · 93.2%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 91%
3 · Claude Opus 4.8 · Anthropic · 89.9%
4 · Muse Spark · Meta · 86.4%
5 · Gemini 3.5 Flash · Google · 84.2%
+17 more
charxiv No Tools
3 models
1 · Claude Mythos Preview · Anthropic · 86.1%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 82.1%
3 · Claude Opus 4.8 · Anthropic · 80.5%