context.vn

BenchLM Benchmarks

165 benchmarks · 3361 model scores · Data from Jun 2, 2026

Coding · 23 benchmarks

humaneval

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 76.8%
2. DeepSeek V4 Flash Base · DeepSeek · 69.5%
big Code Bench

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 59.2%
2. DeepSeek V4 Flash Base · DeepSeek · 56.8%
codeforces

4 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 3206.0
2. DeepSeek V4 Flash (Max) · DeepSeek · 3052.0
3. DeepSeek V4 Pro (High) · DeepSeek · 2919.0
4. DeepSeek V4 Flash (High) · DeepSeek · 2816.0
swe Verified

49 models

1. Claude Mythos Preview · Anthropic · 93.9%
2. Claude Opus 4.8 · Anthropic · 88.6%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 87.6%
4. GPT-5.3 Codex · OpenAI · 85%
5. Claude Opus 4.5 · Anthropic · 80.9%
+44 more
swe Rebench

13 models

1. Claude Opus 4.6 · Anthropic · 65.3%
2. GLM-5 · Z.AI · 62.8%
3. GLM-5.1 · Z.AI · 62.7%
4. DeepSeek V3.2 · DeepSeek · 60.9%
5. Claude Sonnet 4.6 · Anthropic · 60.7%
+8 more
live Code Bench

14 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 93.5%
2. Qwen3.7 Max · Alibaba · 91.6%
3. DeepSeek V4 Flash (Max) · DeepSeek · 91.6%
4. DeepSeek V4 Pro (High) · DeepSeek · 89.8%
5. Kimi K2.6 · Moonshot AI · 89.6%
+9 more
live Code Bench V6

8 models

1. Kimi K2.6 · Moonshot AI · 89.6%
2. Qwen3.6 Plus · Alibaba · 87.1%
3. Kimi K2.5 · Moonshot AI · 85.0%
4. Claude Opus 4.5 · Anthropic · 84.8%
5. Qwen3.5 397B · Alibaba · 83.6%
+3 more
live Code Bench Pro

6 models

1. GPT-5.4 · OpenAI · 87.5%
2. Gemini 3.1 Pro · Google · 82.9%
3. Muse Spark · Meta · 80.0%
4. Grok 4.20 · xAI · 74.2%
5. Claude Opus 4.6 · Anthropic · 70.7%
+1 more
swe Pro

35 models

1. Claude Mythos Preview · Anthropic · 77.8%
2. Claude Opus 4.8 · Anthropic · 69.2%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 64.3%
4. Qwen3.7 Max · Alibaba · 60.6%
5. MiniMax M3 · MiniMax · 59%
+30 more
swe Multilingual

20 models

1. Claude Opus 4.8 · Anthropic · 84.4%
2. Composer 2.5 · Cursor · 79.8%
3. Qwen3.7 Max · Alibaba · 78.3%
4. Claude Opus 4.5 · Anthropic · 77.5%
5. Kimi K2.6 · Moonshot AI · 76.7%
+15 more
swe Multimodal

1 model

1. Claude Opus 4.8 · Anthropic · 38.4%
cursor Bench31

7 models

1. Claude Opus 4.7 (Adaptive) · Anthropic · 64.8%
2. Composer 2.5 · Cursor · 63.2%
3. GPT-5.5 · OpenAI · 59.2%
4. Composer 2 · Cursor · 52.2%
5. Gemini 3.5 Flash · Google · 49.8%
+2 more
multi Swe Bench

1 model

1. MiniMax M2.7 · MiniMax · 52.7%
vibe Pro

1 model

1. MiniMax M2.7 · MiniMax · 55.6%
program Bench

8 models

5. Claude Opus 4.6 · claude-opus-4-6-programbench · Closed
7. Claude Sonnet 4.6 · claude-sonnet-4-6-programbench · Closed
8. GPT-5.4 · gpt-5-4-programbench · Closed
9. Gemini 3.1 Pro · gemini-3-1-pro-programbench · Closed
10. Gemini 3 Flash · gemini-3-flash-programbench · Closed
+3 more
nl2 Repo

8 models

1. Qwen3.7 Max · Alibaba · 47.2%
2. Claude Opus 4.5 · Anthropic · 43.2%
3. Qwen 3.6 Max (preview) · Alibaba · 42.9%
4. GLM-5.1 · Z.AI · 42.7%
5. MiniMax M3 · MiniMax · 42.1%
+3 more
react Native Evals

16 models

1. Composer 2 · Cursor · 96.1%
2. Composer 2 Fast · Cursor · 94.9%
3. GPT-5.4 · OpenAI · 85.3%
4. GPT-5.5 · OpenAI · 84.7%
5. Claude Opus 4.6 · Anthropic · 84.1%
+11 more
swe Verified Arcee

5 models

1. Claude Opus 4.6 · Anthropic · 75.6%
2. MiniMax M2.7 · MiniMax · 75.4%
3. GLM-5 · Z.AI · 72.8%
4. Kimi K2.5 · Moonshot AI · 70.8%
5. Trinity-Large-Thinking · Arcee AI · 63.2%
spider2 Lite

1 model

1. Interfaze Beta · Interfaze · 52.9%
sci Code

9 models

1. Qwen3.7 Max · Alibaba · 53.5%
2. Gemini 3.5 Flash · Google · 53.1%
3. Kimi K2.6 · Moonshot AI · 52.2%
4. Kimi K2.5 · Moonshot AI · 48.7%
5. Grok 4.3 · xAI · 47.3%
+4 more
aa Coding Index

119 models

1. GPT-5.5 · OpenAI · 59.1%
2. GPT-5.4 · OpenAI · 57.3%
3. Gemini 3.1 Pro · Google · 55.5%
4. GPT-5.3 Codex · OpenAI · 53.1%
5. Claude Opus 4.7 · Anthropic · 53.1%
+114 more
aa Sci Code

122 models

1. Gemini 3.1 Pro · Google · 58.9%
2. GPT-5.4 · OpenAI · 56.6%
3. GPT-5.5 · OpenAI · 56.1%
4. Gemini 3 Pro · Google · 56.1%
5. GPT-5.2-Codex · OpenAI · 54.6%
+117 more
terminal Bench Hard

115 models

1. GPT-5.5 · OpenAI · 60.6%
2. GPT-5.4 · OpenAI · 57.6%
3. Claude Opus 4.7 · Anthropic · 54.5%
4. Gemini 3.1 Pro · Google · 53.8%
5. GPT-5.3 Codex · OpenAI · 53.0%
+110 more

Agentic · 32 benchmarks

terminal Bench2

24 models

1. GPT-5.5 · OpenAI · 82.0%
2. Gemini 3.5 Flash · Google · 76.2%
3. Claude Opus 4.8 · Anthropic · 74.6%
4. Qwen3.7 Max · Alibaba · 69.7%
5. Claude Opus 4.7 (Adaptive) · Anthropic · 69.4%
+19 more
browse Comp

24 models

1. GPT-5.5 Pro · OpenAI · 90.1%
2. GPT-5.4 Pro · OpenAI · 89.3%
3. Claude Mythos Preview · Anthropic · 86.9%
4. GPT-5.5 · OpenAI · 84.4%
5. Claude Opus 4.8 · Anthropic · 84.3%
+19 more
hle With Tools

6 models

1. Qwen3.7 Max · Alibaba · 53.5%
2. DeepSeek V4 Pro (Max) · DeepSeek · 48.2%
3. Step 3.7 Flash · StepFun · 47.2%
4. DeepSeek V4 Flash (Max) · DeepSeek · 45.1%
5. DeepSeek V4 Pro (High) · DeepSeek · 44.7%
+1 more
gdpval Aa

113 models

1. Claude Opus 4.8 · Anthropic · 1890
2. GPT-5.5 · OpenAI · 1769
3. Claude Opus 4.7 (Adaptive) · Anthropic · 1753
4. Claude Opus 4.7 · Anthropic · 1680
5. GPT-5.4 · OpenAI · 1674
+108 more
gdpval Aa Normalized

114 models

1. GPT-5.5 · OpenAI · 63.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 62.6%
3. Claude Opus 4.7 · Anthropic · 59.0%
4. GPT-5.4 · OpenAI · 58.7%
5. Gemini 3.5 Flash · Google · 57.8%
+109 more
aa Agentic Index

113 models

1. GPT-5.5 · OpenAI · 74.1%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 71.3%
3. Gemini 3.5 Flash · Google · 70.3%
4. GPT-5.4 · OpenAI · 68.0%
5. Claude Opus 4.6 (Adaptive) · Anthropic · 67.6%
+108 more
apex Agents Aa

20 models

1. Gemini 3.5 Flash · Google · 47.1%
2. GPT-5.5 · OpenAI · 37.7%
3. GPT-5.4 · OpenAI · 33.3%
4. Claude Opus 4.6 (Adaptive) · Anthropic · 33.0%
5. Gemini 3.1 Pro · Google · 32.0%
+15 more
gert Labs

54 models

1. Claude Opus 4.8 · Anthropic · 72.97%
2. GPT-5.5 · OpenAI · 72.93%
3. Claude Opus 4.7 · Anthropic · 65.59%
4. GPT-5.4 · OpenAI · 64.89%
5. Qwen3.7 Max · Alibaba · 64.27%
+49 more
os World Verified

21 models

1. Claude Opus 4.8 · Anthropic · 83.4%
2. Holo3-35B-A3B · H Company · 82.6%
3. Claude Mythos Preview · Anthropic · 79.6%
4. Holo3-122B-A10B · H Company · 78.8%
5. GPT-5.5 · OpenAI · 78.7%
+16 more
cyber Gym

10 models

1. Claude Mythos Preview · Anthropic · 83.1%
2. GPT-5.5 · OpenAI · 81.8%
3. GPT-5.4 · OpenAI · 79.0%
4. Claude Opus 4.7 (Adaptive) · Anthropic · 73.1%
5. GLM-5.1 · Z.AI · 68.7%
+5 more
os World

2 models

1. Claude Opus 4.5 · Anthropic · 66.3%
2. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 47.4%
android World

1 model

1. Qwen3.6-27B · Alibaba · 70.3%
mcp Atlas

23 models

1. Gemini 3.5 Flash · Google · 83.6%
2. Claude Opus 4.8 · Anthropic · 82.2%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 77.3%
4. Qwen3.7 Max · Alibaba · 76.4%
5. GPT-5.5 · OpenAI · 75.3%
+18 more
toolathlon

21 models

1. Claude Opus 4.8 · Anthropic · 59.9%
2. Gemini 3.5 Flash · Google · 56.5%
3. GPT-5.5 · OpenAI · 55.6%
4. GPT-5.4 · OpenAI · 54.6%
5. DeepSeek V4 Pro (Max) · DeepSeek · 51.8%
+16 more
tau2 Bench

116 models

1. GLM-5V-Turbo · Z.AI · 98.5%
2. GLM-5-Turbo · Z.AI · 98.5%
3. GLM-5 · Z.AI · 98.2%
4. GLM-5.1 · Z.AI · 97.7%
5. Qwen3.6 Plus · Alibaba · 97.7%
+111 more
deep Search Qa

9 models

1. Claude Opus 4.8 · Anthropic · 93.1%
2. Step 3.7 Flash · StepFun · 92.8%
3. Kimi K2.6 · Moonshot AI · 92.5%
4. Kimi K2.5 · Moonshot AI · 77.1%
5. Muse Spark · Meta · 74.8%
+4 more
tau2 Airline

1 model

1. ZAYA1-74B-Preview · Zyphra · 56.1%
pinch Bench

46 models

3. MiniMax M2.7 · MiniMax · minimax/minimax-m2.7
4. Claude Opus 4.6 · anthropic/claude-opus-4.6 · Closed
5. MiMo-V2-Omni · xiaomi/mimo-v2-omni · Closed
6. GLM-5.1 · z-ai/glm-5.1 · Open
7. Qwen3.5-122B-A10B · qwen/qwen3.5-122b-a10b · Open
+41 more
open Hands Index

25 models

1. Claude Opus 4.7 (Adaptive) · Anthropic · claude-opus-4-7
2. Claude Opus 4.6 · Anthropic · claude-opus-4-6
3. GPT-5.5 · OpenAI · 65.9%
4. GPT-5.4 · OpenAI · 64.3%
5. Claude Opus 4.5 · claude-opus-4-5 · Closed
+20 more
swe Atlas Refactoring

10 models

2. GPT-5.5 · OpenAI · Gpt-5.5 (Codex)
3. GPT-5.4 · OpenAI · Gpt-5.4 (Codex)
4. GPT-5.3 Codex · Gpt-5.3 (Codex) · Closed
5. Claude Opus 4.6 · Opus-4.6 (Claude Code) · Closed
6. Gemini 3.1 Pro · Gemini-3.1-Pro (Gemini CLI) · Closed
+5 more
inference Bench

14 models

1. Claude Sonnet 4.6 · Anthropic · 8.08x
2. GLM-5 · Z.AI · 6.20x
3. Gemini 3.1 Pro · Google · 6.16x
4. GPT-5.3 Codex (High) · OpenAI · 5.48x
5. GPT-5.4 (High) · OpenAI · 5.08x
+9 more
bfcl V4

5 models

1. Qwen3.7 Max · Alibaba · 75.0%
2. LFM2.5-8B-A1B · LiquidAI · 49.7%
3. ZAYA1-8B · Zyphra · 39.2%
4. MiniCPM5-1B · OpenBMB · 25.1%
5. LFM2.5-VL-450M · LiquidAI · 21.1%
mle Bench Lite

1 model

1. MiniMax M2.7 · MiniMax · 66.6%
mm Claw Bench

2 models

1. MiniMax M2.7 · MiniMax · 62.7%
2. MiMo-V2.5 · Xiaomi · 23.8%
claw Eval

22 models

1. Claude Opus 4.6 · Anthropic · opus46
2. Claude Sonnet 4.6 · Anthropic · sonnet46
3. MiMo-V2.5-Pro · Xiaomi · mimo_v25_pro
4. Muse Spark · muse_spark · Closed
5. Kimi K2.6 · kimi_k26 · Open
+17 more
qwen Claw Bench

9 models

1. Qwen3.7 Max · Alibaba · 64.3%
2. Qwen 3.6 Max (preview) · Alibaba · 59.0%
3. Qwen3.6 Plus · Alibaba · 57.2%
4. Kimi K2.5 · Moonshot AI · 54.3%
5. GLM-5 · Z.AI · 54.1%
+4 more
qwen Web Bench

4 models

1. Qwen3.7 Max · Alibaba · 1568
2. Qwen 3.6 Max (preview) · Alibaba · 1532
3. Qwen3.6-27B · Alibaba · 1487
4. Qwen3.6-35B-A3B · Alibaba · 1397
tau3 Bench

9 models

1. Mistral Medium 3.5 128B · Mistral · 91.4%
2. MiMo-V2.5-Pro · Xiaomi · 72.9%
3. Qwen3.6 Plus · Alibaba · 70.7%
4. GLM-5.1 · Z.AI · 70.6%
5. Claude Opus 4.5 · Anthropic · 70.2%
+4 more
vita Bench

8 models

1. Qwen3.7 Max · Alibaba · 47.9%
2. Qwen3.6 Plus · Alibaba · 44.3%
3. Qwen3.5 397B · Alibaba · 43.7%
4. Qwen3.6-35B-A3B · Alibaba · 35.6%
5. Claude Opus 4.5 · Anthropic · 23.3%
+3 more
deep Planning

6 models

1. Qwen3.6 Plus · Alibaba · 41.5%
2. Qwen3.5 397B · Alibaba · 37.6%
3. Claude Opus 4.5 · Anthropic · 26.4%
4. Qwen3.6-35B-A3B · Alibaba · 25.9%
5. GLM-5 · Z.AI · 14.6%
+1 more
mcp Tasks

5 models

1. Qwen3.5 397B · Alibaba · 74.2%
2. Qwen3.6 Plus · Alibaba · 74.1%
3. Claude Opus 4.5 · Anthropic · 71.8%
4. GLM-5 · Z.AI · 60.8%
5. Kimi K2.5 · Moonshot AI · 59.1%
wide Research

7 models

1. Kimi K2.6 · Moonshot AI · 80.8%
2. Claude Opus 4.5 · Anthropic · 76.4%
3. Qwen3.6 Plus · Alibaba · 74.3%
4. Qwen3.5 397B · Alibaba · 74.0%
5. Kimi K2.5 · Moonshot AI · 72.7%
+2 more

Reasoning · 19 benchmarks

bbh

3 models

1. DeepSeek V4 Pro Base · DeepSeek · 87.5%
2. DeepSeek V4 Flash Base · DeepSeek · 86.9%
3. MiniCPM5-1B · OpenBMB · 71.9%
drop

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 88.7%
2. DeepSeek V4 Flash Base · DeepSeek · 88.6%
hellaswag

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 88.0%
2. DeepSeek V4 Flash Base · DeepSeek · 85.7%
winogrande

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 81.5%
2. DeepSeek V4 Flash Base · DeepSeek · 79.5%
cluewsc

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 85.2%
2. DeepSeek V4 Flash Base · DeepSeek · 82.2%
lisan Bench

62 models

4. GPT 5.4 (medium) · openai/gpt-5.4:thinking-medium · Closed
6. Gemini 3.1 Pro Preview (high) · google/gemini-3.1-pro-preview:thinking-high · Closed
7. Grok 4 (medium) · x-ai/grok-4:thinking-medium · Closed
8. Grok 4.20 Beta (thinking) · x-ai/grok-4.20-beta:thinking · Closed
9. GPT 5 (medium) · openai/gpt-5 · Closed
+57 more
pp Bench

41 models

1. GPT-5.5 · OpenAI · gpt-5.5@xhigh
2. GPT-5.4 · OpenAI · gpt-5.4@xhigh
3. GPT-5.2 · OpenAI · gpt-5.2@xhigh
4. Claude Opus 4.7 · claude-opus-4-7@thinking · Closed
5. Gemini 3.5 Flash · gemini-3.5-flash@high · Closed
+36 more
long Bench V2

10 models

1. Claude Opus 4.5 · Anthropic · 64.4%
2. Qwen3.5 397B · Alibaba · 63.2%
3. Qwen3.6 Plus · Alibaba · 62%
4. Kimi K2.5 · Moonshot AI · 61%
5. GLM-5 · Z.AI · 60.8%
+5 more
mrcrv2

2 models

1. Qwen3.7 Max · Alibaba · 90.4%
2. Gemini 3.5 Flash · Google · 77.3%
mrcrv2 64 128

1 model

1. GPT-5.5 · OpenAI · 83.1%
mrcrv2 128 256

2 models

1. GPT-5.5 · OpenAI · 87.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 59.2%
mrcr1m

7 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 83.5%
2. DeepSeek V4 Pro (High) · DeepSeek · 83.3%
3. DeepSeek V4 Flash (Max) · DeepSeek · 78.7%
4. DeepSeek V4 Flash (High) · DeepSeek · 76.9%
5. DeepSeek V4 Pro · DeepSeek · 44.7%
+2 more
corpus Qa1m

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 62.0%
2. DeepSeek V4 Flash (Max) · DeepSeek · 60.5%
3. DeepSeek V4 Flash (High) · DeepSeek · 59.3%
4. DeepSeek V4 Pro (High) · DeepSeek · 56.5%
5. DeepSeek V4 Pro · DeepSeek · 35.6%
+1 more
arc Agi2

11 models

1. GPT-5.5 · OpenAI · 85%
2. GPT-5.4 Pro · OpenAI · 83.3%
3. Gemini 3.1 Pro · Google · 77.1%
4. Claude Opus 4.7 (Adaptive) · Anthropic · 75.8%
5. Gemini 3.5 Flash · Google · 72.1%
+6 more
ai Needle

4 models

1. Claude Opus 4.5 · Anthropic · 74%
2. Qwen3.5 397B · Alibaba · 68.7%
3. Qwen3.6 Plus · Alibaba · 68.3%
4. GLM-5 · Z.AI · 63.3%
gpqa Diamond

29 models

1. Gemini 3.1 Pro · Google · 94.3%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 94.2%
3. Claude Opus 4.8 · Anthropic · 93.6%
4. GPT-5.5 · OpenAI · 93.6%
5. GPT-5.4 · OpenAI · 92.8%
+24 more
lcr

115 models

1. GPT-5.2-Codex · OpenAI · 75.7%
2. GPT-5 (high) · OpenAI · 75.6%
3. GPT-5.1 · OpenAI · 75.0%
4. GPT-5.5 · OpenAI · 74.3%
5. GPT-5.4 · OpenAI · 74.0%
+110 more
critpt

116 models

1. GPT-5.4 Pro · OpenAI · 30.0%
2. GPT-5.5 · OpenAI · 27.1%
3. Gemini 3 Pro Deep Think · Google · 25.7%
4. GPT-5.4 · OpenAI · 23.4%
5. Gemini 3.1 Pro · Google · 17.7%
+111 more
bullshit Bench V2

63 models

4. Claude Opus 4.6 (high) · anthropic/claude-opus-4.6@reasoning=high · Closed
6. Claude Opus 4.7 (none) · anthropic/claude-opus-4.7@reasoning=none · Closed
7. Claude Sonnet 4.5 (high) · anthropic/claude-sonnet-4.5@reasoning=high · Closed
9. Qwen3.5 397B (Reasoning) (high) · qwen/qwen3.5-397b-a17b@reasoning=high · Open
10. Claude Haiku 4.5 (high) · anthropic/claude-haiku-4.5@reasoning=high · Closed
+58 more

Knowledge · 27 benchmarks

mmlu

8 models

1. o1 · OpenAI · 91.8%
2. GPT-4.1 · OpenAI · 90.2%
3. DeepSeek V4 Pro Base · DeepSeek · 90.1%
4. DeepSeek V4 Flash Base · DeepSeek · 88.7%
5. GPT-4.1 mini · OpenAI · 87.5%
+3 more
gpqa

54 models

1. Claude Mythos Preview · Anthropic · 94.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 94.2%
3. Claude Opus 4.8 · Anthropic · 93.6%
4. GPT-5.5 · OpenAI · 93.6%
5. GPT-5.4 · OpenAI · 92.8%
+49 more
super Gpqa

18 models

1. Claude Opus 4.6 · Anthropic · 95%
2. Claude Sonnet 4.6 · Anthropic · 95%
3. Qwen 3.6 Max (preview) · Alibaba · 73.9%
4. Qwen3.7 Max · Alibaba · 73.6%
5. Qwen3.6 Plus · Alibaba · 71.6%
+13 more
mmlu Pro

36 models

1. Qwen3.7 Max · Alibaba · 89.6%
2. Claude Opus 4.5 · Anthropic · 89.5%
3. Qwen3.6 Plus · Alibaba · 88.5%
4. Qwen3.5 397B · Alibaba · 87.8%
5. DeepSeek V4 Pro (Max) · DeepSeek · 87.5%
+31 more
agieval

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 83.1%
2. DeepSeek V4 Flash Base · DeepSeek · 82.6%
hle

36 models

1. Claude Mythos Preview · Anthropic · 64.7%
2. GPT-5.4 Pro · OpenAI · 58.7%
3. Claude Opus 4.8 · Anthropic · 57.9%
4. GPT-5.5 Pro · OpenAI · 57.2%
5. Claude Opus 4.7 (Adaptive) · Anthropic · 54.7%
+31 more
frontier Science

1 model

1. GPT-5.4 Pro · OpenAI · 36.7%
artificial Analysis

126 models

1. GPT-5.5 · OpenAI · 60.2%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 57.3%
3. Gemini 3.1 Pro · Google · 57.2%
4. GPT-5.4 · OpenAI · 56.8%
5. Qwen3.7 Max · Alibaba · 56.6%
+121 more
aa Gpqa Diamond

122 models

1. Gemini 3.1 Pro · Google · 94.1%
2. GPT-5.5 · OpenAI · 93.5%
3. Qwen3.7 Max · Alibaba · 92.3%
4. Gemini 3.5 Flash · Google · 92.2%
5. GPT-5.4 · OpenAI · 92.0%
+117 more
aa Hle

122 models

1. Gemini 3.1 Pro · Google · 44.7%
2. GPT-5.5 · OpenAI · 44.3%
3. GPT-5.4 · OpenAI · 41.6%
4. Gemini 3.5 Flash · Google · 41.0%
5. GPT-5.3 Codex · OpenAI · 39.9%
+117 more
aa Omniscience Index

114 models

1. Gemini 3.1 Pro · Google · 32.9%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 26.2%
3. Gemini 3.5 Flash · Google · 22.7%
4. GPT-5.5 · OpenAI · 20.1%
5. Grok 4.3 · xAI · 18.3%
+109 more
omniscience Accuracy

114 models

1. GPT-5.5 · OpenAI · 56.9%
2. Gemini 3 Pro · Google · 55.9%
3. Gemini 3.1 Pro · Google · 55.3%
4. Gemini 3.5 Flash · Google · 51.9%
5. GPT-5.3 Codex · OpenAI · 51.8%
+109 more
omniscience Hallucination Rate

113 models

1. Command A+ · Cohere · 14.1%
2. Qwen3.7 Max · Alibaba · 22.9%
3. MiMo-V2.5-Pro · Xiaomi · 24.5%
4. Grok 4.3 · xAI · 25.0%
5. GLM-5.1 · Z.AI · 29.4%
+108 more
simple Qa

8 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 57.9%
2. DeepSeek V4 Pro Base · DeepSeek · 55.2%
3. DeepSeek V4 Pro (High) · DeepSeek · 46.2%
4. DeepSeek V4 Pro · DeepSeek · 45%
5. DeepSeek V4 Flash (Max) · DeepSeek · 34.1%
+3 more
chinese Simple Qa

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 84.4%
2. DeepSeek V4 Flash (Max) · DeepSeek · 78.9%
3. DeepSeek V4 Pro (High) · DeepSeek · 77.7%
4. DeepSeek V4 Pro · DeepSeek · 75.8%
5. DeepSeek V4 Flash (High) · DeepSeek · 73.2%
+1 more
health Bench Hard

5 models

1. Muse Spark · Meta · 42.8%
2. GPT-5.4 · OpenAI · 40.1%
3. Gemini 3.1 Pro · Google · 20.6%
4. Grok 4.20 · xAI · 20.3%
5. Claude Opus 4.6 · Anthropic · 14.8%
med Xpert Qa Text

5 models

1. Gemini 3.1 Pro · Google · 71.5%
2. GPT-5.4 · OpenAI · 59.6%
3. Muse Spark · Meta · 52.6%
4. Claude Opus 4.6 · Anthropic · 52.1%
5. Grok 4.20 · xAI · 50.2%
frontier Science Research

1 model

1. GPT-5.4 Pro · OpenAI · 36.7%
hle No Tools

16 models

1. Claude Mythos Preview · Anthropic · 56.8%
2. Claude Opus 4.8 · Anthropic · 49.8%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 46.9%
4. Gemini 3.1 Pro · Google · 45.4%
5. GPT-5.5 Pro · OpenAI · 43.1%
+11 more
mmlu Pro Arcee

6 models

1. Claude Opus 4.6 · Anthropic · 89.1%
2. Kimi K2.5 · Moonshot AI · 87.1%
3. GLM-5 · Z.AI · 85.8%
4. Trinity-Large-Thinking · Arcee AI · 83.4%
5. MiniMax M2.7 · MiniMax · 80.8%
+1 more
mmlu Redux

8 models

1. Claude Opus 4.5 · Anthropic · 96.6%
2. Qwen3.7 Max · Alibaba · 95%
3. Qwen3.5 397B · Alibaba · 94.9%
4. Qwen3.6 Plus · Alibaba · 94.5%
5. Qwen3.6-27B · Alibaba · 93.5%
+3 more
mmmlu

4 models

1. Interfaze Beta · Interfaze · 90.9%
2. Qwen3.7 Max · Alibaba · 90.3%
3. DeepSeek V4 Pro Base · DeepSeek · 90.3%
4. DeepSeek V4 Flash Base · DeepSeek · 88.8%
c Eval

7 models

1. Qwen3.6 Plus · Alibaba · 93.3%
2. DeepSeek V4 Pro Base · DeepSeek · 93.1%
3. Qwen3.5 397B · Alibaba · 93%
4. Claude Opus 4.5 · Anthropic · 92.2%
5. DeepSeek V4 Flash Base · DeepSeek · 92.1%
+2 more
cmmlu

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 90.8%
2. DeepSeek V4 Flash Base · DeepSeek · 90.4%
multi Lo Ko

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 51.1%
2. DeepSeek V4 Flash Base · DeepSeek · 42.2%
facts Parametric

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 62.6%
2. DeepSeek V4 Flash Base · DeepSeek · 33.9%
trivia Qa

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 85.6%
2. DeepSeek V4 Flash Base · DeepSeek · 82.8%

Multimodal · 35 benchmarks

mmmu

9 models

1. Qwen3.6 Plus · Alibaba · 86.0%
2. Qwen3.5-122B-A10B · Alibaba · 83.9%
3. Qwen3.6-27B · Alibaba · 82.9%
4. Qwen3.5-27B · Alibaba · 82.3%
5. Qwen3.6-35B-A3B · Alibaba · 81.7%
+4 more
mmmu Pro

28 models

1. GPT-5.4 Pro · OpenAI · 94%
2. Claude Mythos Preview · Anthropic · 92.7%
3. Gemini 3.1 Pro · Google · 83.9%
4. Gemini 3.5 Flash · Google · 83.6%
5. GPT-5.5 · OpenAI · 81.2%
+23 more
aa Mmmu Pro

68 models

1. Gemini 3.5 Flash · Google · 84.3%
2. Gemini 3.1 Pro · Google · 82.4%
3. Muse Spark · Meta · 80.5%
4. Gemini 3 Pro · Google · 80.2%
5. GPT-5.5 · OpenAI · 79.9%
+63 more
ocr Bench V2

1 model

1. Interfaze Beta · Interfaze · 70.7%
olm Ocr

1 model

1. Interfaze Beta · Interfaze · 85.7%
vox Populi Wer

1 model

1. Interfaze Beta · Interfaze · 2.4%
design Arena Website

1 model

1. Grok 4.3 · xAI · 1294
office Qa Pro

5 models

1. Claude Opus 4.8 · Anthropic · 66.2%
2. GPT-5.5 · OpenAI · 54.1%
3. GPT-5.4 · OpenAI · 53.2%
4. MiniMax M3 · MiniMax · 45.1%
5. Claude Opus 4.7 (Adaptive) · Anthropic · 43.6%
mmmu Pro Python

5 models

1. GPT-5.5 · OpenAI · 83.2%
2. GPT-5.4 · OpenAI · 82.1%
3. Kimi K2.6 · Moonshot AI · 80.1%
4. GPT-5.4 mini · OpenAI · 78%
5. GPT-5.4 nano · OpenAI · 69.5%
omni Doc Bench15

2 models

1. MiniMax M3 · MiniMax · 91.6%
2. Qwen3.6-35B-A3B · Alibaba · 89.9%
real World Qa

3 models

1. Qwen3.6-35B-A3B · Alibaba · 85.3%
2. Qwen3.6-27B · Alibaba · 84.1%
3. LFM2.5-VL-450M · LiquidAI · 58.4%
video Mme With Sub

4 models

1. Qwen3.6-27B · Alibaba · 87.7%
2. MiMo-V2.5 · Xiaomi · 87.7%
3. Qwen3.6-35B-A3B · Alibaba · 86.6%
4. MiniMax M3 · MiniMax · 85.4%
video Mme No Sub

2 models

1. Qwen3.6-35B-A3B · Alibaba · 82.5%
2. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 72.2%
video Mme

1 model

1. Kimi K2.5 · Moonshot AI · 87.4%
math Vision

9 models

1. Qwen3.5 397B · Alibaba · 88.6%
2. Qwen3.6 Plus · Alibaba · 88.0%
3. Kimi K2.6 · Moonshot AI · 87.4%
4. Gemini 3 Pro · Google · 86.6%
5. Qwen3.5-122B-A10B · Alibaba · 86.2%
+4 more
dyna Math

1 model

1. Qwen3.6-27B · Alibaba · 85.6%
m Star

1 model

1. Qwen3.6-27B · Alibaba · 81.4%
mm Long Bench Doc

1 model

1. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 57.5%
cc Ocr

2 models

1. Qwen3.6-35B-A3B · Alibaba · 81.9%
2. Qwen3.6-27B · Alibaba · 81.2%
ai2d Test

2 models

1. Qwen3.6-35B-A3B · Alibaba · 92.7%
2. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 88.5%
count Bench

2 models

1. Qwen3.6-27B · Alibaba · 97.8%
2. LFM2.5-VL-450M · LiquidAI · 73.3%
refcoco Avg

4 models

1. Qwen3.6-27B · Alibaba · 92.5%
2. Qwen3.6-35B-A3B · Alibaba · 92.0%
3. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 90.5%
4. Interfaze Beta · Interfaze · 82.1%
odinw13

1 model

1. Qwen3.6-35B-A3B · Alibaba · 50.8%
erqa

6 models

1. Gemini 3.1 Pro · Google · 69.4%
2. GPT-5.4 · OpenAI · 65.4%
3. Muse Spark · Meta · 64.7%
4. Qwen3.6-27B · Alibaba · 62.5%
5. Grok 4.20 · xAI · 54.1%
+1 more
video Mmmu

8 models

1. Gemini 3 Pro · Google · 87.6%
2. Kimi K2.5 · Moonshot AI · 86.6%
3. Qwen3.5 397B · Alibaba · 84.7%
4. MiniMax M3 · MiniMax · 84.6%
5. Claude Opus 4.5 · Anthropic · 84.4%
+3 more
mlvu Avg

2 models

1. Qwen3.6-27B · Alibaba · 86.6%
2. Qwen3.6-35B-A3B · Alibaba · 86.2%
mmvu

4 models

1. Kimi K2.5 · Moonshot AI · 80.4%
2. Qwen3.5-122B-A10B · Alibaba · 74.7%
3. Qwen3.5-27B · Alibaba · 73.3%
4. Qwen3.5-35B-A3B · Alibaba · 72.3%
screen Spot Pro

14 models

1. Claude Opus 4.8 · Anthropic · 87.9%
2. GPT-5.4 · OpenAI · 85.4%
3. Gemini 3.1 Pro · Google · 84.4%
4. Muse Spark · Meta · 84.1%
5. Claude Opus 4.6 · Anthropic · 83.1%
+9 more
med Xpert Qa Mm

5 models

1. Gemini 3.1 Pro · Google · 81.3%
2. Muse Spark · Meta · 78.4%
3. GPT-5.4 · OpenAI · 77.1%
4. Grok 4.20 · xAI · 65.8%
5. Claude Opus 4.6 · Anthropic · 64.8%
zero Bench

3 models

1. GPT-5.4 · OpenAI · 41.0%
2. Muse Spark · Meta · 33.0%
3. Gemini 3.1 Pro · Google · 29.0%
simple Vqa

7 models

1. Step 3.7 Flash · StepFun · 79.2%
2. Gemini 3.1 Pro · Google · 72.4%
3. Muse Spark · Meta · 71.3%
4. GPT-5.4 · OpenAI · 61.1%
5. Qwen3.6-35B-A3B · Alibaba · 58.9%
+2 more
v Star

11 models

1. Kimi K2.6 · Moonshot AI · 96.9%
2. Qwen3.6 Plus · Alibaba · 96.9%
3. Qwen3.5 397B · Alibaba · 95.8%
4. Step 3.7 Flash · StepFun · 95.3%
5. Qwen3.6-27B · Alibaba · 94.7%
+6 more
charxiv

22 models

1. Claude Mythos Preview · Anthropic · 93.2%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 91%
3. Claude Opus 4.8 · Anthropic · 89.9%
4. Muse Spark · Meta · 86.4%
5. Gemini 3.5 Flash · Google · 84.2%
+17 more
charxiv No Tools

3 models

1. Claude Mythos Preview · Anthropic · 86.1%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 82.1%
3. Claude Opus 4.8 · Anthropic · 80.5%
blueprint Bench2

1 model

1. Gemini 3.5 Flash · Google · 33.6%

Math · 18 benchmarks

aime2024

1 model

1. o3-mini · OpenAI · 87.3%
aime2025

9 models

1. Kimi K2.5 (Reasoning) · Moonshot AI · 96.1%
2. Kimi K2.5 · Moonshot AI · 96.1%
3. GLM-4.7 · Z.AI · 95.7%
4. MiMo-V2-Flash · Xiaomi · 94.1%
5. Claude Sonnet 4.5 · Anthropic · 87%
+4 more
gsm8k

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 92.6%
2. DeepSeek V4 Flash Base · DeepSeek · 90.8%
math Benchmark

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 64.5%
2. DeepSeek V4 Flash Base · DeepSeek · 57.4%
cmath

2 models

1. DeepSeek V4 Flash Base · DeepSeek · 93.6%
2. DeepSeek V4 Pro Base · DeepSeek · 90.9%
aime2025 Arcee

6 models

1. Claude Opus 4.6 · Anthropic · 99.8%
2. Kimi K2.5 · Moonshot AI · 96.3%
3. Trinity-Large-Thinking · Arcee AI · 96.3%
4. GLM-5 · Z.AI · 93.3%
5. MiniMax M2.7 · MiniMax · 80.0%
+1 more
math500

2 models

1. MiniCPM5-1B · OpenBMB · 91.6%
2. LFM2.5-8B-A1B · LiquidAI · 88.8%
aime2026

13 models

1. Kimi K2.6 · Moonshot AI · 96.4%
2. GLM-5 · Z.AI · 95.8%
3. Kimi K2.5 · Moonshot AI · 95.8%
4. GLM-5.1 · Z.AI · 95.3%
5. Qwen3.6 Plus · Alibaba · 95.3%
+8 more
ipho2025 Theory

1 model

1. GPT-5.4 Pro · OpenAI · 93.5%
hmmt Feb2025

7 models

1. GLM-5 · Z.AI · 97.5%
2. Qwen3.6 Plus · Alibaba · 96.7%
3. Kimi K2.5 · Moonshot AI · 95.4%
4. Qwen3.5 397B · Alibaba · 94.8%
5. Qwen3.6-27B · Alibaba · 93.8%
+2 more
hmmt Nov2025

8 models

1. GLM-5 · Z.AI · 96.9%
2. Qwen3.6 Plus · Alibaba · 94.6%
3. GLM-5.1 · Z.AI · 94.0%
4. Claude Opus 4.5 · Anthropic · 93.3%
5. Qwen3.5 397B · Alibaba · 92.7%
+3 more
hmmt Feb2026

18 models

1. Qwen3.7 Max · Alibaba · 97.1%
2. DeepSeek V4 Pro (Max) · DeepSeek · 95.2%
3. DeepSeek V4 Flash (Max) · DeepSeek · 94.8%
4. DeepSeek V4 Pro (High) · DeepSeek · 94.0%
5. Kimi K2.6 · Moonshot AI · 92.7%
+13 more
imo Answer Bench

8 models

1. Qwen3.7 Max · Alibaba · 90.0%
2. DeepSeek V4 Pro (Max) · DeepSeek · 89.8%
3. DeepSeek V4 Flash (Max) · DeepSeek · 88.4%
4. DeepSeek V4 Pro (High) · DeepSeek · 88.0%
5. DeepSeek V4 Flash (High) · DeepSeek · 85.1%
+3 more
apex

8 models

1. Qwen3.7 Max · Alibaba · 44.5%
2. DeepSeek V4 Pro (Max) · DeepSeek · 38.3%
3. DeepSeek V4 Flash (Max) · DeepSeek · 33.0%
4. ZAYA1-8B · Zyphra · 32.2%
5. DeepSeek V4 Pro (High) · DeepSeek · 27.4%
+3 more
apex Shortlist

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 90.2%
2. DeepSeek V4 Flash (Max) · DeepSeek · 85.7%
3. DeepSeek V4 Pro (High) · DeepSeek · 85.5%
4. DeepSeek V4 Flash (High) · DeepSeek · 72.1%
5. DeepSeek V4 Flash · DeepSeek · 9.3%
+1 more
mm Answer Bench

9 models

1. Kimi K2.6 · Moonshot AI · 86.0%
2. Claude Opus 4.5 · Anthropic · 84.0%
3. GLM-5.1 · Z.AI · 83.8%
4. Qwen3.6 Plus · Alibaba · 83.8%
5. GLM-5 · Z.AI · 82.5%
+4 more
frontier Math

4 models

1. GPT-5.5 Pro · OpenAI · 52.4%
2. GPT-5.5 · OpenAI · 51.7%
3. GPT-5.4 Pro · OpenAI · 50%
4. Claude Opus 4.7 (Adaptive) · Anthropic · 43.8%
usamo2026

3 models

1. Claude Mythos Preview · Anthropic · 97.6%
2. Claude Opus 4.8 · Anthropic · 96.7%
3. MiniMax M3 · MiniMax · 85.7%

Multilingual · 6 benchmarks

mgsm

2 models

1. DeepSeek V4 Flash Base · DeepSeek · 85.7%
2. DeepSeek V4 Pro Base · DeepSeek · 84.4%
mmlu Pro X

10 models

1. Qwen3.7 Max · Alibaba · 87%
2. Claude Opus 4.5 · Anthropic · 85.7%
3. Qwen3.6 Plus · Alibaba · 84.7%
4. Qwen3.5 397B · Alibaba · 84.7%
5. GLM-5 · Z.AI · 83.1%
+5 more
nova63

6 models

1. Qwen3.5 397B · Alibaba · 59.1%
2. Qwen3.7 Max · Alibaba · 59.0%
3. Qwen3.6 Plus · Alibaba · 57.9%
4. Claude Opus 4.5 · Anthropic · 56.7%
5. Kimi K2.5 · Moonshot AI · 56.0%
+1 more
include

2 models

1. Claude Opus 4.8 · Anthropic · 87.6%
2. Qwen3.7 Max · Alibaba · 86.2%
poly Math

1 model

1. Qwen3.7 Max · Alibaba · 86.5%
maxife

1 model

1. Qwen3.7 Max · Alibaba · 89.2%

Instruction Following · 4 benchmarks

ifeval

19 models

1. Qwen3.5-27B · Alibaba · 95%
2. Qwen3.7 Max · Alibaba · 94.3%
3. Qwen3.6 Plus · Alibaba · 94.3%
4. Kimi K2.5 · Moonshot AI · 93.9%
5. o3-mini · OpenAI · 93.9%
+14 more
if Bench

11 models

1. Grok 4.3 · xAI · 81.3%
2. Qwen3.7 Max · Alibaba · 79.1%
3. Gemini 3.5 Flash · Google · 76.3%
4. Qwen3.6 Plus · Alibaba · 75.8%
5. Nemotron 3 Nano Omni 30B A3B · NVIDIA · 74.2%
+6 more
aa If Bench

116 models

1. Grok 4.3 · xAI · 81.3%
2. Qwen3.7 Max · Alibaba · 80.5%
3. MiMo-V2.5-Pro · Xiaomi · 79.9%
4. DeepSeek V4 Flash (Max) · DeepSeek · 79.2%
5. Qwen3.5 397B (Reasoning) · Alibaba · 78.8%
+111 more
sob Value Acc

1 model

1. Interfaze Beta · Interfaze · 79.5%

External · 1 benchmark

deep Swe

12 models

1. gpt-5.5[xhigh] · OpenAI · 70%
2. gpt-5.4[xhigh] · OpenAI · 56%
3. claude-opus-4.7[max] · Anthropic · 54%
4. claude-sonnet-4.6[high] · Anthropic · 32%
5. gemini-3.5-flash[medium] · Google · 28%
+7 more