context.vn

BenchLM Benchmarks

165 benchmarks · 480 model scores · Data from Jun 2, 2026

Reasoning · 19 benchmarks

bbh

3 models

1. DeepSeek V4 Pro Base · DeepSeek · 87.5%
2. DeepSeek V4 Flash Base · DeepSeek · 86.9%
3. MiniCPM5-1B · OpenBMB · 71.9%
drop

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 88.7%
2. DeepSeek V4 Flash Base · DeepSeek · 88.6%
hellaswag

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 88.0%
2. DeepSeek V4 Flash Base · DeepSeek · 85.7%
winogrande

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 81.5%
2. DeepSeek V4 Flash Base · DeepSeek · 79.5%
cluewsc

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 85.2%
2. DeepSeek V4 Flash Base · DeepSeek · 82.2%
LisanBench

62 models

4. GPT 5.4 (medium) · openai/gpt-5.4:thinking-medium · Closed
6. Gemini 3.1 Pro Preview (high) · google/gemini-3.1-pro-preview:thinking-high · Closed
7. Grok 4 (medium) · x-ai/grok-4:thinking-medium · Closed
8. Grok 4.20 Beta (thinking) · x-ai/grok-4.20-beta:thinking · Closed
9. GPT 5 (medium) · openai/gpt-5 · Closed
+57 more
pp Bench

41 models

1. GPT-5.5 · OpenAI · gpt-5.5@xhigh
2. GPT-5.4 · OpenAI · gpt-5.4@xhigh
3. GPT-5.2 · OpenAI · gpt-5.2@xhigh
4. Claude Opus 4.7 · claude-opus-4-7@thinking · Closed
5. Gemini 3.5 Flash · gemini-3.5-flash@high · Closed
+36 more
LongBench v2

10 models

1. Claude Opus 4.5 · Anthropic · 64.4%
2. Qwen3.5 397B · Alibaba · 63.2%
3. Qwen3.6 Plus · Alibaba · 62%
4. Kimi K2.5 · Moonshot AI · 61%
5. GLM-5 · Z.AI · 60.8%
+5 more
mrcrv2

2 models

1. Qwen3.7 Max · Alibaba · 90.4%
2. Gemini 3.5 Flash · Google · 77.3%
mrcrv2 64 128

1 model

1. GPT-5.5 · OpenAI · 83.1%
mrcrv2 128 256

2 models

1. GPT-5.5 · OpenAI · 87.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 59.2%
mrcr1m

7 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 83.5%
2. DeepSeek V4 Pro (High) · DeepSeek · 83.3%
3. DeepSeek V4 Flash (Max) · DeepSeek · 78.7%
4. DeepSeek V4 Flash (High) · DeepSeek · 76.9%
5. DeepSeek V4 Pro · DeepSeek · 44.7%
+2 more
corpus Qa1m

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 62.0%
2. DeepSeek V4 Flash (Max) · DeepSeek · 60.5%
3. DeepSeek V4 Flash (High) · DeepSeek · 59.3%
4. DeepSeek V4 Pro (High) · DeepSeek · 56.5%
5. DeepSeek V4 Pro · DeepSeek · 35.6%
+1 more
ARC-AGI-2

11 models

1. GPT-5.5 · OpenAI · 85%
2. GPT-5.4 Pro · OpenAI · 83.3%
3. Gemini 3.1 Pro · Google · 77.1%
4. Claude Opus 4.7 (Adaptive) · Anthropic · 75.8%
5. Gemini 3.5 Flash · Google · 72.1%
+6 more
ai Needle

4 models

1. Claude Opus 4.5 · Anthropic · 74%
2. Qwen3.5 397B · Alibaba · 68.7%
3. Qwen3.6 Plus · Alibaba · 68.3%
4. GLM-5 · Z.AI · 63.3%
GPQA Diamond

29 models

1. Gemini 3.1 Pro · Google · 94.3%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 94.2%
3. Claude Opus 4.8 · Anthropic · 93.6%
4. GPT-5.5 · OpenAI · 93.6%
5. GPT-5.4 · OpenAI · 92.8%
+24 more
lcr

115 models

1. GPT-5.2-Codex · OpenAI · 75.7%
2. GPT-5 (high) · OpenAI · 75.6%
3. GPT-5.1 · OpenAI · 75.0%
4. GPT-5.5 · OpenAI · 74.3%
5. GPT-5.4 · OpenAI · 74.0%
+110 more
critpt

116 models

1. GPT-5.4 Pro · OpenAI · 30.0%
2. GPT-5.5 · OpenAI · 27.1%
3. Gemini 3 Pro Deep Think · Google · 25.7%
4. GPT-5.4 · OpenAI · 23.4%
5. Gemini 3.1 Pro · Google · 17.7%
+111 more
BullshitBench v2

63 models

4. Claude Opus 4.6 (high) · anthropic/claude-opus-4.6@reasoning=high · Closed
6. Claude Opus 4.7 (none) · anthropic/claude-opus-4.7@reasoning=none · Closed
7. Claude Sonnet 4.5 (high) · anthropic/claude-sonnet-4.5@reasoning=high · Closed
9. Qwen3.5 397B (Reasoning) (high) · qwen/qwen3.5-397b-a17b@reasoning=high · Open
10. Claude Haiku 4.5 (high) · anthropic/claude-haiku-4.5@reasoning=high · Closed
+58 more