context.vn

BenchLM Benchmarks

165 benchmarks · 566 model scores · Data from Jun 2, 2026

Coding · 23 benchmarks

HumanEval

2 models

1 · DeepSeek V4 Pro Base · DeepSeek · 76.8%
2 · DeepSeek V4 Flash Base · DeepSeek · 69.5%
BigCodeBench

2 models

1 · DeepSeek V4 Pro Base · DeepSeek · 59.2%
2 · DeepSeek V4 Flash Base · DeepSeek · 56.8%
Codeforces

4 models

1 · DeepSeek V4 Pro (Max) · DeepSeek · 3206.0
2 · DeepSeek V4 Flash (Max) · DeepSeek · 3052.0
3 · DeepSeek V4 Pro (High) · DeepSeek · 2919.0
4 · DeepSeek V4 Flash (High) · DeepSeek · 2816.0
SWE-bench Verified

49 models

1 · Claude Mythos Preview · Anthropic · 93.9%
2 · Claude Opus 4.8 · Anthropic · 88.6%
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 87.6%
4 · GPT-5.3 Codex · OpenAI · 85%
5 · Claude Opus 4.5 · Anthropic · 80.9%
+44 more
SWE-rebench

13 models

1 · Claude Opus 4.6 · Anthropic · 65.3%
2 · GLM-5 · Z.AI · 62.8%
3 · GLM-5.1 · Z.AI · 62.7%
4 · DeepSeek V3.2 · DeepSeek · 60.9%
5 · Claude Sonnet 4.6 · Anthropic · 60.7%
+8 more
LiveCodeBench

14 models

1 · DeepSeek V4 Pro (Max) · DeepSeek · 93.5%
2 · Qwen3.7 Max · Alibaba · 91.6%
3 · DeepSeek V4 Flash (Max) · DeepSeek · 91.6%
4 · DeepSeek V4 Pro (High) · DeepSeek · 89.8%
5 · Kimi K2.6 · Moonshot AI · 89.6%
+9 more
LiveCodeBench v6

8 models

1 · Kimi K2.6 · Moonshot AI · 89.6%
2 · Qwen3.6 Plus · Alibaba · 87.1%
3 · Kimi K2.5 · Moonshot AI · 85.0%
4 · Claude Opus 4.5 · Anthropic · 84.8%
5 · Qwen3.5 397B · Alibaba · 83.6%
+3 more
LiveCodeBench Pro

6 models

1 · GPT-5.4 · OpenAI · 87.5%
2 · Gemini 3.1 Pro · Google · 82.9%
3 · Muse Spark · Meta · 80.0%
4 · Grok 4.20 · xAI · 74.2%
5 · Claude Opus 4.6 · Anthropic · 70.7%
+1 more
SWE-bench Pro

35 models

1 · Claude Mythos Preview · Anthropic · 77.8%
2 · Claude Opus 4.8 · Anthropic · 69.2%
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 64.3%
4 · Qwen3.7 Max · Alibaba · 60.6%
5 · MiniMax M3 · MiniMax · 59%
+30 more
SWE-bench Multilingual

20 models

1 · Claude Opus 4.8 · Anthropic · 84.4%
2 · Composer 2.5 · Cursor · 79.8%
3 · Qwen3.7 Max · Alibaba · 78.3%
4 · Claude Opus 4.5 · Anthropic · 77.5%
5 · Kimi K2.6 · Moonshot AI · 76.7%
+15 more
SWE-bench Multimodal

1 model

1 · Claude Opus 4.8 · Anthropic · 38.4%
cursor Bench31

7 models

1 · Claude Opus 4.7 (Adaptive) · Anthropic · 64.8%
2 · Composer 2.5 · Cursor · 63.2%
3 · GPT-5.5 · OpenAI · 59.2%
4 · Composer 2 · Cursor · 52.2%
5 · Gemini 3.5 Flash · Google · 49.8%
+2 more
Multi-SWE-bench

1 model

1 · MiniMax M2.7 · MiniMax · 52.7%
vibe Pro

1 model

1 · MiniMax M2.7 · MiniMax · 55.6%
ProgramBench

8 models

5 · Claude Opus 4.6 · claude-opus-4-6-programbench · Closed
7 · Claude Sonnet 4.6 · claude-sonnet-4-6-programbench · Closed
8 · GPT-5.4 · gpt-5-4-programbench · Closed
9 · Gemini 3.1 Pro · gemini-3-1-pro-programbench · Closed
10 · Gemini 3 Flash · gemini-3-flash-programbench · Closed
+3 more
NL2Repo

8 models

1 · Qwen3.7 Max · Alibaba · 47.2%
2 · Claude Opus 4.5 · Anthropic · 43.2%
3 · Qwen 3.6 Max (preview) · Alibaba · 42.9%
4 · GLM-5.1 · Z.AI · 42.7%
5 · MiniMax M3 · MiniMax · 42.1%
+3 more
React Native Evals

16 models

1 · Composer 2 · Cursor · 96.1%
2 · Composer 2 Fast · Cursor · 94.9%
3 · GPT-5.4 · OpenAI · 85.3%
4 · GPT-5.5 · OpenAI · 84.7%
5 · Claude Opus 4.6 · Anthropic · 84.1%
+11 more
swe Verified Arcee

5 models

1 · Claude Opus 4.6 · Anthropic · 75.6%
2 · MiniMax M2.7 · MiniMax · 75.4%
3 · GLM-5 · Z.AI · 72.8%
4 · Kimi K2.5 · Moonshot AI · 70.8%
5 · Trinity-Large-Thinking · Arcee AI · 63.2%
Spider 2.0-Lite

1 model

1 · Interfaze Beta · Interfaze · 52.9%
SciCode

9 models

1 · Qwen3.7 Max · Alibaba · 53.5%
2 · Gemini 3.5 Flash · Google · 53.1%
3 · Kimi K2.6 · Moonshot AI · 52.2%
4 · Kimi K2.5 · Moonshot AI · 48.7%
5 · Grok 4.3 · xAI · 47.3%
+4 more
AA Coding Index

119 models

1 · GPT-5.5 · OpenAI · 59.1%
2 · GPT-5.4 · OpenAI · 57.3%
3 · Gemini 3.1 Pro · Google · 55.5%
4 · GPT-5.3 Codex · OpenAI · 53.1%
5 · Claude Opus 4.7 · Anthropic · 53.1%
+114 more
AA SciCode

122 models

1 · Gemini 3.1 Pro · Google · 58.9%
2 · GPT-5.4 · OpenAI · 56.6%
3 · GPT-5.5 · OpenAI · 56.1%
4 · Gemini 3 Pro · Google · 56.1%
5 · GPT-5.2-Codex · OpenAI · 54.6%
+117 more
Terminal-Bench Hard

115 models

1 · GPT-5.5 · OpenAI · 60.6%
2 · GPT-5.4 · OpenAI · 57.6%
3 · Claude Opus 4.7 · Anthropic · 54.5%
4 · Gemini 3.1 Pro · Google · 53.8%
5 · GPT-5.3 Codex · OpenAI · 53.0%
+110 more