BenchLM Benchmarks
165 benchmarks · 566 model scores · Data from Jun 2, 2026
Coding · 23 benchmarks
codeforces
4 models
1. DeepSeek V4 Pro (Max) · DeepSeek · 3206.0
2. DeepSeek V4 Flash (Max) · DeepSeek · 3052.0
3. DeepSeek V4 Pro (High) · DeepSeek · 2919.0
4. DeepSeek V4 Flash (High) · DeepSeek · 2816.0
swe Verified
49 models
1. Claude Mythos Preview · Anthropic · 93.9%
2. Claude Opus 4.8 · Anthropic · 88.6%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 87.6%
4. GPT-5.3 Codex · OpenAI · 85%
5. Claude Opus 4.5 · Anthropic · 80.9%
+44 more
swe Rebench
13 models
1. Claude Opus 4.6 · Anthropic · 65.3%
2. GLM-5 · Z.AI · 62.8%
3. GLM-5.1 · Z.AI · 62.7%
4. DeepSeek V3.2 · DeepSeek · 60.9%
5. Claude Sonnet 4.6 · Anthropic · 60.7%
+8 more
live Code Bench
14 models
1. DeepSeek V4 Pro (Max) · DeepSeek · 93.5%
2. Qwen3.7 Max · Alibaba · 91.6%
3. DeepSeek V4 Flash (Max) · DeepSeek · 91.6%
4. DeepSeek V4 Pro (High) · DeepSeek · 89.8%
5. Kimi K2.6 · Moonshot AI · 89.6%
+9 more
live Code Bench V6
8 models
1. Kimi K2.6 · Moonshot AI · 89.6%
2. Qwen3.6 Plus · Alibaba · 87.1%
3. Kimi K2.5 · Moonshot AI · 85.0%
4. Claude Opus 4.5 · Anthropic · 84.8%
5. Qwen3.5 397B · Alibaba · 83.6%
+3 more
live Code Bench Pro
6 models
1. GPT-5.4 · OpenAI · 87.5%
2. Gemini 3.1 Pro · Google · 82.9%
3. Muse Spark · Meta · 80.0%
4. Grok 4.20 · xAI · 74.2%
5. Claude Opus 4.6 · Anthropic · 70.7%
+1 more
swe Pro
35 models
1. Claude Mythos Preview · Anthropic · 77.8%
2. Claude Opus 4.8 · Anthropic · 69.2%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 64.3%
4. Qwen3.7 Max · Alibaba · 60.6%
5. MiniMax M3 · MiniMax · 59%
+30 more
swe Multilingual
20 models
1. Claude Opus 4.8 · Anthropic · 84.4%
2. Composer 2.5 · Cursor · 79.8%
3. Qwen3.7 Max · Alibaba · 78.3%
4. Claude Opus 4.5 · Anthropic · 77.5%
5. Kimi K2.6 · Moonshot AI · 76.7%
+15 more
cursor Bench31
7 models
1. Claude Opus 4.7 (Adaptive) · Anthropic · 64.8%
2. Composer 2.5 · Cursor · 63.2%
3. GPT-5.5 · OpenAI · 59.2%
4. Composer 2 · Cursor · 52.2%
5. Gemini 3.5 Flash · Google · 49.8%
+2 more
program Bench
8 models
5. Claude Opus 4.6 · claude-opus-4-6-programbench · Closed
7. Claude Sonnet 4.6 · claude-sonnet-4-6-programbench · Closed
8. GPT-5.4 · gpt-5-4-programbench · Closed
9. Gemini 3.1 Pro · gemini-3-1-pro-programbench · Closed
10. Gemini 3 Flash · gemini-3-flash-programbench · Closed
+3 more
nl2 Repo
8 models
1. Qwen3.7 Max · Alibaba · 47.2%
2. Claude Opus 4.5 · Anthropic · 43.2%
3. Qwen 3.6 Max (preview) · Alibaba · 42.9%
4. GLM-5.1 · Z.AI · 42.7%
5. MiniMax M3 · MiniMax · 42.1%
+3 more
react Native Evals
16 models
1. Composer 2 · Cursor · 96.1%
2. Composer 2 Fast · Cursor · 94.9%
3. GPT-5.4 · OpenAI · 85.3%
4. GPT-5.5 · OpenAI · 84.7%
5. Claude Opus 4.6 · Anthropic · 84.1%
+11 more
swe Verified Arcee
5 models
1. Claude Opus 4.6 · Anthropic · 75.6%
2. MiniMax M2.7 · MiniMax · 75.4%
3. GLM-5 · Z.AI · 72.8%
4. Kimi K2.5 · Moonshot AI · 70.8%
5. Trinity-Large-Thinking · Arcee AI · 63.2%
sci Code
9 models
1. Qwen3.7 Max · Alibaba · 53.5%
2. Gemini 3.5 Flash · Google · 53.1%
3. Kimi K2.6 · Moonshot AI · 52.2%
4. Kimi K2.5 · Moonshot AI · 48.7%
5. Grok 4.3 · xAI · 47.3%
+4 more
aa Coding Index
119 models
1. GPT-5.5 · OpenAI · 59.1%
2. GPT-5.4 · OpenAI · 57.3%
3. Gemini 3.1 Pro · Google · 55.5%
4. GPT-5.3 Codex · OpenAI · 53.1%
5. Claude Opus 4.7 · Anthropic · 53.1%
+114 more
aa Sci Code
122 models
1. Gemini 3.1 Pro · Google · 58.9%
2. GPT-5.4 · OpenAI · 56.6%
3. GPT-5.5 · OpenAI · 56.1%
4. Gemini 3 Pro · Google · 56.1%
5. GPT-5.2-Codex · OpenAI · 54.6%
+117 more
terminal Bench Hard
115 models
1. GPT-5.5 · OpenAI · 60.6%
2. GPT-5.4 · OpenAI · 57.6%
3. Claude Opus 4.7 · Anthropic · 54.5%
4. Gemini 3.1 Pro · Google · 53.8%
5. GPT-5.3 Codex · OpenAI · 53.0%
+110 more