BenchLM Benchmarks
165 benchmarks · 845 model scores · Data from Jun 2, 2026
Agentic · 32 benchmarks
terminal Bench2
24 models
1 · GPT-5.5 · OpenAI · 82.0%
2 · Gemini 3.5 Flash · Google · 76.2%
3 · Claude Opus 4.8 · Anthropic · 74.6%
4 · Qwen3.7 Max · Alibaba · 69.7%
5 · Claude Opus 4.7 (Adaptive) · Anthropic · 69.4%
+19 more

browse Comp
24 models
1 · GPT-5.5 Pro · OpenAI · 90.1%
2 · GPT-5.4 Pro · OpenAI · 89.3%
3 · Claude Mythos Preview · Anthropic · 86.9%
4 · GPT-5.5 · OpenAI · 84.4%
5 · Claude Opus 4.8 · Anthropic · 84.3%
+19 more

hle With Tools
6 models
1 · Qwen3.7 Max · Alibaba · 53.5%
2 · DeepSeek V4 Pro (Max) · DeepSeek · 48.2%
3 · Step 3.7 Flash · StepFun · 47.2%
4 · DeepSeek V4 Flash (Max) · DeepSeek · 45.1%
5 · DeepSeek V4 Pro (High) · DeepSeek · 44.7%
+1 more

gdpval Aa
113 models
1 · Claude Opus 4.8 · Anthropic · 1890
2 · GPT-5.5 · OpenAI · 1769
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 1753
4 · Claude Opus 4.7 · Anthropic · 1680
5 · GPT-5.4 · OpenAI · 1674
+108 more

gdpval Aa Normalized
114 models
1 · GPT-5.5 · OpenAI · 63.5%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 62.6%
3 · Claude Opus 4.7 · Anthropic · 59.0%
4 · GPT-5.4 · OpenAI · 58.7%
5 · Gemini 3.5 Flash · Google · 57.8%
+109 more

aa Agentic Index
113 models
1 · GPT-5.5 · OpenAI · 74.1%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 71.3%
3 · Gemini 3.5 Flash · Google · 70.3%
4 · GPT-5.4 · OpenAI · 68.0%
5 · Claude Opus 4.6 (Adaptive) · Anthropic · 67.6%
+108 more

apex Agents Aa
20 models
1 · Gemini 3.5 Flash · Google · 47.1%
2 · GPT-5.5 · OpenAI · 37.7%
3 · GPT-5.4 · OpenAI · 33.3%
4 · Claude Opus 4.6 (Adaptive) · Anthropic · 33.0%
5 · Gemini 3.1 Pro · Google · 32.0%
+15 more

gert Labs
54 models
1 · Claude Opus 4.8 · Anthropic · 72.97%
2 · GPT-5.5 · OpenAI · 72.93%
3 · Claude Opus 4.7 · Anthropic · 65.59%
4 · GPT-5.4 · OpenAI · 64.89%
5 · Qwen3.7 Max · Alibaba · 64.27%
+49 more

os World Verified
21 models
1 · Claude Opus 4.8 · Anthropic · 83.4%
2 · Holo3-35B-A3B · H Company · 82.6%
3 · Claude Mythos Preview · Anthropic · 79.6%
4 · Holo3-122B-A10B · H Company · 78.8%
5 · GPT-5.5 · OpenAI · 78.7%
+16 more

cyber Gym
10 models
1 · Claude Mythos Preview · Anthropic · 83.1%
2 · GPT-5.5 · OpenAI · 81.8%
3 · GPT-5.4 · OpenAI · 79.0%
4 · Claude Opus 4.7 (Adaptive) · Anthropic · 73.1%
5 · GLM-5.1 · Z.AI · 68.7%
+5 more

mcp Atlas
23 models
1 · Gemini 3.5 Flash · Google · 83.6%
2 · Claude Opus 4.8 · Anthropic · 82.2%
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 77.3%
4 · Qwen3.7 Max · Alibaba · 76.4%
5 · GPT-5.5 · OpenAI · 75.3%
+18 more

toolathlon
21 models
1 · Claude Opus 4.8 · Anthropic · 59.9%
2 · Gemini 3.5 Flash · Google · 56.5%
3 · GPT-5.5 · OpenAI · 55.6%
4 · GPT-5.4 · OpenAI · 54.6%
5 · DeepSeek V4 Pro (Max) · DeepSeek · 51.8%
+16 more

tau2 Bench
116 models
1 · GLM-5V-Turbo · Z.AI · 98.5%
2 · GLM-5-Turbo · Z.AI · 98.5%
3 · GLM-5 · Z.AI · 98.2%
4 · GLM-5.1 · Z.AI · 97.7%
5 · Qwen3.6 Plus · Alibaba · 97.7%
+111 more

deep Search Qa
9 models
1 · Claude Opus 4.8 · Anthropic · 93.1%
2 · Step 3.7 Flash · StepFun · 92.8%
3 · Kimi K2.6 · Moonshot AI · 92.5%
4 · Kimi K2.5 · Moonshot AI · 77.1%
5 · Muse Spark · Meta · 74.8%
+4 more

pinch Bench
46 models
3 · MiniMax M2.7 · MiniMax · minimax/minimax-m2.7
4 · Claude Opus 4.6 · anthropic/claude-opus-4.6 · Closed
5 · MiMo-V2-Omni · xiaomi/mimo-v2-omni · Closed
6 · GLM-5.1 · z-ai/glm-5.1 · Open
7 · Qwen3.5-122B-A10B · qwen/qwen3.5-122b-a10b · Open
+41 more

open Hands Index
25 models
1 · Claude Opus 4.7 (Adaptive) · Anthropic · claude-opus-4-7
2 · Claude Opus 4.6 · Anthropic · claude-opus-4-6
3 · GPT-5.5 · OpenAI · 65.9%
4 · GPT-5.4 · OpenAI · 64.3%
5 · Claude Opus 4.5 · claude-opus-4-5 · Closed
+20 more

swe Atlas Refactoring
10 models
2 · GPT-5.5 · OpenAI · Gpt-5.5 (Codex)
3 · GPT-5.4 · OpenAI · Gpt-5.4 (Codex)
4 · GPT-5.3 Codex · Gpt-5.3 (Codex) · Closed
5 · Claude Opus 4.6 · Opus-4.6 (Claude Code) · Closed
6 · Gemini 3.1 Pro · Gemini-3.1-Pro (Gemini CLI) · Closed
+5 more

inference Bench
14 models
1 · Claude Sonnet 4.6 · Anthropic · 8.08x
2 · GLM-5 · Z.AI · 6.20x
3 · Gemini 3.1 Pro · Google · 6.16x
4 · GPT-5.3 Codex (High) · OpenAI · 5.48x
5 · GPT-5.4 (High) · OpenAI · 5.08x
+9 more

bfcl V4
5 models
1 · Qwen3.7 Max · Alibaba · 75.0%
2 · LFM2.5-8B-A1B · LiquidAI · 49.7%
3 · ZAYA1-8B · Zyphra · 39.2%
4 · MiniCPM5-1B · OpenBMB · 25.1%
5 · LFM2.5-VL-450M · LiquidAI · 21.1%

claw Eval
22 models
1 · Claude Opus 4.6 · Anthropic · opus46
2 · Claude Sonnet 4.6 · Anthropic · sonnet46
3 · MiMo-V2.5-Pro · Xiaomi · mimo_v25_pro
4 · Muse Spark · muse_spark · Closed
5 · Kimi K2.6 · kimi_k26 · Open
+17 more

qwen Claw Bench
9 models
1 · Qwen3.7 Max · Alibaba · 64.3%
2 · Qwen 3.6 Max (preview) · Alibaba · 59.0%
3 · Qwen3.6 Plus · Alibaba · 57.2%
4 · Kimi K2.5 · Moonshot AI · 54.3%
5 · GLM-5 · Z.AI · 54.1%
+4 more

qwen Web Bench
4 models
1 · Qwen3.7 Max · Alibaba · 1568
2 · Qwen 3.6 Max (preview) · Alibaba · 1532
3 · Qwen3.6-27B · Alibaba · 1487
4 · Qwen3.6-35B-A3B · Alibaba · 1397

tau3 Bench
9 models
1 · Mistral Medium 3.5 128B · Mistral · 91.4%
2 · MiMo-V2.5-Pro · Xiaomi · 72.9%
3 · Qwen3.6 Plus · Alibaba · 70.7%
4 · GLM-5.1 · Z.AI · 70.6%
5 · Claude Opus 4.5 · Anthropic · 70.2%
+4 more

vita Bench
8 models
1 · Qwen3.7 Max · Alibaba · 47.9%
2 · Qwen3.6 Plus · Alibaba · 44.3%
3 · Qwen3.5 397B · Alibaba · 43.7%
4 · Qwen3.6-35B-A3B · Alibaba · 35.6%
5 · Claude Opus 4.5 · Anthropic · 23.3%
+3 more

deep Planning
6 models
1 · Qwen3.6 Plus · Alibaba · 41.5%
2 · Qwen3.5 397B · Alibaba · 37.6%
3 · Claude Opus 4.5 · Anthropic · 26.4%
4 · Qwen3.6-35B-A3B · Alibaba · 25.9%
5 · GLM-5 · Z.AI · 14.6%
+1 more

mcp Tasks
5 models
1 · Qwen3.5 397B · Alibaba · 74.2%
2 · Qwen3.6 Plus · Alibaba · 74.1%
3 · Claude Opus 4.5 · Anthropic · 71.8%
4 · GLM-5 · Z.AI · 60.8%
5 · Kimi K2.5 · Moonshot AI · 59.1%

wide Research
7 models
1 · Kimi K2.6 · Moonshot AI · 80.8%
2 · Claude Opus 4.5 · Anthropic · 76.4%
3 · Qwen3.6 Plus · Alibaba · 74.3%
4 · Qwen3.5 397B · Alibaba · 74.0%
5 · Kimi K2.5 · Moonshot AI · 72.7%
+2 more