context.vn

BenchLM Benchmarks

165 benchmarks · 845 model scores · Data from Jun 2, 2026

Agentic · 32 benchmarks

Terminal-Bench 2

24 models

1 · GPT-5.5 · OpenAI · 82.0%
2 · Gemini 3.5 Flash · Google · 76.2%
3 · Claude Opus 4.8 · Anthropic · 74.6%
4 · Qwen3.7 Max · Alibaba · 69.7%
5 · Claude Opus 4.7 (Adaptive) · Anthropic · 69.4%
+19 more
BrowseComp

24 models

1 · GPT-5.5 Pro · OpenAI · 90.1%
2 · GPT-5.4 Pro · OpenAI · 89.3%
3 · Claude Mythos Preview · Anthropic · 86.9%
4 · GPT-5.5 · OpenAI · 84.4%
5 · Claude Opus 4.8 · Anthropic · 84.3%
+19 more
HLE with Tools

6 models

1 · Qwen3.7 Max · Alibaba · 53.5%
2 · DeepSeek V4 Pro (Max) · DeepSeek · 48.2%
3 · Step 3.7 Flash · StepFun · 47.2%
4 · DeepSeek V4 Flash (Max) · DeepSeek · 45.1%
5 · DeepSeek V4 Pro (High) · DeepSeek · 44.7%
+1 more
GDPval-AA

113 models

1 · Claude Opus 4.8 · Anthropic · 1890
2 · GPT-5.5 · OpenAI · 1769
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 1753
4 · Claude Opus 4.7 · Anthropic · 1680
5 · GPT-5.4 · OpenAI · 1674
+108 more
GDPval-AA (Normalized)

114 models

1 · GPT-5.5 · OpenAI · 63.5%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 62.6%
3 · Claude Opus 4.7 · Anthropic · 59.0%
4 · GPT-5.4 · OpenAI · 58.7%
5 · Gemini 3.5 Flash · Google · 57.8%
+109 more
AA Agentic Index

113 models

1 · GPT-5.5 · OpenAI · 74.1%
2 · Claude Opus 4.7 (Adaptive) · Anthropic · 71.3%
3 · Gemini 3.5 Flash · Google · 70.3%
4 · GPT-5.4 · OpenAI · 68.0%
5 · Claude Opus 4.6 (Adaptive) · Anthropic · 67.6%
+108 more
APEX-Agents-AA

20 models

1 · Gemini 3.5 Flash · Google · 47.1%
2 · GPT-5.5 · OpenAI · 37.7%
3 · GPT-5.4 · OpenAI · 33.3%
4 · Claude Opus 4.6 (Adaptive) · Anthropic · 33.0%
5 · Gemini 3.1 Pro · Google · 32.0%
+15 more
Gert Labs

54 models

1 · Claude Opus 4.8 · Anthropic · 72.97%
2 · GPT-5.5 · OpenAI · 72.93%
3 · Claude Opus 4.7 · Anthropic · 65.59%
4 · GPT-5.4 · OpenAI · 64.89%
5 · Qwen3.7 Max · Alibaba · 64.27%
+49 more
OSWorld-Verified

21 models

1 · Claude Opus 4.8 · Anthropic · 83.4%
2 · Holo3-35B-A3B · H Company · 82.6%
3 · Claude Mythos Preview · Anthropic · 79.6%
4 · Holo3-122B-A10B · H Company · 78.8%
5 · GPT-5.5 · OpenAI · 78.7%
+16 more
CyberGym

10 models

1 · Claude Mythos Preview · Anthropic · 83.1%
2 · GPT-5.5 · OpenAI · 81.8%
3 · GPT-5.4 · OpenAI · 79.0%
4 · Claude Opus 4.7 (Adaptive) · Anthropic · 73.1%
5 · GLM-5.1 · Z.AI · 68.7%
+5 more
OSWorld

2 models

1 · Claude Opus 4.5 · Anthropic · 66.3%
2 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · 47.4%
AndroidWorld

1 model

1 · Qwen3.6-27B · Alibaba · 70.3%
MCP Atlas

23 models

1 · Gemini 3.5 Flash · Google · 83.6%
2 · Claude Opus 4.8 · Anthropic · 82.2%
3 · Claude Opus 4.7 (Adaptive) · Anthropic · 77.3%
4 · Qwen3.7 Max · Alibaba · 76.4%
5 · GPT-5.5 · OpenAI · 75.3%
+18 more
Toolathlon

21 models

1 · Claude Opus 4.8 · Anthropic · 59.9%
2 · Gemini 3.5 Flash · Google · 56.5%
3 · GPT-5.5 · OpenAI · 55.6%
4 · GPT-5.4 · OpenAI · 54.6%
5 · DeepSeek V4 Pro (Max) · DeepSeek · 51.8%
+16 more
tau2-Bench

116 models

1 · GLM-5V-Turbo · Z.AI · 98.5%
2 · GLM-5-Turbo · Z.AI · 98.5%
3 · GLM-5 · Z.AI · 98.2%
4 · GLM-5.1 · Z.AI · 97.7%
5 · Qwen3.6 Plus · Alibaba · 97.7%
+111 more
DeepSearchQA

9 models

1 · Claude Opus 4.8 · Anthropic · 93.1%
2 · Step 3.7 Flash · StepFun · 92.8%
3 · Kimi K2.6 · Moonshot AI · 92.5%
4 · Kimi K2.5 · Moonshot AI · 77.1%
5 · Muse Spark · Meta · 74.8%
+4 more
tau2 Airline

1 model

1 · ZAYA1-74B-Preview · Zyphra · 56.1%
PinchBench

46 models

3 · MiniMax M2.7 · MiniMax · minimax/minimax-m2.7
4 · Claude Opus 4.6 · anthropic/claude-opus-4.6 · Closed
5 · MiMo-V2-Omni · xiaomi/mimo-v2-omni · Closed
6 · GLM-5.1 · z-ai/glm-5.1 · Open
7 · Qwen3.5-122B-A10B · qwen/qwen3.5-122b-a10b · Open
+41 more
OpenHands Index

25 models

1 · Claude Opus 4.7 (Adaptive) · Anthropic · claude-opus-4-7
2 · Claude Opus 4.6 · Anthropic · claude-opus-4-6
3 · GPT-5.5 · OpenAI · 65.9%
4 · GPT-5.4 · OpenAI · 64.3%
5 · Claude Opus 4.5 · claude-opus-4-5 · Closed
+20 more
SWE Atlas Refactoring

10 models

2 · GPT-5.5 · OpenAI · Gpt-5.5 (Codex)
3 · GPT-5.4 · OpenAI · Gpt-5.4 (Codex)
4 · GPT-5.3 Codex · Gpt-5.3 (Codex) · Closed
5 · Claude Opus 4.6 · Opus-4.6 (Claude Code) · Closed
6 · Gemini 3.1 Pro · Gemini-3.1-Pro (Gemini CLI) · Closed
+5 more
InferenceBench

14 models

1 · Claude Sonnet 4.6 · Anthropic · 8.08x
2 · GLM-5 · Z.AI · 6.20x
3 · Gemini 3.1 Pro · Google · 6.16x
4 · GPT-5.3 Codex (High) · OpenAI · 5.48x
5 · GPT-5.4 (High) · OpenAI · 5.08x
+9 more
BFCL v4

5 models

1 · Qwen3.7 Max · Alibaba · 75.0%
2 · LFM2.5-8B-A1B · LiquidAI · 49.7%
3 · ZAYA1-8B · Zyphra · 39.2%
4 · MiniCPM5-1B · OpenBMB · 25.1%
5 · LFM2.5-VL-450M · LiquidAI · 21.1%
MLE-Bench Lite

1 model

1 · MiniMax M2.7 · MiniMax · 66.6%
MM-ClawBench

2 models

1 · MiniMax M2.7 · MiniMax · 62.7%
2 · MiMo-V2.5 · Xiaomi · 23.8%
Claw-Eval

22 models

1 · Claude Opus 4.6 · Anthropic · opus46
2 · Claude Sonnet 4.6 · Anthropic · sonnet46
3 · MiMo-V2.5-Pro · Xiaomi · mimo_v25_pro
4 · Muse Spark · muse_spark · Closed
5 · Kimi K2.6 · kimi_k26 · Open
+17 more
QwenClawBench

9 models

1 · Qwen3.7 Max · Alibaba · 64.3%
2 · Qwen 3.6 Max (preview) · Alibaba · 59.0%
3 · Qwen3.6 Plus · Alibaba · 57.2%
4 · Kimi K2.5 · Moonshot AI · 54.3%
5 · GLM-5 · Z.AI · 54.1%
+4 more
QwenWebBench

4 models

1 · Qwen3.7 Max · Alibaba · 1568
2 · Qwen 3.6 Max (preview) · Alibaba · 1532
3 · Qwen3.6-27B · Alibaba · 1487
4 · Qwen3.6-35B-A3B · Alibaba · 1397
tau3-Bench

9 models

1 · Mistral Medium 3.5 128B · Mistral · 91.4%
2 · MiMo-V2.5-Pro · Xiaomi · 72.9%
3 · Qwen3.6 Plus · Alibaba · 70.7%
4 · GLM-5.1 · Z.AI · 70.6%
5 · Claude Opus 4.5 · Anthropic · 70.2%
+4 more
VitaBench

8 models

1 · Qwen3.7 Max · Alibaba · 47.9%
2 · Qwen3.6 Plus · Alibaba · 44.3%
3 · Qwen3.5 397B · Alibaba · 43.7%
4 · Qwen3.6-35B-A3B · Alibaba · 35.6%
5 · Claude Opus 4.5 · Anthropic · 23.3%
+3 more
DeepPlanning

6 models

1 · Qwen3.6 Plus · Alibaba · 41.5%
2 · Qwen3.5 397B · Alibaba · 37.6%
3 · Claude Opus 4.5 · Anthropic · 26.4%
4 · Qwen3.6-35B-A3B · Alibaba · 25.9%
5 · GLM-5 · Z.AI · 14.6%
+1 more
MCP Tasks

5 models

1 · Qwen3.5 397B · Alibaba · 74.2%
2 · Qwen3.6 Plus · Alibaba · 74.1%
3 · Claude Opus 4.5 · Anthropic · 71.8%
4 · GLM-5 · Z.AI · 60.8%
5 · Kimi K2.5 · Moonshot AI · 59.1%
Wide Research

7 models

1 · Kimi K2.6 · Moonshot AI · 80.8%
2 · Claude Opus 4.5 · Anthropic · 76.4%
3 · Qwen3.6 Plus · Alibaba · 74.3%
4 · Qwen3.5 397B · Alibaba · 74.0%
5 · Kimi K2.5 · Moonshot AI · 72.7%
+2 more