BenchLM Benchmarks
165 benchmarks · 940 model scores · Data from Jun 2, 2026
Knowledge · 27 benchmarks
mmlu
8 models
1. o1 · OpenAI · 91.8%
2. GPT-4.1 · OpenAI · 90.2%
3. DeepSeek V4 Pro Base · DeepSeek · 90.1%
4. DeepSeek V4 Flash Base · DeepSeek · 88.7%
5. GPT-4.1 mini · OpenAI · 87.5%
+3 more

gpqa
54 models
1. Claude Mythos Preview · Anthropic · 94.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 94.2%
3. Claude Opus 4.8 · Anthropic · 93.6%
4. GPT-5.5 · OpenAI · 93.6%
5. GPT-5.4 · OpenAI · 92.8%
+49 more

super Gpqa
18 models
1. Claude Opus 4.6 · Anthropic · 95%
2. Claude Sonnet 4.6 · Anthropic · 95%
3. Qwen 3.6 Max (preview) · Alibaba · 73.9%
4. Qwen3.7 Max · Alibaba · 73.6%
5. Qwen3.6 Plus · Alibaba · 71.6%
+13 more

mmlu Pro
36 models
1. Qwen3.7 Max · Alibaba · 89.6%
2. Claude Opus 4.5 · Anthropic · 89.5%
3. Qwen3.6 Plus · Alibaba · 88.5%
4. Qwen3.5 397B · Alibaba · 87.8%
5. DeepSeek V4 Pro (Max) · DeepSeek · 87.5%
+31 more

hle
36 models
1. Claude Mythos Preview · Anthropic · 64.7%
2. GPT-5.4 Pro · OpenAI · 58.7%
3. Claude Opus 4.8 · Anthropic · 57.9%
4. GPT-5.5 Pro · OpenAI · 57.2%
5. Claude Opus 4.7 (Adaptive) · Anthropic · 54.7%
+31 more

artificial Analysis
126 models
1. GPT-5.5 · OpenAI · 60.2%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 57.3%
3. Gemini 3.1 Pro · Google · 57.2%
4. GPT-5.4 · OpenAI · 56.8%
5. Qwen3.7 Max · Alibaba · 56.6%
+121 more

aa Gpqa Diamond
122 models
1. Gemini 3.1 Pro · Google · 94.1%
2. GPT-5.5 · OpenAI · 93.5%
3. Qwen3.7 Max · Alibaba · 92.3%
4. Gemini 3.5 Flash · Google · 92.2%
5. GPT-5.4 · OpenAI · 92.0%
+117 more

aa Hle
122 models
1. Gemini 3.1 Pro · Google · 44.7%
2. GPT-5.5 · OpenAI · 44.3%
3. GPT-5.4 · OpenAI · 41.6%
4. Gemini 3.5 Flash · Google · 41.0%
5. GPT-5.3 Codex · OpenAI · 39.9%
+117 more

aa Omniscience Index
114 models
1. Gemini 3.1 Pro · Google · 32.9%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 26.2%
3. Gemini 3.5 Flash · Google · 22.7%
4. GPT-5.5 · OpenAI · 20.1%
5. Grok 4.3 · xAI · 18.3%
+109 more

omniscience Accuracy
114 models
1. GPT-5.5 · OpenAI · 56.9%
2. Gemini 3 Pro · Google · 55.9%
3. Gemini 3.1 Pro · Google · 55.3%
4. Gemini 3.5 Flash · Google · 51.9%
5. GPT-5.3 Codex · OpenAI · 51.8%
+109 more

omniscience Hallucination Rate
113 models
1. Command A+ · Cohere · 14.1%
2. Qwen3.7 Max · Alibaba · 22.9%
3. MiMo-V2.5-Pro · Xiaomi · 24.5%
4. Grok 4.3 · xAI · 25.0%
5. GLM-5.1 · Z.AI · 29.4%
+108 more

simple Qa
8 models
1. DeepSeek V4 Pro (Max) · DeepSeek · 57.9%
2. DeepSeek V4 Pro Base · DeepSeek · 55.2%
3. DeepSeek V4 Pro (High) · DeepSeek · 46.2%
4. DeepSeek V4 Pro · DeepSeek · 45%
5. DeepSeek V4 Flash (Max) · DeepSeek · 34.1%
+3 more

chinese Simple Qa
6 models
1. DeepSeek V4 Pro (Max) · DeepSeek · 84.4%
2. DeepSeek V4 Flash (Max) · DeepSeek · 78.9%
3. DeepSeek V4 Pro (High) · DeepSeek · 77.7%
4. DeepSeek V4 Pro · DeepSeek · 75.8%
5. DeepSeek V4 Flash (High) · DeepSeek · 73.2%
+1 more

health Bench Hard
5 models
1. Muse Spark · Meta · 42.8%
2. GPT-5.4 · OpenAI · 40.1%
3. Gemini 3.1 Pro · Google · 20.6%
4. Grok 4.20 · xAI · 20.3%
5. Claude Opus 4.6 · Anthropic · 14.8%

med Xpert Qa Text
5 models
1. Gemini 3.1 Pro · Google · 71.5%
2. GPT-5.4 · OpenAI · 59.6%
3. Muse Spark · Meta · 52.6%
4. Claude Opus 4.6 · Anthropic · 52.1%
5. Grok 4.20 · xAI · 50.2%

hle No Tools
16 models
1. Claude Mythos Preview · Anthropic · 56.8%
2. Claude Opus 4.8 · Anthropic · 49.8%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 46.9%
4. Gemini 3.1 Pro · Google · 45.4%
5. GPT-5.5 Pro · OpenAI · 43.1%
+11 more

mmlu Pro Arcee
6 models
1. Claude Opus 4.6 · Anthropic · 89.1%
2. Kimi K2.5 · Moonshot AI · 87.1%
3. GLM-5 · Z.AI · 85.8%
4. Trinity-Large-Thinking · Arcee AI · 83.4%
5. MiniMax M2.7 · MiniMax · 80.8%
+1 more

mmlu Redux
8 models
1. Claude Opus 4.5 · Anthropic · 96.6%
2. Qwen3.7 Max · Alibaba · 95%
3. Qwen3.5 397B · Alibaba · 94.9%
4. Qwen3.6 Plus · Alibaba · 94.5%
5. Qwen3.6-27B · Alibaba · 93.5%
+3 more

mmmlu
4 models
1. Interfaze Beta · Interfaze · 90.9%
2. Qwen3.7 Max · Alibaba · 90.3%
3. DeepSeek V4 Pro Base · DeepSeek · 90.3%
4. DeepSeek V4 Flash Base · DeepSeek · 88.8%