context.vn

BenchLM Benchmarks

165 benchmarks · 940 model scores · Data from Jun 2, 2026

Knowledge · 27 benchmarks

MMLU

8 models

1. o1 · OpenAI · 91.8%
2. GPT-4.1 · OpenAI · 90.2%
3. DeepSeek V4 Pro Base · DeepSeek · 90.1%
4. DeepSeek V4 Flash Base · DeepSeek · 88.7%
5. GPT-4.1 mini · OpenAI · 87.5%
+3 more

GPQA

54 models

1. Claude Mythos Preview · Anthropic · 94.5%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 94.2%
3. Claude Opus 4.8 · Anthropic · 93.6%
4. GPT-5.5 · OpenAI · 93.6%
5. GPT-5.4 · OpenAI · 92.8%
+49 more

SuperGPQA

18 models

1. Claude Opus 4.6 · Anthropic · 95%
2. Claude Sonnet 4.6 · Anthropic · 95%
3. Qwen 3.6 Max (preview) · Alibaba · 73.9%
4. Qwen3.7 Max · Alibaba · 73.6%
5. Qwen3.6 Plus · Alibaba · 71.6%
+13 more

MMLU-Pro

36 models

1. Qwen3.7 Max · Alibaba · 89.6%
2. Claude Opus 4.5 · Anthropic · 89.5%
3. Qwen3.6 Plus · Alibaba · 88.5%
4. Qwen3.5 397B · Alibaba · 87.8%
5. DeepSeek V4 Pro (Max) · DeepSeek · 87.5%
+31 more

AGIEval

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 83.1%
2. DeepSeek V4 Flash Base · DeepSeek · 82.6%

HLE

36 models

1. Claude Mythos Preview · Anthropic · 64.7%
2. GPT-5.4 Pro · OpenAI · 58.7%
3. Claude Opus 4.8 · Anthropic · 57.9%
4. GPT-5.5 Pro · OpenAI · 57.2%
5. Claude Opus 4.7 (Adaptive) · Anthropic · 54.7%
+31 more

FrontierScience

1 model

1. GPT-5.4 Pro · OpenAI · 36.7%

Artificial Analysis

126 models

1. GPT-5.5 · OpenAI · 60.2%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 57.3%
3. Gemini 3.1 Pro · Google · 57.2%
4. GPT-5.4 · OpenAI · 56.8%
5. Qwen3.7 Max · Alibaba · 56.6%
+121 more

AA GPQA Diamond

122 models

1. Gemini 3.1 Pro · Google · 94.1%
2. GPT-5.5 · OpenAI · 93.5%
3. Qwen3.7 Max · Alibaba · 92.3%
4. Gemini 3.5 Flash · Google · 92.2%
5. GPT-5.4 · OpenAI · 92.0%
+117 more

AA HLE

122 models

1. Gemini 3.1 Pro · Google · 44.7%
2. GPT-5.5 · OpenAI · 44.3%
3. GPT-5.4 · OpenAI · 41.6%
4. Gemini 3.5 Flash · Google · 41.0%
5. GPT-5.3 Codex · OpenAI · 39.9%
+117 more

AA Omniscience Index

114 models

1. Gemini 3.1 Pro · Google · 32.9%
2. Claude Opus 4.7 (Adaptive) · Anthropic · 26.2%
3. Gemini 3.5 Flash · Google · 22.7%
4. GPT-5.5 · OpenAI · 20.1%
5. Grok 4.3 · xAI · 18.3%
+109 more

Omniscience Accuracy

114 models

1. GPT-5.5 · OpenAI · 56.9%
2. Gemini 3 Pro · Google · 55.9%
3. Gemini 3.1 Pro · Google · 55.3%
4. Gemini 3.5 Flash · Google · 51.9%
5. GPT-5.3 Codex · OpenAI · 51.8%
+109 more

Omniscience Hallucination Rate

113 models

1. Command A+ · Cohere · 14.1%
2. Qwen3.7 Max · Alibaba · 22.9%
3. MiMo-V2.5-Pro · Xiaomi · 24.5%
4. Grok 4.3 · xAI · 25.0%
5. GLM-5.1 · Z.AI · 29.4%
+108 more

SimpleQA

8 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 57.9%
2. DeepSeek V4 Pro Base · DeepSeek · 55.2%
3. DeepSeek V4 Pro (High) · DeepSeek · 46.2%
4. DeepSeek V4 Pro · DeepSeek · 45%
5. DeepSeek V4 Flash (Max) · DeepSeek · 34.1%
+3 more

Chinese SimpleQA

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 84.4%
2. DeepSeek V4 Flash (Max) · DeepSeek · 78.9%
3. DeepSeek V4 Pro (High) · DeepSeek · 77.7%
4. DeepSeek V4 Pro · DeepSeek · 75.8%
5. DeepSeek V4 Flash (High) · DeepSeek · 73.2%
+1 more

HealthBench Hard

5 models

1. Muse Spark · Meta · 42.8%
2. GPT-5.4 · OpenAI · 40.1%
3. Gemini 3.1 Pro · Google · 20.6%
4. Grok 4.20 · xAI · 20.3%
5. Claude Opus 4.6 · Anthropic · 14.8%

MedXpertQA Text

5 models

1. Gemini 3.1 Pro · Google · 71.5%
2. GPT-5.4 · OpenAI · 59.6%
3. Muse Spark · Meta · 52.6%
4. Claude Opus 4.6 · Anthropic · 52.1%
5. Grok 4.20 · xAI · 50.2%

FrontierScience Research

1 model

1. GPT-5.4 Pro · OpenAI · 36.7%

HLE No Tools

16 models

1. Claude Mythos Preview · Anthropic · 56.8%
2. Claude Opus 4.8 · Anthropic · 49.8%
3. Claude Opus 4.7 (Adaptive) · Anthropic · 46.9%
4. Gemini 3.1 Pro · Google · 45.4%
5. GPT-5.5 Pro · OpenAI · 43.1%
+11 more

MMLU-Pro Arcee

6 models

1. Claude Opus 4.6 · Anthropic · 89.1%
2. Kimi K2.5 · Moonshot AI · 87.1%
3. GLM-5 · Z.AI · 85.8%
4. Trinity-Large-Thinking · Arcee AI · 83.4%
5. MiniMax M2.7 · MiniMax · 80.8%
+1 more

MMLU-Redux

8 models

1. Claude Opus 4.5 · Anthropic · 96.6%
2. Qwen3.7 Max · Alibaba · 95%
3. Qwen3.5 397B · Alibaba · 94.9%
4. Qwen3.6 Plus · Alibaba · 94.5%
5. Qwen3.6-27B · Alibaba · 93.5%
+3 more

MMMLU

4 models

1. Interfaze Beta · Interfaze · 90.9%
2. Qwen3.7 Max · Alibaba · 90.3%
3. DeepSeek V4 Pro Base · DeepSeek · 90.3%
4. DeepSeek V4 Flash Base · DeepSeek · 88.8%

C-Eval

7 models

1. Qwen3.6 Plus · Alibaba · 93.3%
2. DeepSeek V4 Pro Base · DeepSeek · 93.1%
3. Qwen3.5 397B · Alibaba · 93%
4. Claude Opus 4.5 · Anthropic · 92.2%
5. DeepSeek V4 Flash Base · DeepSeek · 92.1%
+2 more

CMMLU

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 90.8%
2. DeepSeek V4 Flash Base · DeepSeek · 90.4%

MultiLoKo

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 51.1%
2. DeepSeek V4 Flash Base · DeepSeek · 42.2%

FACTS Parametric

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 62.6%
2. DeepSeek V4 Flash Base · DeepSeek · 33.9%

TriviaQA

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 85.6%
2. DeepSeek V4 Flash Base · DeepSeek · 82.8%