
BenchLM Benchmarks

165 benchmarks · 109 model scores · Data from Jun 2, 2026

Math · 18 benchmarks

aime2024

1 model

1. o3-mini · OpenAI · 87.3%
aime2025

9 models

1. Kimi K2.5 (Reasoning) · Moonshot AI · 96.1%
2. Kimi K2.5 · Moonshot AI · 96.1%
3. GLM-4.7 · Z.AI · 95.7%
4. MiMo-V2-Flash · Xiaomi · 94.1%
5. Claude Sonnet 4.5 · Anthropic · 87%
+4 more
gsm8k

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 92.6%
2. DeepSeek V4 Flash Base · DeepSeek · 90.8%
math Benchmark

2 models

1. DeepSeek V4 Pro Base · DeepSeek · 64.5%
2. DeepSeek V4 Flash Base · DeepSeek · 57.4%
cmath

2 models

1. DeepSeek V4 Flash Base · DeepSeek · 93.6%
2. DeepSeek V4 Pro Base · DeepSeek · 90.9%
aime2025 Arcee

6 models

1. Claude Opus 4.6 · Anthropic · 99.8%
2. Kimi K2.5 · Moonshot AI · 96.3%
3. Trinity-Large-Thinking · Arcee AI · 96.3%
4. GLM-5 · Z.AI · 93.3%
5. MiniMax M2.7 · MiniMax · 80.0%
+1 more
math500

2 models

1. MiniCPM5-1B · OpenBMB · 91.6%
2. LFM2.5-8B-A1B · LiquidAI · 88.8%
aime2026

13 models

1. Kimi K2.6 · Moonshot AI · 96.4%
2. GLM-5 · Z.AI · 95.8%
3. Kimi K2.5 · Moonshot AI · 95.8%
4. GLM-5.1 · Z.AI · 95.3%
5. Qwen3.6 Plus · Alibaba · 95.3%
+8 more
ipho2025 Theory

1 model

1. GPT-5.4 Pro · OpenAI · 93.5%
hmmt Feb2025

7 models

1. GLM-5 · Z.AI · 97.5%
2. Qwen3.6 Plus · Alibaba · 96.7%
3. Kimi K2.5 · Moonshot AI · 95.4%
4. Qwen3.5 397B · Alibaba · 94.8%
5. Qwen3.6-27B · Alibaba · 93.8%
+2 more
hmmt Nov2025

8 models

1. GLM-5 · Z.AI · 96.9%
2. Qwen3.6 Plus · Alibaba · 94.6%
3. GLM-5.1 · Z.AI · 94.0%
4. Claude Opus 4.5 · Anthropic · 93.3%
5. Qwen3.5 397B · Alibaba · 92.7%
+3 more
hmmt Feb2026

18 models

1. Qwen3.7 Max · Alibaba · 97.1%
2. DeepSeek V4 Pro (Max) · DeepSeek · 95.2%
3. DeepSeek V4 Flash (Max) · DeepSeek · 94.8%
4. DeepSeek V4 Pro (High) · DeepSeek · 94.0%
5. Kimi K2.6 · Moonshot AI · 92.7%
+13 more
imo Answer Bench

8 models

1. Qwen3.7 Max · Alibaba · 90.0%
2. DeepSeek V4 Pro (Max) · DeepSeek · 89.8%
3. DeepSeek V4 Flash (Max) · DeepSeek · 88.4%
4. DeepSeek V4 Pro (High) · DeepSeek · 88.0%
5. DeepSeek V4 Flash (High) · DeepSeek · 85.1%
+3 more
apex

8 models

1. Qwen3.7 Max · Alibaba · 44.5%
2. DeepSeek V4 Pro (Max) · DeepSeek · 38.3%
3. DeepSeek V4 Flash (Max) · DeepSeek · 33.0%
4. ZAYA1-8B · Zyphra · 32.2%
5. DeepSeek V4 Pro (High) · DeepSeek · 27.4%
+3 more
apex Shortlist

6 models

1. DeepSeek V4 Pro (Max) · DeepSeek · 90.2%
2. DeepSeek V4 Flash (Max) · DeepSeek · 85.7%
3. DeepSeek V4 Pro (High) · DeepSeek · 85.5%
4. DeepSeek V4 Flash (High) · DeepSeek · 72.1%
5. DeepSeek V4 Flash · DeepSeek · 9.3%
+1 more
mm Answer Bench

9 models

1. Kimi K2.6 · Moonshot AI · 86.0%
2. Claude Opus 4.5 · Anthropic · 84.0%
3. GLM-5.1 · Z.AI · 83.8%
4. Qwen3.6 Plus · Alibaba · 83.8%
5. GLM-5 · Z.AI · 82.5%
+4 more
frontier Math

4 models

1. GPT-5.5 Pro · OpenAI · 52.4%
2. GPT-5.5 · OpenAI · 51.7%
3. GPT-5.4 Pro · OpenAI · 50%
4. Claude Opus 4.7 (Adaptive) · Anthropic · 43.8%
usamo2026

3 models

1. Claude Mythos Preview · Anthropic · 97.6%
2. Claude Opus 4.8 · Anthropic · 96.7%
3. MiniMax M3 · MiniMax · 85.7%