context.vn

Toolathlon

21 models evaluated

#   Model                     Provider     Type    Score
1   Claude Opus 4.8           Anthropic    Closed  59.9%
2   Gemini 3.5 Flash          Google       Closed  56.5%
3   GPT-5.5                   OpenAI       Closed  55.6%
4   GPT-5.4                   OpenAI       Closed  54.6%
5   DeepSeek V4 Pro (Max)     DeepSeek     Open    51.8%
6   Kimi K2.6                 Moonshot AI  Open    50%
7   Step 3.7 Flash            StepFun      Open    49.5%
8   DeepSeek V4 Pro (High)    DeepSeek     Open    49%
9   DeepSeek V4 Flash (Max)   DeepSeek     Open    47.8%
10  DeepSeek V4 Pro           DeepSeek     Open    46.3%
11  MiniMax M2.7              MiniMax      Open    46.3%
12  Claude Opus 4.5           Anthropic    Closed  43.5%
13  DeepSeek V4 Flash (High)  DeepSeek     Open    43.5%
14  GPT-5.4 mini              OpenAI       Closed  42.9%
15  DeepSeek V4 Flash         DeepSeek     Open    40.7%
16  Qwen3.6 Plus              Alibaba      Closed  39.8%
17  GLM-5                     Z.AI         Open    38%
18  Qwen3.5 397B              Alibaba      Open    36.3%
19  GPT-5.4 nano              OpenAI       Closed  35.5%
20  Kimi K2.5                 Moonshot AI  Open    27.8%
21  Qwen3.6-35B-A3B           Alibaba      Open    26.9%