context.vn

tau2 Bench

116 models evaluated

| # | Model | Provider | Type | Score |
|---|---|---|---|---|
| 1 | GLM-5V-Turbo | Z.AI | Closed | 98.5% |
| 2 | GLM-5-Turbo | Z.AI | Closed | 98.5% |
| 3 | GLM-5 | Z.AI | Open | 98.2% |
| 4 | GLM-5.1 | Z.AI | Open | 97.7% |
| 5 | Qwen3.6 Plus | Alibaba | Closed | 97.7% |
| 6 | Grok 4.3 | xAI | Closed | 97.7% |
| 7 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 96.2% |
| 8 | Kimi K2.6 | Moonshot AI | Open | 95.9% |
| 9 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 95.9% |
| 10 | GLM-4.7 | Z.AI | Open | 95.9% |
| 11 | Kimi K2.5 | Moonshot AI | Open | 95.9% |
| 12 | Qwen 3.6 Max (preview) | Alibaba | Closed | 95.9% |
| 13 | Gemini 3.1 Pro | Google | Closed | 95.6% |
| 14 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 95.6% |
| 15 | DeepSeek V4 Flash (High) | DeepSeek | Open | 95.6% |
| 16 | Gemini 3.5 Flash | Google | Closed | 95.3% |
| 17 | Qwen3.6-35B-A3B | Alibaba | Open | 95.3% |
| 18 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 95% |
| 19 | MiMo-V2-Pro | Xiaomi | Closed | 95% |
| 20 | Qwen3.7 Max | Alibaba | Closed | 94.7% |
| 21 | DeepSeek V4 Pro (High) | DeepSeek | Open | 94.2% |
| 22 | Qwen3.6-27B | Alibaba | Open | 94.2% |
| 23 | MiMo-V2.5-Pro | Xiaomi | Closed | 94.2% |
| 24 | Mistral Medium 3.5 128B | Mistral | Open | 94.2% |
| 25 | GPT-5.5 | OpenAI | Closed | 93.9% |
| 26 | Qwen3.5-27B | Alibaba | Open | 93.9% |
| 27 | Qwen3.5-122B-A10B | Alibaba | Open | 93.6% |
| 28 | Grok 4.1 Fast (Reasoning) | xAI | Closed | 93.3% |
| 29 | Hy3 Preview | Tencent | Open | 92.7% |
| 30 | GPT-5.2-Codex | OpenAI | Closed | 92.1% |
| 31 | Claude Opus 4.6 (Adaptive) | Anthropic | Closed | 92.1% |
| 32 | Muse Spark | Meta | Closed | 91.5% |
| 33 | MiMo-V2-Omni | Xiaomi | Closed | 91.2% |
| 34 | Trinity-Large-Preview | Arcee AI | Open | 90.1% |
| 35 | Trinity-Large-Thinking | Arcee AI | Open | 90.1% |
| 36 | Claude Opus 4.5 Thinking | Anthropic | Closed | 89.5% |
| 37 | Qwen3.5-35B-A3B | Alibaba | Open | 89.2% |
| 38 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 88.6% |
| 39 | LFM2.5-8B-A1B | LiquidAI | Open | 88.1% |
| 40 | GPT-5.4 | OpenAI | Closed | 87.1% |
| 41 | Gemini 3 Pro | Google | Closed | 87.1% |
| 42 | GPT-5 (medium) | OpenAI | Closed | 86.5% |
| 43 | Claude Opus 4.5 | Anthropic | Closed | 86.3% |
| 44 | GPT-5.3 Codex | OpenAI | Closed | 86% |
| 45 | Ling 2.6 Flash | InclusionAI | Open | 86% |
| 46 | Claude Opus 4.6 | Anthropic | Closed | 84.8% |
| 47 | GPT-5.2 | OpenAI | Closed | 84.8% |
| 48 | GPT-5 (high) | OpenAI | Closed | 84.8% |
| 49 | MiniMax M2.7 | MiniMax | Open | 84.8% |
| 50 | Qwen3.5 397B | Alibaba | Open | 83.9% |
| 51 | MiMo-V2-Flash | Xiaomi | Open | 83.9% |
| 52 | GPT-5.4 mini | OpenAI | Closed | 83.3% |
| 53 | GPT-5.1-Codex-Max | OpenAI | Closed | 83% |
| 54 | GPT-5.1-Codex | OpenAI | Closed | 83% |
| 55 | GPT-5.1 | OpenAI | Closed | 81.9% |
| 56 | o3 | OpenAI | Closed | 80.7% |
| 57 | Command A+ | Cohere | Open | 80.7% |
| 58 | Claude Sonnet 4.6 | Anthropic | Closed | 79.5% |
| 59 | DeepSeek V3.2 | DeepSeek | Open | 78.9% |
| 60 | GLM-4.6 | Z.AI | Open | 76.9% |
| 61 | GPT-5.4 nano | OpenAI | Closed | 76% |
| 62 | Grok Code Fast 1 | xAI | Closed | 75.7% |
| 63 | Grok 4 | xAI | Closed | 74.9% |
| 64 | K-Exaone | LG AI Research | Closed | 74.3% |
| 65 | Qwen3 Max | Alibaba | Closed | 74.3% |
| 66 | Claude Opus 4.7 | Anthropic | Closed | 74% |
| 67 | Claude 4.1 Opus Thinking | Anthropic | Closed | 71.4% |
| 68 | GPT-OSS 120B | OpenAI | Open | 65.8% |
| 69 | Grok 4 Fast (Reasoning) | xAI | Closed | 65.8% |
| 70 | Grok 4.1 Fast | xAI | Closed | 63.7% |
| 71 | o1 | OpenAI | Closed | 62.6% |
| 72 | Kimi K2 | Moonshot AI | Closed | 61.1% |
| 73 | GPT-OSS 20B | OpenAI | Open | 60.2% |
| 74 | Gemma 4 31B | Google | Open | 59.9% |
| 75 | Gemini 2.5 Pro | Google | Closed | 54.1% |
| 76 | GPT-4.1 mini | OpenAI | Closed | 52.9% |
| 77 | Claude 4 Sonnet | Anthropic | Closed | 52.3% |
| 78 | GPT-4.1 | OpenAI | Closed | 47.1% |
| 79 | Sarvam 105B | Sarvam | Open | 46.8% |
| 80 | GLM-4.5-Air | Z.AI | Closed | 46.5% |
| 81 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 45.3% |
| 82 | Gemma 4 26B A4B | Google | Open | 43.6% |
| 83 | Gemini 3 Flash | Google | Closed | 43.3% |
| 84 | Mistral Small 4 (Reasoning) | Mistral | Open | 41.2% |
| 85 | Mistral Small 4 | Mistral | Open | 41.2% |
| 86 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open | 37.4% |
| 87 | DeepSeek-R1 | DeepSeek | Open | 36.5% |
| 88 | DeepSeek V3.1 | DeepSeek | Open | 34.8% |
| 89 | Sarvam 30B | Sarvam | Open | 34.5% |
| 90 | Solar Pro 2 | Upstage | Closed | 31.9% |
| 91 | Gemini 3.1 Flash-Lite | Google | Closed | 31.3% |
| 92 | Mistral Large 2 | Mistral | Closed | 30.7% |
| 93 | o3-mini | OpenAI | Closed | 28.7% |
| 94 | Nemotron 3 Nano 30B | NVIDIA | Open | 25.4% |
| 95 | GPT-4o | OpenAI | Closed | 25.1% |
| 96 | Mistral Large 3 | Mistral | Closed | 24.6% |
| 97 | Mistral Medium 3 | Mistral | Closed | 24.3% |
| 98 | DeepSeek V3 | DeepSeek | Open | 22.8% |
| 99 | Granite-4.0-1B | IBM | Open | 22.8% |
| 100 | Claude 3 Haiku | Anthropic | Closed | 21.1% |
| 101 | Gemma 4 E4B | Google | Open | 20.8% |
| 102 | Gemma 4 E2B | Google | Open | 20.8% |
| 103 | Exaone 4.0 1.2B | LG AI Research | Open | 20.5% |
| 104 | Granite-4.0-H-1B | IBM | Open | 19.6% |
| 105 | Llama 3.1 405B | Meta | Open | 19% |
| 106 | Llama 4 Maverick | Meta | Open | 17.8% |
| 107 | GPT-4.1 nano | OpenAI | Closed | 17.3% |
| 108 | Llama 4 Scout | Meta | Open | 15.5% |
| 109 | Gemini 2.5 Flash | Google | Closed | 14.9% |
| 110 | Granite-4.0-H-350M | IBM | Open | 14.6% |
| 111 | Nova Pro | Amazon | Closed | 14% |
| 112 | Granite-4.0-350M | IBM | Open | 13.2% |
| 113 | Nemotron Ultra 253B | NVIDIA | Open | 11.4% |
| 114 | Gemma 3 27B | Google | Open | 10.5% |
| 115 | Exaone 4.0 32B | LG AI Research | Open | 4.1% |
| 116 | Phi-4 | Microsoft | Open | 0% |