Toolathlon
21 models evaluated
| Rank | Model | Organization | Open/Closed | Score | |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.8 | Anthropic | Closed | 59.9% | |
| 2 | Gemini 3.5 Flash | Google | Closed | 56.5% | |
| 3 | GPT-5.5 | OpenAI | Closed | 55.6% | |
| 4 | GPT-5.4 | OpenAI | Closed | 54.6% | |
| 5 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 51.8% | |
| 6 | Kimi K2.6 | Moonshot AI | Open | 50.0% | |
| 7 | Step 3.7 Flash | StepFun | Open | 49.5% | |
| 8 | DeepSeek V4 Pro (High) | DeepSeek | Open | 49.0% | |
| 9 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 47.8% | |
| 10 | DeepSeek V4 Pro | DeepSeek | Open | 46.3% | |
| 11 | MiniMax M2.7 | MiniMax | Open | 46.3% | |
| 12 | Claude Opus 4.5 | Anthropic | Closed | 43.5% | |
| 13 | DeepSeek V4 Flash (High) | DeepSeek | Open | 43.5% | |
| 14 | GPT-5.4 mini | OpenAI | Closed | 42.9% | |
| 15 | DeepSeek V4 Flash | DeepSeek | Open | 40.7% | |
| 16 | Qwen3.6 Plus | Alibaba | Closed | 39.8% | |
| 17 | GLM-5 | Z.AI | Open | 38.0% | |
| 18 | Qwen3.5 397B | Alibaba | Open | 36.3% | |
| 19 | GPT-5.4 nano | OpenAI | Closed | 35.5% | |
| 20 | Kimi K2.5 | Moonshot AI | Open | 27.8% | |
| 21 | Qwen3.6-35B-A3B | Alibaba | Open | 26.9% | |