A comparison of 20 LLM models across 8 standard benchmarks. Models tagged (API) are accessed through a hosted API; models tagged (local) have open weights and can be self-hosted.
| Model | Provider | MMLU | HumanEval | MATH | GPQA | ARC-C | HellaSwag | MBPP | MMMU | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (API) | OpenAI | 88.7 | 90.2 | 76.6 | 53.6 | 96.4 | 95.3 | 90.5 | 69.1 | Flagship multimodal, fast |
| GPT-4o mini (API) | OpenAI | 82.0 | 87.0 | 70.2 | 40.2 | 93.1 | 92.5 | 85.7 | 59.4 | Best price/performance |
| o1 (API) | OpenAI | 92.3 | 94.8 | 94.8 | 78.0 | 97.8 | 96.1 | 93.4 | 78.2 | Reasoning model, slow |
| o1-mini (API) | OpenAI | 85.2 | 92.4 | 90.0 | 60.0 | 95.2 | 93.8 | 90.1 | 60.0 | Faster reasoning |
| Claude 3.5 Sonnet (API) | Anthropic | 88.7 | 92.0 | 78.3 | 59.4 | 96.7 | 95.8 | 91.0 | 68.3 | Strong for coding |
| Claude 3.5 Haiku (API) | Anthropic | 75.2 | 88.1 | 69.2 | 41.6 | 91.2 | 89.4 | 85.3 | 52.1 | Fast and cheap |
| Claude 3 Opus (API) | Anthropic | 86.8 | 84.9 | 60.1 | 50.4 | 95.4 | 94.2 | 86.2 | 59.4 | Strong at analysis |
| Gemini 2.0 Flash (API) | Google | 85.0 | 89.0 | 73.0 | 49.0 | 94.5 | 93.2 | 87.8 | 64.2 | Very cheap, 1M context |
| Gemini 1.5 Pro (API) | Google | 86.5 | 84.1 | 74.3 | 46.2 | 95.0 | 93.5 | 85.4 | 62.2 | 2M context, video |
| Grok-2 (API) | xAI | 87.5 | 88.4 | 76.1 | 56.0 | 95.8 | 94.6 | 88.2 | 66.7 | Real-time data, unfiltered |
| DeepSeek V3 (API) | DeepSeek | 88.5 | 82.6 | 90.2 | 59.1 | 95.5 | 94.0 | 84.8 | 49.5 | Best MATH among non-reasoning models, very cheap |
| DeepSeek R1 (API) | DeepSeek | 90.8 | 92.8 | 97.3 | 71.5 | 96.8 | 95.2 | 91.6 | 51.2 | Reasoning model, open source |
| Mistral Large (API) | Mistral | 84.0 | 81.2 | 70.0 | 45.3 | 93.4 | 91.8 | 82.4 | 52.0 | European, GDPR-friendly |
| Llama 3.3 70B (local) | Meta | 86.0 | 88.4 | 77.0 | 50.7 | 94.8 | 93.6 | 86.7 | 60.3 | Strong open-source all-rounder |
| Llama 3.1 405B (local) | Meta | 88.6 | 89.0 | 73.8 | 51.1 | 96.1 | 95.2 | 88.4 | 64.5 | Largest open model |
| Qwen 2.5 72B (local) | Alibaba | 86.1 | 86.4 | 83.1 | 49.0 | 94.2 | 92.8 | 85.0 | 58.2 | Strong at math |
| Qwen 2.5 Coder 32B (local) | Alibaba | 74.2 | 92.7 | 76.5 | 38.4 | 89.5 | 87.2 | 90.2 | 42.0 | Code specialist |
| Phi-4 (local) | Microsoft | 84.8 | 82.6 | 80.4 | 56.1 | 94.5 | 93.0 | 84.5 | 58.8 | Small but powerful (14B) |
| Gemma 2 27B (local) | Google | 75.2 | 64.4 | 52.4 | 34.2 | 88.5 | 86.2 | 72.0 | 46.8 | Google's open model |
| Mixtral 8x22B (local) | Mistral | 77.8 | 75.0 | 49.8 | 36.2 | 91.2 | 89.4 | 78.5 | 48.2 | MoE architecture |
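To turn the table into a single rough ranking, the sketch below averages the eight benchmark columns with equal weights for a handful of rows. The scores are copied from the table above; the equal-weight averaging (no normalization, no per-benchmark weighting) is an illustrative assumption, not a standard leaderboard method. Extend the `SCORES` dict with the remaining rows as needed.

```python
# Rank models by unweighted mean benchmark score.
# Scores copied from the comparison table above (subset of rows).
# Equal weighting across heterogeneous benchmarks is a simplifying
# assumption for illustration only.

BENCHMARKS = ["MMLU", "HumanEval", "MATH", "GPQA",
              "ARC-C", "HellaSwag", "MBPP", "MMMU"]

SCORES = {
    "o1":                [92.3, 94.8, 94.8, 78.0, 97.8, 96.1, 93.4, 78.2],
    "DeepSeek R1":       [90.8, 92.8, 97.3, 71.5, 96.8, 95.2, 91.6, 51.2],
    "GPT-4o":            [88.7, 90.2, 76.6, 53.6, 96.4, 95.3, 90.5, 69.1],
    "Claude 3.5 Sonnet": [88.7, 92.0, 78.3, 59.4, 96.7, 95.8, 91.0, 68.3],
    "Llama 3.3 70B":     [86.0, 88.4, 77.0, 50.7, 94.8, 93.6, 86.7, 60.3],
}

def mean(xs):
    return sum(xs) / len(xs)

# Sort models by average score, highest first.
ranking = sorted(SCORES.items(), key=lambda kv: mean(kv[1]), reverse=True)

for model, scores in ranking:
    print(f"{model:<20} avg={mean(scores):.1f}")
```

Note that a flat average lets easy, saturated benchmarks (ARC-C, HellaSwag) dilute the harder ones (GPQA, MATH), so treat the resulting order as a rough summary rather than a verdict.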