Benchmarks:
MMLU: General knowledge
HumanEval: Code generation
MATH: Mathematical reasoning
GPQA: Expert knowledge
ARC-C: Logical reasoning
HellaSwag: Common sense
MBPP: Basic Python
MMMU: Vision + text
20 models
| Model | Type | Provider | MMLU | HumanEval | MATH | GPQA | ARC-C | HellaSwag | MBPP | MMMU | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | api | OpenAI | 88.7 | 90.2 | 76.6 | 53.6 | 96.4 | 95.3 | 90.5 | 69.1 | Flagship multimodal, fast |
| GPT-4o mini | api | OpenAI | 82.0 | 87.0 | 70.2 | 40.2 | 93.1 | 92.5 | 85.7 | 59.4 | Best price/quality ratio |
| o1 | api | OpenAI | 92.3 | 94.8 | 94.8 | 78.0 | 97.8 | 96.1 | 93.4 | 78.2 | Reasoning model, slow |
| o1-mini | api | OpenAI | 85.2 | 92.4 | 90.0 | 60.0 | 95.2 | 93.8 | 90.1 | 60.0 | Faster reasoning |
| Claude 3.5 Sonnet | api | Anthropic | 88.7 | 92.0 | 78.3 | 59.4 | 96.7 | 95.8 | 91.0 | 68.3 | Best for coding |
| Claude 3.5 Haiku | api | Anthropic | 75.2 | 88.1 | 69.2 | 41.6 | 91.2 | 89.4 | 85.3 | 52.1 | Fast and cheap |
| Claude 3 Opus | api | Anthropic | 86.8 | 84.9 | 60.1 | 50.4 | 95.4 | 94.2 | 86.2 | 59.4 | Strong at analysis |
| Gemini 2.0 Flash | api | Google | 85.0 | 89.0 | 73.0 | 49.0 | 94.5 | 93.2 | 87.8 | 64.2 | Very cheap, 1M context |
| Gemini 1.5 Pro | api | Google | 86.5 | 84.1 | 74.3 | 46.2 | 95.0 | 93.5 | 85.4 | 62.2 | 2M context, video |
| Grok-2 | api | xAI | 87.5 | 88.4 | 76.1 | 56.0 | 95.8 | 94.6 | 88.2 | 66.7 | Real-time data, unfiltered |
| DeepSeek V3 | api | DeepSeek | 88.5 | 82.6 | 90.2 | 59.1 | 95.5 | 94.0 | 84.8 | 49.5 | Best MATH among non-reasoning models, very cheap |
| DeepSeek R1 | api | DeepSeek | 90.8 | 92.8 | 97.3 | 71.5 | 96.8 | 95.2 | 91.6 | 51.2 | Reasoning model, open-source |
| Mistral Large | api | Mistral | 84.0 | 81.2 | 70.0 | 45.3 | 93.4 | 91.8 | 82.4 | 52.0 | European, GDPR |
| Llama 3.3 70B | local | Meta | 86.0 | 88.4 | 77.0 | 50.7 | 94.8 | 93.6 | 86.7 | 60.3 | Best open-source |
| Llama 3.1 405B | local | Meta | 88.6 | 89.0 | 73.8 | 51.1 | 96.1 | 95.2 | 88.4 | 64.5 | Largest open model |
| Qwen 2.5 72B | local | Alibaba | 86.1 | 86.4 | 83.1 | 49.0 | 94.2 | 92.8 | 85.0 | 58.2 | Strong at math |
| Qwen 2.5 Coder 32B | local | Alibaba | 74.2 | 92.7 | 76.5 | 38.4 | 89.5 | 87.2 | 90.2 | 42.0 | Code specialist |
| Phi-4 | local | Microsoft | 84.8 | 82.6 | 80.4 | 56.1 | 94.5 | 93.0 | 84.5 | 58.8 | Small but powerful (14B) |
| Gemma 2 27B | local | Google | 75.2 | 64.4 | 52.4 | 34.2 | 88.5 | 86.2 | 72.0 | 46.8 | Google's open model |
| Mixtral 8x22B | local | Mistral | 77.8 | 75.0 | 49.8 | 36.2 | 91.2 | 89.4 | 78.5 | 48.2 | MoE architecture |
Score legend: 90+ Top tier, 80+ Excellent, 70+ Good, <70 Basic
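
The score legend above can be expressed as a small lookup. The following is a minimal Python sketch, not part of the original page: the helper name `score_tier` and the `TIERS` list are hypothetical, and it assumes that "90+" means "greater than or equal to 90" (likewise for 80+ and 70+). The sample scores are copied from the table.

```python
# Tier thresholds taken from the score legend. Treating "90+" as
# "score >= 90" is an assumption; the legend does not state how
# boundary values are handled.
TIERS = [(90.0, "Top tier"), (80.0, "Excellent"), (70.0, "Good")]


def score_tier(score: float) -> str:
    """Return the legend tier for a benchmark score (hypothetical helper)."""
    for threshold, label in TIERS:
        if score >= threshold:
            return label
    return "Basic"  # everything below 70


# A few (model, benchmark) scores copied from the table.
examples = {
    ("o1", "MMLU"): 92.3,
    ("GPT-4o", "MMLU"): 88.7,
    ("Claude 3.5 Haiku", "MMLU"): 75.2,
    ("Gemma 2 27B", "HumanEval"): 64.4,
}

for (model, bench), score in examples.items():
    print(f"{model} {bench} {score}: {score_tier(score)}")
```

Keeping the thresholds in a single descending list means a change to the legend only requires editing `TIERS`.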