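An advanced version of the original MMLU. It is designed to emphasize reasoning over rote memorization and comprises more than 12,000 questions across 14 subject areas. Scores are reported as accuracy (%).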
Source: Artificial Analysis

| Rank | Model | Score |
|---|---|---|
| #1 | Anthropic Claude Opus 4.5 | 89.5% |
| #2 | Gemini 3 Flash | 89.0% |
| #3 | Anthropic Claude Opus 4.1 | 88.0% |
| #4 | Anthropic Claude Sonnet 4.5 | 87.5% |
| #5 | Anthropic Claude Opus 4 | 87.3% |
| #6 | DeepSeek DeepSeek V3.2 | 86.2% |
| #7 | Gemini 2.5 Pro | 86.2% |
| #8 | Grok Grok 4.1 Fast (Reasoning) | 85.4% |
| #9 | Anthropic Claude Sonnet 4 | 84.2% |
| #10 | Gemini 2.5 Flash | 83.2% |
| #11 | Baidu ERNIE 5.0 Thinking | 83.0% |
| #12 | ChatGPT GPT-5 | 82.0% |
| #13 | Meta AI Llama 4 Maverick | 80.9% |
| #14 | ChatGPT GPT-4.1 | 80.6% |
| #15 | Gemini 2.5 Flash Lite | 79.6% |
| #16 | Baidu ERNIE 4.5 300B A47B | 77.6% |
| #17 | ChatGPT GPT OSS 120B | 77.5% |
| #18 | ChatGPT GPT-5 Mini | 77.5% |
| #19 | ChatGPT GPT-5 Nano | 77.2% |
| #20 | Anthropic Claude Haiku 4.5 | 76.0% |
| #21 | Meta AI Llama 4 Scout | 75.2% |
| #22 | Grok Grok 4.1 Fast | 74.3% |
| #23 | Amazon Nova 2 Lite | 74.3% |
| #24 | Mistral AI Mistral Small 4 | 41.9% |
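
The score column is an accuracy percentage: the share of questions answered correctly. As a minimal sketch of that arithmetic, assuming each question has exactly one correct option letter, the snippet below computes the figure from a list of predictions and an answer key. The function name `accuracy_percent` and the sample data are hypothetical and are not taken from the Artificial Analysis evaluation harness.

```python
def accuracy_percent(predictions, answers):
    """Return the share of predictions matching the answer key, as a percentage."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count the positions where the predicted option equals the keyed option.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option letters chosen by a model vs. the answer key.
predicted = ["A", "C", "B", "D", "A"]
answer_key = ["A", "C", "D", "D", "A"]
print(f"{accuracy_percent(predicted, answer_key):.1f}%")  # 4 of 5 correct -> 80.0%
```

With more than 12,000 questions, a 0.1-point difference between two models corresponds to roughly a dozen questions, so rows with near-identical scores may not reflect a meaningful gap.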