Performance Metrics

AI Benchmarks

Compare the world's leading LLMs across key performance indicators including reasoning, coding, math, and language understanding.

Overall Performance Rankings

1. GPT-4 Turbo (OpenAI): 94.2% average score
2. Claude 3 Opus (Anthropic): 92.7% average score
3. Gemini 1.5 Pro (Google): 90.1% average score
4. Claude 3.5 Sonnet (Anthropic): 88.5% average score
5. Llama 3.1 405B (Meta): 85.3% average score

Performance by Category

Coding (HumanEval)

Code generation accuracy

GPT-4 Turbo: 91.2%
Claude 3 Opus: 88.7%
Gemini Pro: 84.3%

Math (MATH)

Mathematical reasoning

GPT-4 Turbo: 89.5%
Claude 3 Opus: 87.1%
Gemini Pro: 82.8%

Language (MMLU)

General knowledge & reasoning

GPT-4 Turbo: 86.4%
Claude 3 Opus: 86.8%
Gemini Pro: 83.7%

Speed

Tokens per second

GPT-4 Turbo: 102 tok/s
Claude 3 Sonnet: 150 tok/s
Gemini Pro: 88 tok/s

Context Window Sizes

GPT-4 Turbo: 128K tokens
Claude 3 Opus: 200K tokens
Gemini 1.5 Pro: 2M tokens
Claude 3.5 Sonnet: 200K tokens

Benchmark Methodology

Our benchmarks aggregate scores from industry-standard evaluation datasets, including HumanEval (code generation), MATH (mathematical reasoning), and MMLU (multitask language understanding), along with internal speed tests. All models were tested with default parameters and temperature fixed at 0.0 for consistency.
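
To make the aggregation step concrete, here is a minimal sketch that averages the per-benchmark accuracies listed above into a single figure. The equal weighting and the `aggregate_scores` helper are illustrative assumptions, not the exact pipeline behind this page; the overall rankings also factor in speed tests, so these simple averages will not match the headline scores exactly.

```python
# Hypothetical sketch: equal-weight average of per-benchmark accuracies.
# Scores are the category results shown above; weighting is an assumption.

BENCHMARK_SCORES = {
    "GPT-4 Turbo":   {"HumanEval": 91.2, "MATH": 89.5, "MMLU": 86.4},
    "Claude 3 Opus": {"HumanEval": 88.7, "MATH": 87.1, "MMLU": 86.8},
    "Gemini Pro":    {"HumanEval": 84.3, "MATH": 82.8, "MMLU": 83.7},
}

def aggregate_scores(scores: dict[str, float]) -> float:
    """Equal-weight average of benchmark accuracies, in percent."""
    return sum(scores.values()) / len(scores)

if __name__ == "__main__":
    for model, scores in BENCHMARK_SCORES.items():
        print(f"{model}: {aggregate_scores(scores):.1f}% average")
```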

Last Updated: December 2024

Test Environment: Standard API calls, temperature 0.0
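
For reference, a standard API call at temperature 0.0, timed to estimate tokens per second, might look like the sketch below using the OpenAI Python SDK. The model name and prompt are placeholders, and other providers' SDKs follow a similar pattern; this is not the actual test harness behind these numbers.

```python
# Minimal sketch: one chat-completion call at temperature 0.0, with a
# rough tokens-per-second estimate from usage stats and wall-clock time.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4-turbo",   # placeholder model identifier
    temperature=0.0,       # fixed for run-to-run consistency
    messages=[
        {"role": "user", "content": "Write a function that reverses a string."},
    ],
)
elapsed = time.perf_counter() - start

completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"~ {completion_tokens / elapsed:.0f} tok/s")
```

Note that timing a non-streaming call includes time to first token, so this simple estimate tends to understate raw generation throughput.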