DeepSeek V4 Benchmarks: How It Scores on LiveCodeBench, MMLU, SWE-bench, and More
DeepSeek V4 arrived on April 24, 2026, with bold claims: the best open-source model available, a top Codeforces rating, and near-frontier performance across reasoning, knowledge, and agentic tasks. Here's a complete analysis of every major benchmark result — separated by model variant and reasoning mode.
Understanding DeepSeek V4's Benchmark Modes
DeepSeek V4 reports results across six configurations:
| Configuration | Description |
|---|---|
| V4-Flash Non-Think | Fast, no chain-of-thought |
| V4-Flash Think High | Moderate extended reasoning |
| V4-Flash Think Max | Maximum reasoning effort (Flash) |
| V4-Pro Non-Think | Fast, no chain-of-thought (Pro) |
| V4-Pro Think High | Moderate extended reasoning (Pro) |
| V4-Pro Think Max | Maximum reasoning — best overall results |
Unless otherwise noted, the headline comparisons below use the V4-Pro Think Max configuration (abbreviated V4-Pro-Max). That's the figure quoted whenever you see "DeepSeek V4" in headlines.
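The benchmark report doesn't spell out how these modes are selected at the API level, but if V4 keeps the OpenAI-compatible interface of earlier DeepSeek releases, switching configurations would look roughly like the sketch below. The model ID and the reasoning_effort mapping are assumptions for illustration, not confirmed values.

```python
# Hypothetical sketch: selecting a V4 configuration through an
# OpenAI-compatible client. The model ID and the reasoning_effort
# mapping are assumptions, not confirmed API values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",      # hypothetical model ID
    reasoning_effort="high",      # hypothetical: would map to "Think High"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```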
Coding and Competition Math Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 91.6% | 93.5% | 88.8% | N/A | 91.7% |
| Codeforces Rating | 3052 | 3206 | N/A | 3168 | 3052 |
| HMMT 2026 Feb (Pass@1) | 94.8% | 95.2% | 96.2% | 97.7% | 94.7% |
| IMOAnswerBench (Pass@1) | 88.4% | 89.8% | 75.3% | 91.4% | 81.0% |
Standout results:
- V4-Pro-Max posts the highest Codeforces rating of any model tested (3206), ahead of GPT-5.4 (3168); Claude Opus 4.6 did not report a rating
- V4-Pro-Max leads on LiveCodeBench (93.5%) among the models with available data
- On competition math, GPT-5.4 stays ahead on both IMOAnswerBench (91.4% vs 89.8%) and HMMT 2026 (97.7% vs 95.2%)
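A note on the metric: Pass@1 is the fraction of problems solved by a single sampled solution. When evaluators draw several samples per problem, the usual estimate is the unbiased pass@k formula from the HumanEval paper; the sketch below shows the standard formula, not DeepSeek's own harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them correct.

    Estimates the probability that at least one of k randomly chosen
    samples (out of the n drawn) solves the problem.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 correct -> estimated Pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```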
Knowledge and Reasoning Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| MMLU-Pro (EM) | 86.2% | 87.5% | 89.1% | 87.5% | 91.0% |
| GPQA Diamond (Pass@1) | 88.1% | 90.1% | 91.3% | 93.0% | 94.3% |
| HLE (Pass@1) | 34.8% | 37.7% | 40.0% | 39.8% | 44.4% |
| SimpleQA-Verified (Pass@1) | 34.1% | 57.9% | 46.2% | 45.3% | 75.6% |
| Apex Shortlist (Pass@1) | 85.7% | 90.2% | 85.9% | 78.1% | 89.1% |
Key observations:
- Gemini-3.1-Pro leads on most knowledge benchmarks (MMLU-Pro, GPQA Diamond, SimpleQA, HLE)
- V4-Pro-Max leads on Apex Shortlist (90.2%) — a hard reasoning benchmark
- V4-Pro-Max's SimpleQA-Verified score (57.9%) clearly beats Opus 4.6 (46.2%) and GPT-5.4 (45.3%), suggesting solid factual recall, though Gemini-3.1-Pro remains well ahead at 75.6%
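For context on the metric, MMLU-Pro is reported as exact match (EM): the model's final answer must equal the reference after light normalization. Below is a minimal sketch of that style of scoring; the normalization rules are illustrative, not the benchmark's exact specification.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    answer = answer.strip().lower()
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    answer = re.sub(r"[^\w\s]", "", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

predictions = ["(B)", "Paris", "mitochondria"]
references = ["B", "paris", "ribosome"]
em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(references)
print(f"EM: {em:.1%}")  # EM: 66.7%
```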
Long-Context Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | Gemini-3.1-Pro High |
|---|---|---|---|---|
| MRCR 1M (MMR) | 78.7% | 83.5% | 92.9% | 76.3% |
| CorpusQA 1M (ACC) | 60.5% | 62.0% | 71.7% | 53.8% |
Analysis:
- Claude Opus 4.6 leads both long-context tests (92.9% on MRCR 1M, 71.7% on CorpusQA 1M)
- V4-Pro trails Opus 4.6 on MRCR 1M by a sizable margin (83.5% vs 92.9%)
- Both V4 variants beat Gemini-3.1-Pro on CorpusQA 1M (62.0% and 60.5% vs 53.8%), a useful signal for RAG-style long-document workloads
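Neither MRCR nor CorpusQA can be reproduced from the report alone, but the basic shape of a long-context retrieval check is easy to illustrate: bury a known fact deep in a long document and ask the model to recover it. The sketch below is a simplified needle-in-a-haystack probe, not either benchmark's protocol, and the model ID is an assumption.

```python
# Simplified needle-in-a-haystack probe against an OpenAI-compatible
# endpoint. Illustrates what long-context retrieval evals measure; it is
# not the MRCR or CorpusQA harness, and the model ID is an assumption.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

filler = "The sky was grey over the harbor that morning. " * 80_000  # ~1M tokens
needle = "The vault access code is 7141."
document = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # hypothetical model ID
    messages=[{"role": "user",
               "content": document + "\n\nWhat is the vault access code?"}],
)
print("retrieved" if "7141" in resp.choices[0].message.content else "missed")
```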
Agentic Task Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| Terminal Bench 2.0 (Acc) | 56.9% | 67.9% | 65.4% | 75.1% | 68.5% |
| SWE-bench Verified (Resolved) | 79.0% | 80.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro (Resolved) | 52.6% | 55.4% | 57.3% | 57.7% | 54.2% |
| BrowseComp (Pass@1) | 73.2% | 83.4% | 83.7% | 82.7% | 85.9% |
| MCPAtlas Public (Pass@1) | 69.0% | 73.6% | 73.8% | 67.2% | 69.2% |
| Toolathlon (Pass@1) | 47.8% | 51.8% | 47.2% | 54.6% | 48.8% |
Standout results:
- SWE-bench Verified: V4-Pro (80.6%) ties Gemini-3.1-Pro (80.6%) and nearly matches Opus 4.6 (80.8%) — remarkable for an open model
- MCPAtlas: V4-Pro (73.6%) nearly matches Opus 4.6 (73.8%), the category leader
- Terminal Bench 2.0: GPT-5.4 leads (75.1%), with V4-Pro behind at 67.9%
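"Resolved" on SWE-bench means the model's generated patch applies to the repository and makes the issue's failing tests pass without breaking the existing ones. The sketch below shows that bookkeeping in heavily simplified form; the real harness runs each instance in an isolated container with pinned dependencies.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Instance:
    repo_dir: str             # repository checked out at the issue's base commit
    patch_file: str           # model-generated patch
    fail_to_pass: list[str]   # tests that must go from failing to passing
    pass_to_pass: list[str]   # tests that must keep passing

def tests_pass(repo_dir: str, tests: list[str]) -> bool:
    """Return True if every listed test passes."""
    result = subprocess.run(["python", "-m", "pytest", *tests],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def is_resolved(inst: Instance) -> bool:
    applied = subprocess.run(["git", "apply", inst.patch_file],
                             cwd=inst.repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not even apply
    return (tests_pass(inst.repo_dir, inst.fail_to_pass)
            and tests_pass(inst.repo_dir, inst.pass_to_pass))

def resolved_rate(instances: list[Instance]) -> float:
    return sum(is_resolved(i) for i in instances) / len(instances)
```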
Base Model Benchmarks
The base-model results (pre-trained checkpoints, before instruction tuning) show impressive raw capability:
| Benchmark | DS-V3.2-Base | V4-Flash-Base | V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8% | 88.7% | 90.1% |
| MMLU-Redux (EM) | 87.5% | 89.4% | 90.8% |
| GSM8K (EM) | 91.1% | 90.8% | 92.6% |
| HumanEval (Pass@1) | 62.8% | 69.5% | 76.8% |
| LongBench-V2 (EM) | 40.2% | 44.7% | 51.5% |
V4-Pro-Base consistently outperforms both V3.2-Base and V4-Flash-Base across all categories.
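The HumanEval row is scored by executing each generated completion against the problem's unit tests; Pass@1 is then the fraction of problems whose single sample passes. Below is a bare-bones sketch of that correctness check, with the caveat that exec() of model output should only ever run inside a sandbox.

```python
def passes_unit_tests(prompt: str, completion: str,
                      test_code: str, entry_point: str) -> bool:
    """Decide correctness for one HumanEval-style sample.

    WARNING: exec() of model output is unsafe outside a sandbox; this is
    only an illustration of how a sample is judged correct or not.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Pass@1 over the suite is then: solved problems / total problems,
# with one sampled completion per problem.
```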
Summary: Where DeepSeek V4 Leads vs. Lags
V4-Pro-Max leads the field on:
- Codeforces competitive programming (rating 3206)
- LiveCodeBench (93.5%)
- Apex Shortlist reasoning (90.2%)
- SimpleQA-Verified factual recall (57.9%) among non-Gemini models; Gemini-3.1-Pro still leads outright at 75.6%
V4-Pro-Max trails the field on:
- GPQA Diamond (Gemini leads at 94.3%)
- HLE hardest reasoning (Gemini leads at 44.4%)
- MRCR 1M long context (Opus 4.6 leads at 92.9%)
- Terminal Bench 2.0 agentic tasks (GPT-5.4 leads at 75.1%)
For AI-native platforms and tools like Framia.pro, where coding, agentic tasks, and long-context comprehension are core use cases, DeepSeek V4-Pro's benchmark profile makes it one of the most compelling choices available in 2026.
Conclusion
DeepSeek V4-Pro is the best open-weight model across almost every benchmark category, and it competes meaningfully with every closed-source frontier model. Its most exceptional performance is in competitive coding, where it outperforms all other models tested. It trails slightly on the very hardest scientific reasoning and long-document retrieval tasks, but the gaps are narrowing.