DeepSeek V4 Benchmarks: How It Scores on LiveCodeBench, MMLU, SWE-bench, and More

DeepSeek V4-Pro scores 93.5% on LiveCodeBench, 3206 on Codeforces, and 90.1% on GPQA Diamond. A complete benchmark analysis across all modes and competitors.

by Framia


DeepSeek V4 arrived on April 24, 2026, with bold claims: the best open-source model available, a top Codeforces rating, and near-frontier performance across reasoning, knowledge, and agentic tasks. Here's a complete analysis of every major benchmark result — separated by model variant and reasoning mode.


Understanding DeepSeek V4's Benchmark Modes

DeepSeek V4 reports results across six configurations:

| Configuration | Description |
| --- | --- |
| V4-Flash Non-Think | Fast, no chain-of-thought |
| V4-Flash Think High | Moderate extended reasoning |
| V4-Flash Think Max | Maximum reasoning effort (Flash) |
| V4-Pro Non-Think | Fast, no chain-of-thought (Pro) |
| V4-Pro Think High | Moderate extended reasoning (Pro) |
| V4-Pro Think Max | Maximum reasoning; best overall results |

Most head-to-head comparisons quote the V4-Pro Think Max configuration (V4-Pro-Max for short); that's the figure behind any "DeepSeek V4" number you see in headlines.
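
To make the mode labels concrete, here is a minimal sketch of how a caller might pick a variant and reasoning effort through an OpenAI-compatible client. The model IDs and the `reasoning_effort` field are illustrative assumptions; DeepSeek's actual V4 parameter names may differ.

```python
# Hypothetical sketch: selecting a DeepSeek V4 variant and reasoning mode
# via an OpenAI-compatible client. The model IDs and reasoning_effort values
# below are illustrative assumptions, not documented parameters.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def ask(prompt: str, variant: str = "deepseek-v4-pro", effort: str | None = "max") -> str:
    """variant: 'deepseek-v4-flash' or 'deepseek-v4-pro' (assumed names).
    effort: None (Non-Think), 'high' (Think High), or 'max' (Think Max)."""
    extra = {"reasoning_effort": effort} if effort else {}
    resp = client.chat.completions.create(
        model=variant,
        messages=[{"role": "user", "content": prompt}],
        extra_body=extra,  # passes the assumed reasoning knob through verbatim
    )
    return resp.choices[0].message.content

# V4-Pro Think Max: the configuration behind most headline numbers
print(ask("Prove that 2^n > n^2 for all integers n >= 5.", effort="max"))
```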


Coding Benchmarks

| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
| --- | --- | --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 91.6% | 93.5% | 88.8% | N/A | 91.7% |
| Codeforces Rating | 3052 | 3206 | N/A | 3168 | 3052 |
| HMMT 2026 Feb (Pass@1) | 94.8% | 95.2% | 96.2% | 97.7% | 94.7% |
| IMOAnswerBench (Pass@1) | 88.4% | 89.8% | 75.3% | 91.4% | 81.0% |

Standout results:

  • V4-Pro-Max posts the highest Codeforces rating of any model tested (3206), ahead of GPT-5.4 (3168) and Gemini-3.1-Pro (3052); Claude Opus 4.6 did not report a rating
  • V4-Pro-Max leads on LiveCodeBench (93.5%) among the models with available data
  • On competition math, GPT-5.4 edges ahead on both HMMT 2026 Feb (97.7% vs 95.2%) and IMOAnswerBench (91.4% vs 89.8%)
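
A note on the metric: Pass@1 is the fraction of problems solved with a single sampled solution. When labs draw several samples per problem, they usually report the unbiased pass@k estimator from the HumanEval paper; the sketch below implements that standard formula for reference and is not DeepSeek's published evaluation harness.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
# Language Models Trained on Code"). n = samples drawn per problem,
# c = samples that passed; pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over a benchmark; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# e.g. 3 problems, 8 samples each, with 8, 5, and 0 passing samples
print(benchmark_pass_at_k([(8, 8), (8, 5), (8, 0)], k=1))  # ~0.542
```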

Knowledge and Reasoning Benchmarks

| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro (EM) | 86.2% | 87.5% | 89.1% | 87.5% | 91.0% |
| GPQA Diamond (Pass@1) | 88.1% | 90.1% | 91.3% | 93.0% | 94.3% |
| HLE (Pass@1) | 34.8% | 37.7% | 40.0% | 39.8% | 44.4% |
| SimpleQA-Verified (Pass@1) | 34.1% | 57.9% | 46.2% | 45.3% | 75.6% |
| Apex Shortlist (Pass@1) | 85.7% | 90.2% | 85.9% | 78.1% | 89.1% |

Key observations:

  • Gemini-3.1-Pro leads on most knowledge benchmarks (MMLU-Pro, GPQA Diamond, SimpleQA, HLE)
  • V4-Pro-Max leads on Apex Shortlist (90.2%) — a hard reasoning benchmark
  • V4-Pro-Max's SimpleQA score (57.9%) significantly beats Opus 4.6 (46.2%) and GPT-5.4 (45.3%) — indicating strong factual recall

Long-Context Benchmarks

| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | Gemini-3.1-Pro High |
| --- | --- | --- | --- | --- |
| MRCR 1M (MMR) | 78.7% | 83.5% | 92.9% | 76.3% |
| CorpusQA 1M (ACC) | 60.5% | 62.0% | 71.7% | 53.8% |

Analysis:

  • Both V4 variants beat Gemini-3.1-Pro on CorpusQA 1M (62.0% and 60.5% vs 53.8%), making them a strong fit for RAG-style workloads
  • Claude Opus 4.6 leads MRCR 1M by a clear margin (92.9% vs 83.5%), so it remains the stronger choice for long-document retrieval

Agentic Task Benchmarks

| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
| --- | --- | --- | --- | --- | --- |
| Terminal Bench 2.0 (Acc) | 56.9% | 67.9% | 65.4% | 75.1% | 68.5% |
| SWE-bench Verified (Resolved) | 79.0% | 80.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro (Resolved) | 52.6% | 55.4% | 57.3% | 57.7% | 54.2% |
| BrowseComp (Pass@1) | 73.2% | 83.4% | 83.7% | 82.7% | 85.9% |
| MCPAtlas Public (Pass@1) | 69.0% | 73.6% | 73.8% | 67.2% | 69.2% |
| Toolathlon (Pass@1) | 47.8% | 51.8% | 47.2% | 54.6% | 48.8% |

Standout results:

  • SWE-bench Verified: V4-Pro (80.6%) ties Gemini-3.1-Pro (80.6%) and nearly matches Opus 4.6 (80.8%) — remarkable for an open model
  • MCPAtlas: V4-Pro (73.6%) nearly matches Opus 4.6 (73.8%), the category leader
  • Terminal Bench 2.0: GPT-5.4 leads (75.1%), with V4-Pro behind at 67.9%
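
For readers unfamiliar with the SWE-bench metric: "Resolved" counts an instance only if the model's patch applies and the previously failing tests now pass without breaking the previously passing ones. The sketch below shows that accounting in simplified form, assuming per-instance test outcomes are already available; the real harness runs each patch in an isolated container.

```python
# Simplified accounting for a SWE-bench-style "Resolved" rate. In the real
# harness each patch is applied and tested in an isolated environment; here we
# assume per-instance test outcomes are already known.
from dataclasses import dataclass

@dataclass
class InstanceResult:
    patch_applied: bool     # did the model's patch apply cleanly?
    fail_to_pass_ok: bool   # previously failing tests now pass
    pass_to_pass_ok: bool   # previously passing tests still pass (no regressions)

def is_resolved(r: InstanceResult) -> bool:
    return r.patch_applied and r.fail_to_pass_ok and r.pass_to_pass_ok

def resolved_rate(results: list[InstanceResult], total_instances: int) -> float:
    """Resolved % = resolved instances / all instances, counting missing
    or empty patches as unresolved."""
    return 100.0 * sum(is_resolved(r) for r in results) / total_instances

# e.g. SWE-bench Verified has 500 instances; resolving 403 of them gives 80.6%
```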

Base Model Benchmarks

The base-model results (pre-trained, before instruction tuning) show impressive raw capability:

| Benchmark | DS-V3.2-Base | V4-Flash-Base | V4-Pro-Base |
| --- | --- | --- | --- |
| MMLU (EM) | 87.8% | 88.7% | 90.1% |
| MMLU-Redux (EM) | 87.5% | 89.4% | 90.8% |
| GSM8K (EM) | 91.1% | 90.8% | 92.6% |
| HumanEval (Pass@1) | 62.8% | 69.5% | 76.8% |
| LongBench-V2 (EM) | 40.2% | 44.7% | 51.5% |

V4-Pro-Base consistently outperforms both V3.2-Base and V4-Flash-Base across all categories.
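
The (EM) columns are exact-match scores: the model's final answer must equal the reference after light normalization. A minimal scorer along those lines is sketched below; the exact normalization rules vary by benchmark and aren't specified in DeepSeek's report.

```python
# Minimal exact-match (EM) scorer. Real harnesses differ in their
# normalization rules (articles, punctuation, multiple-choice letter
# extraction); this only shows the common shape of the metric.
import re
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)  # drop punctuation
    return re.sub(r"\s+", " ", text)                                # collapse whitespace

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def em_score(pairs: list[tuple[str, str]]) -> float:
    """Average EM over (prediction, reference) pairs, as a percentage."""
    return 100.0 * sum(exact_match(p, r) for p, r in pairs) / len(pairs)

print(em_score([("The answer is (C)", "the answer is c"), ("42", "43")]))  # 50.0
```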


Summary: Where DeepSeek V4 Leads vs. Lags

V4-Pro-Max leads the field on:

  • Codeforces competitive programming (rating 3206)
  • LiveCodeBench (93.5%)
  • Apex Shortlist reasoning (90.2%)
  • SimpleQA-Verified factual recall (57.9%), ahead of every non-Gemini model tested

V4-Pro-Max trails the field on:

  • GPQA Diamond (Gemini leads at 94.3%)
  • HLE hardest reasoning (Gemini leads at 44.4%)
  • MRCR 1M long context (Opus 4.6 leads at 92.9%)
  • Terminal Bench 2.0 agentic tasks (GPT-5.4 leads at 75.1%)

For AI-native platforms and tools like Framia.pro where coding, agentic tasks, and long-context comprehension are core use cases, DeepSeek V4-Pro's benchmark profile makes it one of the most compelling choices available in 2026.


Conclusion

DeepSeek V4-Pro is the best open-weight model across almost every benchmark category, and it competes meaningfully with every closed-source frontier model. Its most exceptional performance is in competitive coding, where it outperforms all other models tested. It trails slightly on the very hardest scientific reasoning and long-document retrieval tasks, but the gaps are narrowing.