DeepSeek V4 Benchmarks: How It Scores on LiveCodeBench, MMLU, SWE-bench, and More
DeepSeek V4 arrived on April 24, 2026, with bold claims: the best open-source model available, a top Codeforces rating, and near-frontier performance across reasoning, knowledge, and agentic tasks. Here's a complete analysis of every major benchmark result — separated by model variant and reasoning mode.
Understanding DeepSeek V4's Benchmark Modes
DeepSeek V4 reports results across six configurations:
| Configuration | Description |
|---|---|
| V4-Flash Non-Think | Fast, no chain-of-thought |
| V4-Flash Think High | Moderate extended reasoning |
| V4-Flash Think Max | Maximum reasoning effort (Flash) |
| V4-Pro Non-Think | Fast, no chain-of-thought (Pro) |
| V4-Pro Think High | Moderate extended reasoning (Pro) |
| V4-Pro Think Max | Maximum reasoning — best overall results |
Unless otherwise noted, the headline comparisons below use the V4-Pro Think Max configuration (abbreviated V4-Pro-Max). That's the figure quoted whenever you see "DeepSeek V4" in headlines.
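The benchmark report doesn't spell out how these modes are selected at the API level, but if V4 keeps the OpenAI-compatible interface of earlier DeepSeek releases, switching configurations would look roughly like the sketch below. The model ID and the reasoning_effort mapping are assumptions for illustration, not confirmed values.

```python
# Hypothetical sketch: selecting a V4 configuration through an
# OpenAI-compatible client. The model ID and the reasoning_effort
# mapping are assumptions, not confirmed API values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",      # hypothetical model ID
    reasoning_effort="high",      # hypothetical: would map to "Think High"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```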
Coding and Competition Math Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 91.6% | 93.5% | 88.8% | N/A | 91.7% |
| Codeforces Rating | 3052 | 3206 | N/A | 3168 | 3052 |
| HMMT 2026 Feb (Pass@1) | 94.8% | 95.2% | 96.2% | 97.7% | 94.7% |
| IMOAnswerBench (Pass@1) | 88.4% | 89.8% | 75.3% | 91.4% | 81.0% |
Standout results:
- V4-Pro-Max posts the highest Codeforces rating of any model tested (3206), ahead of GPT-5.4 (3168); Claude Opus 4.6 did not report a rating
- V4-Pro-Max leads on LiveCodeBench (93.5%) among the models with available data
- On competition math, GPT-5.4 stays ahead on both IMOAnswerBench (91.4% vs 89.8%) and HMMT 2026 (97.7% vs 95.2%)
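A note on the metric: Pass@1 is the fraction of problems solved by a single sampled solution. When evaluators draw several samples per problem, the usual estimate is the unbiased pass@k formula from the HumanEval paper; the sketch below shows the standard formula, not DeepSeek's own harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them correct.

    Estimates the probability that at least one of k randomly chosen
    samples (out of the n drawn) solves the problem.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 correct -> estimated Pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```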
Knowledge and Reasoning Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| MMLU-Pro (EM) | 86.2% | 87.5% | 89.1% | 87.5% | 91.0% |
| GPQA Diamond (Pass@1) | 88.1% | 90.1% | 91.3% | 93.0% | 94.3% |
| HLE (Pass@1) | 34.8% | 37.7% | 40.0% | 39.8% | 44.4% |
| SimpleQA-Verified (Pass@1) | 34.1% | 57.9% | 46.2% | 45.3% | 75.6% |
| Apex Shortlist (Pass@1) | 85.7% | 90.2% | 85.9% | 78.1% | 89.1% |
Key observations:
- Gemini-3.1-Pro leads on most knowledge benchmarks (MMLU-Pro, GPQA Diamond, SimpleQA, HLE)
- V4-Pro-Max leads on Apex Shortlist (90.2%) — a hard reasoning benchmark
- V4-Pro-Max's SimpleQA-Verified score (57.9%) clearly beats Opus 4.6 (46.2%) and GPT-5.4 (45.3%), suggesting solid factual recall, though Gemini-3.1-Pro remains well ahead at 75.6%
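For context on the metric, MMLU-Pro is reported as exact match (EM): the model's final answer must equal the reference after light normalization. Below is a minimal sketch of that style of scoring; the normalization rules are illustrative, not the benchmark's exact specification.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    answer = answer.strip().lower()
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    answer = re.sub(r"[^\w\s]", "", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

predictions = ["(B)", "Paris", "mitochondria"]
references = ["B", "paris", "ribosome"]
em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(references)
print(f"EM: {em:.1%}")  # EM: 66.7%
```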
Long-Context Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | Gemini-3.1-Pro High |
|---|---|---|---|---|
| MRCR 1M (MMR) | 78.7% | 83.5% | 92.9% | 76.3% |
| CorpusQA 1M (ACC) | 60.5% | 62.0% | 71.7% | 53.8% |
Analysis:
- Claude Opus 4.6 leads both long-context tests (92.9% on MRCR 1M, 71.7% on CorpusQA 1M)
- V4-Pro trails Opus 4.6 on MRCR 1M by a sizable margin (83.5% vs 92.9%)
- Both V4 variants beat Gemini-3.1-Pro on CorpusQA 1M (62.0% and 60.5% vs 53.8%), a useful signal for RAG-style long-document workloads
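Neither MRCR nor CorpusQA can be reproduced from the report alone, but the basic shape of a long-context retrieval check is easy to illustrate: bury a known fact deep in a long document and ask the model to recover it. The sketch below is a simplified needle-in-a-haystack probe, not either benchmark's protocol, and the model ID is an assumption.

```python
# Simplified needle-in-a-haystack probe against an OpenAI-compatible
# endpoint. Illustrates what long-context retrieval evals measure; it is
# not the MRCR or CorpusQA harness, and the model ID is an assumption.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

filler = "The sky was grey over the harbor that morning. " * 80_000  # ~1M tokens
needle = "The vault access code is 7141."
document = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # hypothetical model ID
    messages=[{"role": "user",
               "content": document + "\n\nWhat is the vault access code?"}],
)
print("retrieved" if "7141" in resp.choices[0].message.content else "missed")
```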
Agentic Task Benchmarks
| Benchmark | V4-Flash Max | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|---|
| Terminal Bench 2.0 (Acc) | 56.9% | 67.9% | 65.4% | 75.1% | 68.5% |
| SWE-bench Verified (Resolved) | 79.0% | 80.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro (Resolved) | 52.6% | 55.4% | 57.3% | 57.7% | 54.2% |
| BrowseComp (Pass@1) | 73.2% | 83.4% | 83.7% | 82.7% | 85.9% |
| MCPAtlas Public (Pass@1) | 69.0% | 73.6% | 73.8% | 67.2% | 69.2% |
| Toolathlon (Pass@1) | 47.8% | 51.8% | 47.2% | 54.6% | 48.8% |
Standout results:
- SWE-bench Verified: V4-Pro (80.6%) ties Gemini-3.1-Pro (80.6%) and nearly matches Opus 4.6 (80.8%) — remarkable for an open model
- MCPAtlas: V4-Pro (73.6%) nearly matches Opus 4.6 (73.8%), the category leader
- Terminal Bench 2.0: GPT-5.4 leads (75.1%), with V4-Pro behind at 67.9%
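"Resolved" on SWE-bench means the model's generated patch applies to the repository and makes the issue's failing tests pass without breaking the existing ones. The sketch below shows that bookkeeping in heavily simplified form; the real harness runs each instance in an isolated container with pinned dependencies.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Instance:
    repo_dir: str             # repository checked out at the issue's base commit
    patch_file: str           # model-generated patch
    fail_to_pass: list[str]   # tests that must go from failing to passing
    pass_to_pass: list[str]   # tests that must keep passing

def tests_pass(repo_dir: str, tests: list[str]) -> bool:
    """Return True if every listed test passes."""
    result = subprocess.run(["python", "-m", "pytest", *tests],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def is_resolved(inst: Instance) -> bool:
    applied = subprocess.run(["git", "apply", inst.patch_file],
                             cwd=inst.repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not even apply
    return (tests_pass(inst.repo_dir, inst.fail_to_pass)
            and tests_pass(inst.repo_dir, inst.pass_to_pass))

def resolved_rate(instances: list[Instance]) -> float:
    return sum(is_resolved(i) for i in instances) / len(instances)
```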
Base Model Benchmarks
The base-model results (pre-trained checkpoints, before instruction tuning) show impressive raw capability:
| Benchmark | DS-V3.2-Base | V4-Flash-Base | V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8% | 88.7% | 90.1% |
| MMLU-Redux (EM) | 87.5% | 89.4% | 90.8% |
| GSM8K (EM) | 91.1% | 90.8% | 92.6% |
| HumanEval (Pass@1) | 62.8% | 69.5% | 76.8% |
| LongBench-V2 (EM) | 40.2% | 44.7% | 51.5% |
V4-Pro-Base consistently outperforms both V3.2-Base and V4-Flash-Base across all categories.
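The HumanEval row is scored by executing each generated completion against the problem's unit tests; Pass@1 is then the fraction of problems whose single sample passes. Below is a bare-bones sketch of that correctness check, with the caveat that exec() of model output should only ever run inside a sandbox.

```python
def passes_unit_tests(prompt: str, completion: str,
                      test_code: str, entry_point: str) -> bool:
    """Decide correctness for one HumanEval-style sample.

    WARNING: exec() of model output is unsafe outside a sandbox; this is
    only an illustration of how a sample is judged correct or not.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Pass@1 over the suite is then: solved problems / total problems,
# with one sampled completion per problem.
```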
Summary: Where DeepSeek V4 Leads vs. Lags
V4-Pro-Max leads the field on:
- Codeforces competitive programming (rating 3206)
- LiveCodeBench (93.5%)
- Apex Shortlist reasoning (90.2%)
- SimpleQA-Verified factual recall (57.9%) among non-Gemini models; Gemini-3.1-Pro still leads outright at 75.6%
V4-Pro-Max trails the field on:
- GPQA Diamond (Gemini leads at 94.3%)
- HLE hardest reasoning (Gemini leads at 44.4%)
- MRCR 1M long context (Opus 4.6 leads at 92.9%)
- Terminal Bench 2.0 agentic tasks (GPT-5.4 leads at 75.1%)
For AI-native platforms and tools like Framia.pro, where coding, agentic tasks, and long-context comprehension are core use cases, DeepSeek V4-Pro's benchmark profile makes it one of the most compelling choices available in 2026.
Conclusion
DeepSeek V4-Pro is the best open-weight model across almost every benchmark category, and it competes meaningfully with every closed-source frontier model. Its most exceptional performance is in competitive coding, where it outperforms all other models tested. It trails slightly on the very hardest scientific reasoning and long-document retrieval tasks, but the gaps are narrowing.