DeepSeek V4 Paper: Key Technical Findings From the Official Report

Summary of the DeepSeek V4 technical paper: Hybrid Attention (CSA+HCA), mHC, Muon optimizer, two-stage post-training, and all key benchmark findings explained.

by Framia

DeepSeek released the full technical report for DeepSeek V4 alongside the model weights on April 24, 2026. Titled "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence", it's a comprehensive academic document covering the model's architecture, training methodology, and evaluation results.

This article summarizes the most important technical findings for researchers, engineers, and technically curious practitioners.


Paper Overview

Title: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Authors: DeepSeek-AI
Year: 2026
Available at: huggingface.co/deepseek-ai/DeepSeek-V4-Pro (in the repository files as DeepSeek_V4.pdf)

The paper introduces the V4 series — DeepSeek-V4-Pro (1.6T / 49B active) and DeepSeek-V4-Flash (284B / 13B active) — and details three major innovations: the Hybrid Attention Architecture, mHC (Manifold-Constrained Hyper-Connections), and the Muon Optimizer.


Finding 1: The 1M-Token Context Problem and Its Solution

The paper's central contribution is solving the challenge of making 1-million-token context practical — not just theoretically possible.

The problem: Standard attention mechanisms scale quadratically with sequence length, so at 1M tokens standard attention would require (rough numbers are sketched after this list):

  • Orders of magnitude more compute per token
  • Impractically large KV cache memory
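
To make that concrete, here is a rough back-of-envelope estimate in Python. Every model dimension below is an illustrative assumption, not a figure from the paper:

SEQ_LEN = 1_000_000        # 1M-token context
N_LAYERS = 60              # assumed transformer depth
N_KV_HEADS = 8             # assumed KV heads (grouped-query attention)
HEAD_DIM = 128             # assumed per-head dimension
BYTES_PER_VALUE = 2        # BF16 cache entries

# The attention score matrix holds one score per token pair, so it
# grows quadratically with sequence length.
pairwise_scores = SEQ_LEN ** 2                       # 1e12 entries per head per layer

# Uncompressed KV cache: keys + values for every layer, head, and token.
kv_cache_bytes = (2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
                  * SEQ_LEN * BYTES_PER_VALUE)

print(f"score entries per head/layer: {pairwise_scores:.1e}")   # 1.0e+12
print(f"raw KV cache: {kv_cache_bytes / 1e9:.0f} GB")           # ~246 GB

Under these assumptions, a single 1M-token sequence already needs roughly a quarter-terabyte of raw KV cache, before counting the trillion-entry score matrix — which is why V4 compresses both.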

The solution is the Hybrid Attention Architecture: the paper combines two complementary attention mechanisms, summarized below and sketched in code afterwards.

Compressed Sparse Attention (CSA):

  • Applies token-wise key-value compression
  • Maintains high fidelity for recent and relevant tokens
  • Reduces attention overhead for moderately distant context

Heavily Compressed Attention (HCA):

  • Applies aggressive compression to very distant tokens
  • Essentially creates compact representations of distant history
  • Enables the model to "remember" over very long horizons at minimal cost
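
A minimal sketch of the tiered idea, using simple average pooling as a stand-in for the paper's learned compression operators; the tier boundaries and strides are invented values:

import numpy as np

def tiered_kv_compress(keys, values, recent=4096, csa_stride=8, hca_stride=64):
    # Three tiers over the sequence: full fidelity for recent tokens,
    # moderate pooling for mid-range context (CSA-like), aggressive
    # pooling for distant history (HCA-like). Pooling stands in for the
    # paper's learned compression.
    def pool(x, stride):
        n = (x.shape[0] // stride) * stride
        if n == 0:
            return x[:0]
        return x[:n].reshape(-1, stride, x.shape[-1]).mean(axis=1)

    T = keys.shape[0]
    mid_end = max(T - recent, 0)       # everything older than `recent`
    far_end = mid_end // 2             # assumed split of the older region

    k = np.concatenate([pool(keys[:far_end], hca_stride),         # HCA tier
                        pool(keys[far_end:mid_end], csa_stride),  # CSA tier
                        keys[mid_end:]])                          # recent, uncompressed
    v = np.concatenate([pool(values[:far_end], hca_stride),
                        pool(values[far_end:mid_end], csa_stride),
                        values[mid_end:]])
    return k, v

keys = values = np.zeros((1_000_000, 16))
k, v = tiered_kv_compress(keys, values)
print(k.shape[0] / keys.shape[0])      # ~0.07: the cache shrinks by over 10x

With these placeholder settings, the cache drops by more than an order of magnitude — the qualitative effect the paper reports, though its actual savings come from learned operators rather than fixed pooling.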

Quantified result: In the 1M-token context setting, V4-Pro requires only 27% of the per-token inference FLOPs and 10% of the KV cache memory of DeepSeek-V3.2. This is the paper's most significant practical contribution.


Finding 2: Manifold-Constrained Hyper-Connections (mHC)

Standard residual connections in deep transformers can suffer from gradient degradation as network depth increases. The paper introduces mHC to address this.

The innovation: mHC constrains weight updates to lie on a Riemannian manifold — a smooth geometric space. This:

  • Strengthens signal propagation across layers
  • Prevents gradient explosion/vanishing in very deep networks
  • Preserves model expressivity while improving stability

Practical effect: mHC enables reliable training at 1.6 trillion parameters. Without this stability improvement, scaling to that parameter count with the Hybrid Attention Architecture would be significantly more challenging.
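
The paper's exact construction is not reproduced in this summary, so the sketch below uses a common stand-in for a manifold constraint: projecting each residual mixing matrix onto the orthogonal manifold, where matrix-vector products preserve norms:

import numpy as np

def project_to_orthogonal(W):
    # Nearest orthogonal matrix to W (polar factor via SVD). Orthogonal
    # matrices form a Riemannian manifold on which multiplication
    # preserves vector norms, so signals can neither explode nor vanish.
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

# Push a signal through 64 residual "mixing" layers, with and without
# the constraint.
rng = np.random.default_rng(0)
x_free = x_con = rng.standard_normal(512)
for _ in range(64):
    W = np.eye(512) + 0.1 * rng.standard_normal((512, 512))
    x_free = W @ x_free
    x_con = project_to_orthogonal(W) @ x_con

print(f"unconstrained norm: {np.linalg.norm(x_free):.3e}")  # blows up
print(f"constrained norm:   {np.linalg.norm(x_con):.3e}")   # stays ~22.6

Whatever the paper's precise manifold, the stabilizing logic is the one shown here: keeping each layer's mixing on a constraint set bounds signal growth regardless of depth.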


Finding 3: The Muon Optimizer

The paper details the adoption of the Muon Optimizer to replace standard AdamW-based training.

Muon works by orthogonalizing each gradient update, removing correlations between update directions (a simplified sketch follows this list):

  • Gradient steps are more independent
  • Convergence is faster: the model learns more per training step
  • Training is more stable at very large scale
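
Muon itself is published prior work, so its core step can be sketched directly: each 2-D weight matrix's momentum-smoothed gradient is replaced by an approximately orthogonal matrix via a Newton-Schulz iteration. The coefficients below follow the public reference implementation; the V4 paper's exact hyperparameters may differ:

import numpy as np

def orthogonalize(G, steps=5, eps=1e-7):
    # Newton-Schulz iteration driving G toward the nearest semi-orthogonal
    # matrix; quintic coefficients from the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # bring the spectral scale near 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                             # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    # One simplified Muon update for a single 2-D weight matrix
    # (momentum handling simplified relative to the reference code).
    buf = beta * buf + grad
    W = W - lr * orthogonalize(buf)
    return W, buf

Because the update direction is near-orthogonal, its singular values are all of similar size, so no single direction dominates the step — the decorrelation the paper credits for faster, more stable convergence.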

Combined with pre-training on 32T+ diverse tokens, this optimization recipe produces models with strong coverage across world knowledge, code, math, science, and multilingual text.


Finding 4: Two-Stage Post-Training Pipeline

One of the paper's most distinctive contributions is its post-training methodology:

Stage 1: Independent Expert Cultivation

  • Each MoE expert is trained independently on its specialization domain
  • Uses SFT (Supervised Fine-Tuning) plus RL with GRPO (Group Relative Policy Optimization); GRPO's advantage estimate is sketched after this stage
  • Each expert develops deep, narrow proficiency
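
GRPO is also published prior work (introduced in DeepSeekMath), and its key move is a critic-free advantage estimate: each sampled response is scored relative to its own group. A minimal sketch with made-up rewards:

import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: normalize each response's reward by the
    # mean and std of the group sampled for the SAME prompt. No learned
    # value model (critic) is needed.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. 8 responses sampled for one prompt, scored by a reward signal
print(grpo_advantages([0.1, 0.9, 0.4, 0.4, 0.7, 0.2, 0.8, 0.5]))

These advantages then weight a PPO-style clipped policy update: positive values reinforce the better-than-group responses, negative values suppress the worse ones.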

Stage 2: Unified Model Consolidation

  • On-policy distillation integrates the diverse expert proficiencies into a single model (a loss sketch follows this list)
  • The final model has access to all domain expertise without needing to switch between separate models
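
The summary names on-policy distillation without the loss details; a common formulation — offered here purely as an assumption, not the paper's confirmed objective — is a per-token reverse KL computed on sequences the student itself generated:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    # Per-token reverse KL, KL(student || teacher), averaged over the
    # tokens of a student-sampled sequence. Logit arrays have shape
    # (seq_len, vocab). This loss form is an assumption, not the paper's
    # confirmed objective.
    p_s = softmax(student_logits)
    log_ratio = np.log(p_s + 1e-12) - np.log(softmax(teacher_logits) + 1e-12)
    return (p_s * log_ratio).sum(axis=-1).mean()

Sampling from the student ("on-policy") means the teacher corrects errors the student actually makes, rather than errors it would never produce under teacher-forced data.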

This pipeline explains why V4-Pro shows unusually strong performance across very different task types simultaneously: deep world knowledge, frontier coding, and long-context retrieval.


Finding 5: MoE Architecture Details

The paper describes the MoE implementation in detail:

V4-Pro Expert Configuration:

  • 1.6T total parameters across all experts
  • 49B activated per token
  • A learned router selects the relevant experts for each token (top-k routing is sketched at the end of this finding)
  • Expert parameters stored in FP4 precision (most other weights in FP8)

V4-Flash:

  • 284B total / 13B active
  • Same architectural innovations but at smaller scale
  • Uses same FP4 + FP8 mixed precision scheme

The paper notes that V4-Flash, despite being smaller than V3.2 (671B / 37B), achieves comparable or better performance on most benchmarks — demonstrating the efficiency gains from the new architecture.
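
To illustrate the routing step, here is a minimal top-k router. The expert count, k, and hidden dimension are invented values; the summary above does not pin V4 to these numbers:

import numpy as np

def route_token(x, router_W, k=8):
    # One logit per expert; keep the k best and softmax only over them.
    logits = router_W @ x
    topk = np.argpartition(logits, -k)[-k:]    # indices of the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    return topk, gates / gates.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                  # token hidden state (assumed dim)
router_W = rng.standard_normal((256, 4096))    # 256 experts (assumed count)
experts, gates = route_token(x, router_W)

# Only the selected experts' FP4 weights are read for this token, which
# is why active parameters (49B) stay far below the 1.6T total.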


Finding 6: Base Model Evaluation

The paper provides extensive base model (pre-instruction-tuning) benchmark results, establishing that V4-Pro's capabilities emerge strongly from pre-training:

Key base model results (V4-Pro-Base vs V3.2-Base):

  • MMLU: 90.1% vs 87.8% (+2.3pp)
  • MMLU-Redux: 90.8% vs 87.5% (+3.3pp)
  • Simple-QA verified: 55.2% vs 28.3% (+26.9pp)
  • HumanEval: 76.8% vs 62.8% (+14pp)
  • LongBench-V2: 51.5% vs 40.2% (+11.3pp)

The Simple-QA verified jump (+26.9pp) is particularly striking — indicating fundamental improvements in world knowledge grounding at the base model level.


Finding 7: Three-Mode Inference System

The paper introduces the three-mode reasoning framework as a first-class architectural feature (a hypothetical request sketch follows the mode list):

Non-think: The model generates direct responses without an explicit chain-of-thought
Think High: A controlled thinking process with budgeted token allocation
Think Max: Extended reasoning with a special system prompt, requiring 384K+ tokens of context headroom
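
The paper defines the modes, not an API surface. Purely as a hypothetical illustration, selecting a mode might look like the payloads below, where reasoning_mode and thinking_budget are invented field names:

# Hypothetical request payloads: the field names below are invented for
# illustration and are NOT documented parameters.

non_think  = {"model": "deepseek-v4-pro", "reasoning_mode": "none"}

think_high = {"model": "deepseek-v4-pro",
              "reasoning_mode": "high",
              "thinking_budget": 32_000}      # hypothetical capped thinking tokens

think_max  = {"model": "deepseek-v4-pro",
              "reasoning_mode": "max",        # pairs with the special system prompt
              "max_context": 1_000_000}       # paper: needs 384K+ tokens of headroom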

The paper demonstrates that Think Max significantly closes the gap with closed-source frontier models on hard reasoning benchmarks — suggesting that reasoning depth, not just parameter count, is a key determinant of performance on complex tasks.


Finding 8: Agentic Performance

The paper emphasizes DeepSeek's focus on agentic capabilities, reporting strong results on:

  • SWE-bench Verified: 80.6% (matches Gemini-3.1-Pro, nearly matches Claude Opus 4.6)
  • Terminal Bench 2.0: 67.9% (competitive with best open models)
  • MCPAtlas: 73.6% (near SOTA)

The paper also notes integration with Claude Code, OpenClaw, and OpenCode as first-class supported deployment environments.


Citation

For academic use:

@misc{deepseekai2026deepseekv4,
  title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author={DeepSeek-AI},
  year={2026},
  url={https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro},
}

Conclusion

The DeepSeek V4 technical report is a dense, high-quality academic document that genuinely advances the field. Its core contributions — Hybrid Attention (CSA + HCA), mHC, and the two-stage post-training pipeline — are concrete, reproducible innovations that the broader AI research community can study and build upon. Platforms like Framia.pro that leverage frontier AI models benefit directly from the architectural advances documented in papers like this, which drive both capability improvements and cost reductions across the ecosystem.