DeepSeek V4 Training: How the Model Was Built
Understanding how DeepSeek V4 was trained provides insight into why it performs the way it does, and which architectural and data decisions produced a model that exceeds expectations on coding, reasoning, and long-context tasks. This guide covers V4's pre-training, post-training, and the key innovations that distinguish it from its predecessors.
Pre-Training: Scale and Data
Dataset Size: 32 Trillion+ Tokens
Both V4-Pro and V4-Flash were pre-trained on more than 32 trillion tokens of diverse, high-quality data. This is nearly double DeepSeek-V3's estimated 18T pre-training tokens, representing a substantial investment in training compute and data curation.
The training corpus spans:
- Natural language — web text, books, articles in dozens of languages
- Code — source code across all major programming languages
- Mathematics — formal proofs, competition problems, textbooks
- Scientific literature — research papers across STEM disciplines
- Multilingual content — coverage across dozens of languages (reflected in V4's 90.3% MMMLU score)
Why More Data Matters
The relationship between training data scale and model capability isn't linear — but at frontier scales, more high-quality, diverse data consistently improves knowledge breadth, factual accuracy, and generalization.
V4-Pro-Base's dramatic jump in SimpleQA-Verified (55.2% vs V3.2's 28.3%) reflects the data scale increase combined with improved data curation — the model has simply seen more of the world's knowledge.
The Muon Optimizer
DeepSeek replaced the standard AdamW optimizer with Muon for V4.
What Muon Does
Standard Adam-based optimizers update parameters using per-coordinate estimates of gradient direction and magnitude. Muon adds an orthogonalization step: the momentum-accumulated update for each weight matrix is orthogonalized (via a Newton-Schulz iteration) before it is applied, so that no single direction dominates the step.
The result:
- Faster convergence: More useful information is extracted from each training step
- Greater stability: Orthogonalized updates are less likely to cause oscillation or divergence
- Better scaling: Muon's stability properties are particularly valuable at V4's scale (1.6T parameters)
Think of it as more efficient exploration of parameter space: orthogonalization spreads each update across many directions instead of letting a few dominant ones absorb most of the step.
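To make the orthogonalization step concrete, here is a minimal sketch of a Muon-style update for a single 2-D weight matrix, following the publicly described Muon recipe (SGD momentum followed by Newton-Schulz orthogonalization). The learning rate, momentum coefficient, and iteration constants below are illustrative assumptions; DeepSeek's exact V4 configuration is not public, and scaling refinements used in practice are omitted.

```python
# Minimal sketch of a Muon-style update for one 2-D weight matrix.
# Hyperparameters are illustrative, not DeepSeek's actual settings.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)               # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize, apply."""
    momentum_buf.mul_(beta).add_(grad)       # standard SGD momentum
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)           # orthogonalized step
    return weight, momentum_buf
```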
Architectural Innovations During Pre-Training
Hybrid Attention Architecture (CSA + HCA)
Unlike V3.2, which used Multi-head Latent Attention (MLA), V4 was pre-trained from scratch with the Hybrid Attention Architecture. This means the model's internal representations are shaped by the CSA + HCA mechanism from the very beginning, rather than retrofitted onto an older architecture.
This is why V4 handles 1M-token context more naturally: the attention patterns learned during pre-training are optimized for the hierarchical compression structure.
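The details of CSA and HCA are not reproduced here, so the toy sketch below only illustrates the general idea of hierarchical compression for long context: recent tokens get full attention, while older tokens are reached through pooled summaries. It should not be read as DeepSeek's actual mechanism; the window and block sizes are arbitrary assumptions.

```python
# Illustrative only: a toy example of hierarchically compressed long-context
# attention. This is NOT DeepSeek's CSA + HCA; it just shows the general idea
# of attending to pooled "summary" tokens for distant context.
import torch
import torch.nn.functional as F

def hierarchical_attention(q, k, v, local_window=1024, block=64):
    """q: (1, d); k, v: (T, d). Full attention over the recent window,
    attention over mean-pooled blocks for everything older."""
    T, d = k.shape
    split = max(T - local_window, 0)
    k_old, v_old = k[:split], v[:split]
    k_new, v_new = k[split:], v[split:]

    if split > 0:
        n_blocks = split // block
        k_sum = k_old[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
        v_sum = v_old[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
        keys = torch.cat([k_sum, k_new], dim=0)
        vals = torch.cat([v_sum, v_new], dim=0)
    else:
        keys, vals = k_new, v_new

    scores = (q @ keys.T) / d ** 0.5
    return F.softmax(scores, dim=-1) @ vals
```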
Manifold-Constrained Hyper-Connections (mHC)
mHC replaces standard residual connections throughout the network. Pre-training with mHC from the start means the model's weight matrices develop within a structurally constrained space that promotes stable signal propagation.
The practical effect: training a 1.6T-parameter model with 32T+ tokens is feasible without catastrophic instabilities that plague attempts to scale standard architectures to this size.
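The mHC formulation itself is not spelled out above. As a rough illustration of the hyper-connection idea (several parallel residual streams combined through a constrained, learnable mixing matrix), here is a toy sketch; the number of streams and the row-normalization used as the "constraint" are assumptions, and the actual mHC construction in V4 may differ substantially.

```python
# Toy illustration of hyper-connection-style residuals: several parallel
# residual streams mixed by a learnable matrix, with a simple normalization
# standing in for the "manifold constraint". Not the actual mHC formulation.
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    def __init__(self, d_model: int, n_streams: int = 4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_streams))  # learnable stream mixing
        self.layer = nn.Linear(d_model, d_model)       # stand-in for attention/FFN

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, seq, d_model)
        # Normalize each row of the mixing matrix so every output stream is a
        # bounded combination of the input streams (keeps signal scale stable).
        mix = self.mix / self.mix.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        mixed = torch.einsum("ij,jsd->isd", mix, streams)
        # Apply the sublayer and add it back as a residual update on each stream.
        return mixed + self.layer(mixed)
```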
Post-Training: The Two-Stage Pipeline
Stage 1: Independent Expert Cultivation
The MoE architecture's individual experts are trained independently for domain specialization:
Supervised Fine-Tuning (SFT):
- High-quality labeled examples in each expert's domain
- Teaches the model to follow instructions accurately in each specialty
- Covers coding, mathematics, science, language, general knowledge, safety
Reinforcement Learning with GRPO:
- Group Relative Policy Optimization rewards the model for generating better responses relative to a group of samples
- Applied independently to each domain/expert
- Shapes expert behavior toward preferred outputs without training a separate value (critic) network (see the sketch below)
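For readers who want the mechanics, here is a minimal sketch of the group-relative objective GRPO is built around, as described in DeepSeek's earlier GRPO work. The reward function, sampling setup, KL regularization toward a reference policy, and any V4-specific modifications are omitted and would be assumptions.

```python
# Minimal sketch of the group-relative advantage and clipped surrogate at the
# heart of GRPO (Group Relative Policy Optimization). Simplified; details of
# DeepSeek's V4 recipe are not public.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (G,) scores for G sampled responses to the same prompt.
    Each response's advantage is its reward relative to the group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, but with group-relative advantages
    replacing a learned value (critic) baseline."""
    ratio = torch.exp(logp_new - logp_old)   # per-response importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```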
Stage 2: Unified Model Consolidation
After stage 1, the independently trained experts are integrated into a unified model through on-policy distillation:
- The stage-1 specialized model generates outputs on diverse tasks
- The final model is trained to match (distill) these outputs
- The routing mechanism learns to activate the right experts for each task
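The exact consolidation recipe is not spelled out above. A common form of on-policy distillation minimizes a KL divergence between the student's and the teacher's per-token distributions on sequences the student itself generated; the sketch below assumes that setup, and the model interfaces are illustrative.

```python
# Sketch of an on-policy distillation loss: the student's per-token
# distribution is pulled toward the teacher's on tokens the student sampled.
# This is one standard formulation, assumed here; V4's exact consolidation
# procedure is not fully specified.
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Both tensors: (seq_len, vocab). Reverse KL(student || teacher),
    computed on student-generated tokens."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```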
This consolidation phase is what gives V4-Pro its unusual depth of capability across very different domains: each expert is genuinely specialized, and the router has learned to invoke them appropriately.
Hardware: Huawei Ascend 950PR
One of the most significant facts about V4's training is the hardware:
V4 was trained on Huawei Ascend 950PR chips — not NVIDIA's A100s or H100s.
This has several implications:
Technical: The Huawei Ascend 950PR is a high-performance AI accelerator with competitive training throughput for large-scale models. V4's results demonstrate that frontier AI training is achievable on this hardware.
Geopolitical: US export restrictions limit Chinese companies' access to NVIDIA's most advanced chips. DeepSeek's success training V4 on Ascend hardware demonstrates that China's domestic AI chip capabilities are higher than many assumed.
Strategic: By building on domestic hardware, DeepSeek (and by extension, China's AI ecosystem) reduces dependency on US-controlled supply chains for frontier AI development.
Post-Training Alignment
After the two-stage post-training pipeline, V4 undergoes safety-focused alignment tuning:
- Additional SFT examples covering safety-relevant scenarios
- Constitutional-style guidelines baked into instruction following
- Multi-language safety alignment across V4's supported languages
The exact scope of DeepSeek's safety post-training is not fully documented in the public technical report, but standard industry practice (and DeepSeek's track record with previous models) suggests comprehensive coverage of common harmful use cases.
Training Cost: The Efficiency Story
DeepSeek has previously been celebrated for achieving frontier results at dramatically lower reported training costs than western competitors. V4's training cost has not been officially disclosed, but several factors suggest continued efficiency advantages:
- Muon optimizer: Fewer wasted gradient steps
- mHC stability: Less compute lost to training instability
- MoE sparsity: Only 49B active parameters per token, not 1.6T
- Ascend 950PR optimization: Purpose-built for this type of training
The combination of architectural and optimizer improvements means V4 extracts more capability per training FLOP than previous approaches.
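As a rough illustration of what the MoE sparsity buys, the standard 2 × (active parameters) rule of thumb for forward-pass FLOPs per token gives the figures below; this approximation ignores attention costs and routing overhead.

```python
# Back-of-the-envelope arithmetic for the MoE sparsity claim.
total_params = 1.6e12     # 1.6T total parameters
active_params = 49e9      # ~49B activated per token

active_fraction = active_params / total_params
flops_per_token = 2 * active_params                 # forward pass only

print(f"active fraction: {active_fraction:.1%}")        # ~3.1%
print(f"forward FLOPs per token: {flops_per_token:.2e}")  # ~9.8e+10
```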
From V3.2 to V4: What Changed in Training
| Training Aspect | V3.2 | V4 |
|---|---|---|
| Optimizer | AdamW variant | Muon |
| Residual connections | Standard | mHC |
| Attention mechanism | MLA | Hybrid (CSA + HCA) |
| Pre-training tokens | ~18T | 32T+ |
| Post-training pipeline | SFT + RL | Two-stage: specialization + consolidation |
| Hardware | NVIDIA (H800 equivalent) | Huawei Ascend 950PR |
Implications for the Field
V4's training methodology, particularly the Muon optimizer, mHC, and the two-stage post-training pipeline, is openly documented in the technical report and available for the research community to study and replicate. DeepSeek's transparency here reflects its research-first culture.
Platforms like Framia.pro that integrate frontier AI capabilities benefit from this knowledge-sharing culture: as these training techniques get replicated and refined across the ecosystem, the quality ceiling for AI models continues to rise, improving every downstream application.
Conclusion
DeepSeek V4 was built through a combination of unprecedented data scale (32T+ tokens), architectural innovation (Hybrid Attention, mHC), optimizer improvements (Muon), and a novel two-stage post-training pipeline. The result is a model that achieves frontier-class performance on a domestic Chinese hardware stack — a landmark achievement that establishes V4 as both a technical and strategic milestone in AI development.