DeepSeek V4 Training: How the Model Was Built
Understanding how DeepSeek V4 was trained provides insight into why it performs the way it does, and which architectural and data decisions produced a model that exceeds expectations on coding, reasoning, and long-context tasks. This guide covers V4's pre-training, post-training, and the key innovations that distinguish it from its predecessors.
Pre-Training: Scale and Data
Dataset Size: 32 Trillion+ Tokens
Both V4-Pro and V4-Flash were pre-trained on more than 32 trillion tokens of diverse, high-quality data. This is nearly double DeepSeek-V3's estimated 18T pre-training tokens, representing a substantial investment in training compute and data curation.
The training corpus spans:
- Natural language — web text, books, articles in dozens of languages
- Code — source code across all major programming languages
- Mathematics — formal proofs, competition problems, textbooks
- Scientific literature — research papers across STEM disciplines
- Multilingual content — coverage across dozens of languages (reflected in V4's 90.3% MMMLU score)
Why More Data Matters
The relationship between training data scale and model capability isn't linear — but at frontier scales, more high-quality, diverse data consistently improves knowledge breadth, factual accuracy, and generalization.
V4-Pro-Base's dramatic jump in SimpleQA-Verified (55.2% vs V3.2's 28.3%) reflects the data scale increase combined with improved data curation — the model has simply seen more of the world's knowledge.
The Muon Optimizer
DeepSeek replaced the standard AdamW optimizer with Muon for V4.
What Muon Does
Standard Adam-based optimizers update parameters using per-coordinate estimates of gradient direction and magnitude. Muon adds an orthogonalization step: the momentum-accumulated update for each weight matrix is orthogonalized (via a Newton-Schulz iteration) before it is applied, so that no single direction dominates the step.
The result:
- Faster convergence: More useful information is extracted from each training step
- Greater stability: Orthogonalized updates are less likely to cause oscillation or divergence
- Better scaling: Muon's stability properties are particularly valuable at V4's scale (1.6T parameters)
Think of it as more efficient exploration of parameter space: orthogonalization spreads each update across many directions instead of letting a few dominant ones absorb most of the step.
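To make the orthogonalization step concrete, here is a minimal sketch of a Muon-style update for a single 2-D weight matrix, following the publicly described Muon recipe (SGD momentum followed by Newton-Schulz orthogonalization). The learning rate, momentum coefficient, and iteration constants below are illustrative assumptions; DeepSeek's exact V4 configuration is not public, and scaling refinements used in practice are omitted.

```python
# Minimal sketch of a Muon-style update for one 2-D weight matrix.
# Hyperparameters are illustrative, not DeepSeek's actual settings.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)               # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize, apply."""
    momentum_buf.mul_(beta).add_(grad)       # standard SGD momentum
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)           # orthogonalized step
    return weight, momentum_buf
```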
Architectural Innovations During Pre-Training
Hybrid Attention Architecture (CSA + HCA)
Unlike V3.2, which used Multi-head Latent Attention (MLA), V4 was pre-trained from scratch with the Hybrid Attention Architecture. This means the model's internal representations are shaped by the CSA + HCA mechanism from the very beginning, rather than retrofitted onto an older architecture.
This is why V4 handles 1M-token context more naturally: the attention patterns learned during pre-training are optimized for the hierarchical compression structure.
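The details of CSA and HCA are not reproduced here, so the toy sketch below only illustrates the general idea of hierarchical compression for long context: recent tokens get full attention, while older tokens are reached through pooled summaries. It should not be read as DeepSeek's actual mechanism; the window and block sizes are arbitrary assumptions.

```python
# Illustrative only: a toy example of hierarchically compressed long-context
# attention. This is NOT DeepSeek's CSA + HCA; it just shows the general idea
# of attending to pooled "summary" tokens for distant context.
import torch
import torch.nn.functional as F

def hierarchical_attention(q, k, v, local_window=1024, block=64):
    """q: (1, d); k, v: (T, d). Full attention over the recent window,
    attention over mean-pooled blocks for everything older."""
    T, d = k.shape
    split = max(T - local_window, 0)
    k_old, v_old = k[:split], v[:split]
    k_new, v_new = k[split:], v[split:]

    if split > 0:
        n_blocks = split // block
        k_sum = k_old[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
        v_sum = v_old[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
        keys = torch.cat([k_sum, k_new], dim=0)
        vals = torch.cat([v_sum, v_new], dim=0)
    else:
        keys, vals = k_new, v_new

    scores = (q @ keys.T) / d ** 0.5
    return F.softmax(scores, dim=-1) @ vals
```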
Manifold-Constrained Hyper-Connections (mHC)
mHC replaces standard residual connections throughout the network. Pre-training with mHC from the start means the model's weight matrices develop within a structurally constrained space that promotes stable signal propagation.
The practical effect: training a 1.6T-parameter model with 32T+ tokens is feasible without catastrophic instabilities that plague attempts to scale standard architectures to this size.
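The mHC formulation itself is not spelled out above. As a rough illustration of the hyper-connection idea (several parallel residual streams combined through a constrained, learnable mixing matrix), here is a toy sketch; the number of streams and the row-normalization used as the "constraint" are assumptions, and the actual mHC construction in V4 may differ substantially.

```python
# Toy illustration of hyper-connection-style residuals: several parallel
# residual streams mixed by a learnable matrix, with a simple normalization
# standing in for the "manifold constraint". Not the actual mHC formulation.
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    def __init__(self, d_model: int, n_streams: int = 4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_streams))  # learnable stream mixing
        self.layer = nn.Linear(d_model, d_model)       # stand-in for attention/FFN

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, seq, d_model)
        # Normalize each row of the mixing matrix so every output stream is a
        # bounded combination of the input streams (keeps signal scale stable).
        mix = self.mix / self.mix.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        mixed = torch.einsum("ij,jsd->isd", mix, streams)
        # Apply the sublayer and add it back as a residual update on each stream.
        return mixed + self.layer(mixed)
```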
Post-Training: The Two-Stage Pipeline
Stage 1: Independent Expert Cultivation
The MoE architecture's individual experts are trained independently for domain specialization:
Supervised Fine-Tuning (SFT):
- High-quality labeled examples in each expert's domain
- Teaches the model to follow instructions accurately in each specialty
- Covers coding, mathematics, science, language, general knowledge, safety
Reinforcement Learning with GRPO:
- Group Relative Policy Optimization rewards the model for generating better responses relative to a group of samples
- Applied independently to each domain/expert
- Shapes expert behavior toward preferred outputs without training a separate value (critic) network (see the sketch below)
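For readers who want the mechanics, here is a minimal sketch of the group-relative objective GRPO is built around, as described in DeepSeek's earlier GRPO work. The reward function, sampling setup, KL regularization toward a reference policy, and any V4-specific modifications are omitted and would be assumptions.

```python
# Minimal sketch of the group-relative advantage and clipped surrogate at the
# heart of GRPO (Group Relative Policy Optimization). Simplified; details of
# DeepSeek's V4 recipe are not public.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (G,) scores for G sampled responses to the same prompt.
    Each response's advantage is its reward relative to the group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, but with group-relative advantages
    replacing a learned value (critic) baseline."""
    ratio = torch.exp(logp_new - logp_old)   # per-response importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```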
Stage 2: Unified Model Consolidation
After stage 1, the independently trained experts are integrated into a unified model through on-policy distillation:
- The stage-1 specialized model generates outputs on diverse tasks
- The final model is trained to match (distill) these outputs
- The routing mechanism learns to activate the right experts for each task
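The exact consolidation recipe is not spelled out above. A common form of on-policy distillation minimizes a KL divergence between the student's and the teacher's per-token distributions on sequences the student itself generated; the sketch below assumes that setup, and the model interfaces are illustrative.

```python
# Sketch of an on-policy distillation loss: the student's per-token
# distribution is pulled toward the teacher's on tokens the student sampled.
# This is one standard formulation, assumed here; V4's exact consolidation
# procedure is not fully specified.
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Both tensors: (seq_len, vocab). Reverse KL(student || teacher),
    computed on student-generated tokens."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```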
This consolidation phase is what gives V4-Pro its unusual depth of capability across very different domains: each expert is genuinely specialized, and the router has learned to invoke them appropriately.
Hardware: Huawei Ascend 950PR
One of the most significant facts about V4's training is the hardware:
V4 was trained on Huawei Ascend 950PR chips — not NVIDIA's A100s or H100s.
This has several implications:
Technical: The Huawei Ascend 950PR is a high-performance AI accelerator with competitive training throughput for large-scale models. V4's results demonstrate that frontier AI training is achievable on this hardware.
Geopolitical: US export restrictions limit Chinese companies' access to NVIDIA's most advanced chips. DeepSeek's success training V4 on Ascend hardware demonstrates that China's domestic AI chip capabilities are higher than many assumed.
Strategic: By building on domestic hardware, DeepSeek (and by extension, China's AI ecosystem) reduces dependency on US-controlled supply chains for frontier AI development.
Post-Training Alignment
After the two-stage post-training pipeline, V4 undergoes safety-focused alignment tuning:
- Additional SFT examples covering safety-relevant scenarios
- Constitutional-style guidelines baked into instruction following
- Multi-language safety alignment across V4's supported languages
The exact scope of DeepSeek's safety post-training is not fully documented in the public technical report, but standard industry practice (and DeepSeek's track record with previous models) suggests comprehensive coverage of common harmful use cases.
Training Cost: The Efficiency Story
DeepSeek has previously been celebrated for achieving frontier results at dramatically lower reported training costs than western competitors. V4's training cost has not been officially disclosed, but several factors suggest continued efficiency advantages:
- Muon optimizer: Fewer wasted gradient steps
- mHC stability: Less compute lost to training instability
- MoE sparsity: Only 49B active parameters per token, not 1.6T
- Ascend 950PR optimization: Purpose-built for this type of training
The combination of architectural and optimizer improvements means V4 extracts more capability per training FLOP than previous approaches.
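As a rough illustration of what the MoE sparsity buys, the standard 2 × (active parameters) rule of thumb for forward-pass FLOPs per token gives the figures below; this approximation ignores attention costs and routing overhead.

```python
# Back-of-the-envelope arithmetic for the MoE sparsity claim.
total_params = 1.6e12     # 1.6T total parameters
active_params = 49e9      # ~49B activated per token

active_fraction = active_params / total_params
flops_per_token = 2 * active_params                 # forward pass only

print(f"active fraction: {active_fraction:.1%}")        # ~3.1%
print(f"forward FLOPs per token: {flops_per_token:.2e}")  # ~9.8e+10
```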
From V3.2 to V4: What Changed in Training
| Training Aspect | V3.2 | V4 |
|---|---|---|
| Optimizer | AdamW variant | Muon |
| Residual connections | Standard | mHC |
| Attention mechanism | MLA | Hybrid (CSA + HCA) |
| Pre-training tokens | ~18T | 32T+ |
| Post-training pipeline | SFT + RL | Two-stage: specialization + consolidation |
| Hardware | NVIDIA (H800 equivalent) | Huawei Ascend 950PR |
Implications for the Field
V4's training methodology, particularly the Muon optimizer, mHC, and the two-stage post-training pipeline, is openly documented in the technical report and available for the research community to study and replicate. DeepSeek's transparency here reflects its research-first culture.
Platforms like Framia.pro that integrate frontier AI capabilities benefit from this knowledge-sharing culture: as these training techniques get replicated and refined across the ecosystem, the quality ceiling for AI models continues to rise, improving every downstream application.
Conclusion
DeepSeek V4 was built through a combination of unprecedented data scale (32T+ tokens), architectural innovation (Hybrid Attention, mHC), optimizer improvements (Muon), and a novel two-stage post-training pipeline. The result is a model that achieves frontier-class performance on a domestic Chinese hardware stack — a landmark achievement that establishes V4 as both a technical and strategic milestone in AI development.