DeepSeek V4 Model Architecture: Hybrid Attention, mHC, and MoE Explained
DeepSeek V4 is not just a bigger version of V3. It introduces a set of fundamental architectural changes that dramatically improve efficiency, especially for long-context workloads. If you want to understand how DeepSeek V4 handles 1 million tokens as its default context, and does so with far less compute than its predecessors, this guide walks you through every major innovation.
Overview: Four Core Architectural Pillars
- Mixture of Experts (MoE) — sparse activation for compute efficiency
- Hybrid Attention Architecture (CSA + HCA) — the key to 1M-token efficiency
- Manifold-Constrained Hyper-Connections (mHC) — stable signal propagation
- Muon Optimizer — faster, more stable training
Let's dig into each one.
1. Mixture of Experts (MoE)
DeepSeek V4 uses an MoE architecture across both Pro (1.6T / 49B active) and Flash (284B / 13B active) variants. MoE works by splitting the model's feed-forward layers into many specialized "experts," with a trainable router selecting the most relevant experts for each token.
Why it matters: You get the knowledge capacity of a model with hundreds of billions or trillions of parameters, but only activate a small subset of them per token. Inference cost scales with active parameters, not total parameters — making MoE dramatically more compute-efficient than equivalent dense models.
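As a concrete picture of that routing, here is a minimal top-k MoE layer in PyTorch. The expert count, hidden sizes, and k value are illustrative placeholders, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (toy sizes)."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # trainable router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the k selected experts run per token, so compute scales with
        # active parameters, not total parameters.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```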
DeepSeek's post-training pipeline adds a distinctive two-stage approach:
- Stage 1: Independent expert specialization via SFT and RL with GRPO
- Stage 2: Unified model consolidation via on-policy distillation — merging all specialized expertise into a single coherent model
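Stage 2's on-policy distillation can be pictured as the student sampling its own outputs while being nudged toward the teacher's token distributions. The sketch below assumes HuggingFace-style model interfaces (`generate`, `.logits`) and a generic reverse-KL objective; it illustrates the common on-policy distillation recipe, not DeepSeek's published pipeline:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student, teacher, prompt_ids, max_new=64):
    """One on-policy distillation step (a generic sketch).

    The student samples its own continuation (on-policy), then we minimize
    KL(student || teacher) over next-token distributions at each position.
    """
    with torch.no_grad():
        seq = student.generate(prompt_ids, max_new_tokens=max_new,
                               do_sample=True)
    student_logits = student(seq).logits[:, :-1]     # position t predicts t+1
    with torch.no_grad():
        teacher_logits = teacher(seq).logits[:, :-1]
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL concentrates the student on modes the teacher endorses.
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```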
2. Hybrid Attention Architecture: CSA + HCA
This is DeepSeek V4's most significant innovation, and the reason 1 million tokens is now the default context length.
The Problem with Standard Attention at Long Context
Standard transformer attention scales quadratically with sequence length in compute, while the KV cache grows linearly in memory. At 1 million tokens, both become prohibitive: the attention score matrix alone has a trillion entries per head per layer, and the cache for a single sequence outgrows accelerator memory.
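A quick back-of-the-envelope calculation shows the scale of the problem (the layer count, head count, and head dimension below are illustrative, not any real model's configuration):

```python
# Why full attention breaks down at 1M tokens (illustrative numbers).
seq_len = 1_000_000

# Attention scores: quadratic in sequence length.
scores_per_head = seq_len ** 2                 # 1e12 entries per head, per layer

# KV cache: linear in sequence length, but still enormous at this scale.
n_layers, n_kv_heads, head_dim, bytes_fp16 = 60, 8, 128, 2
kv_bytes = 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_fp16
print(f"{scores_per_head:.0e} attention scores per head per layer")   # 1e+12
print(f"KV cache ~= {kv_bytes / 2**30:.0f} GiB for one sequence")     # ~229 GiB
```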
DeepSeek's Solution: Two Complementary Attention Mechanisms
Compressed Sparse Attention (CSA)
- Applies token-wise compression, reducing the number of key-value pairs that need to be stored and retrieved
- Allows the model to efficiently access distant context without storing the full sequence at full resolution
Heavily Compressed Attention (HCA)
- Goes further, applying aggressive compression to tokens that are very distant from the current position
- Essentially tells the model: "for tokens far back in history, store a highly compressed summary — don't try to remember every detail"
Together, CSA and HCA create a tiered memory system: recent tokens get full attention, somewhat distant tokens get compressed attention, and very distant tokens get heavily compressed attention. This loosely mirrors how human memory keeps recent events in sharp detail and older ones as summaries.
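A minimal sketch of the tiered idea, assuming compression is simple mean-pooling over blocks of cached vectors; the tier boundaries and pooling factors are my illustrative assumptions, since DeepSeek has not published CSA/HCA internals in this form:

```python
import torch

def tiered_kv(kv, recent=4096, mid=65536, mid_pool=8, far_pool=64):
    """Compress a (seq, dim) cache into three tiers (illustrative scheme):
    recent tokens kept at full resolution, mid-range tokens mean-pooled in
    blocks of `mid_pool`, and everything older pooled in blocks of `far_pool`.
    """
    def pool(x, block):
        if x.shape[0] == 0:
            return x
        pad = (-x.shape[0]) % block                    # pad to a block multiple
        if pad:
            x = torch.cat([x, x[-1:].expand(pad, x.shape[-1])])
        return x.reshape(-1, block, x.shape[-1]).mean(dim=1)

    seq = kv.shape[0]
    far_end = max(seq - recent - mid, 0)
    near_start = max(seq - recent, 0)
    return torch.cat([pool(kv[:far_end], far_pool),
                      pool(kv[far_end:near_start], mid_pool),
                      kv[near_start:]])
```

With these placeholder settings, a 1,000,000-entry cache shrinks to about 27,000 entries (4,096 full-resolution + 65,536/8 + roughly 930K/64), illustrating how aggressive far-context pooling dominates the savings.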
The Result: Spectacular Efficiency Gains
In a 1M-token context scenario:
- V4-Pro requires only 27% of single-token inference FLOPs vs V3.2
- V4-Pro requires only 10% of KV cache memory vs V3.2
That's roughly a 3.7× reduction in compute and a 10× reduction in memory, enabling 1M-token context on hardware that could never have served V3.2 at that length.
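The arithmetic behind those headline numbers is just the reciprocal of the published ratios:

```python
# V4-Pro vs. V3.2 at 1M-token context (ratios from the published figures).
flops_ratio, kv_ratio = 0.27, 0.10
print(f"compute reduction:  {1 / flops_ratio:.1f}x")   # ~3.7x
print(f"KV-cache reduction: {1 / kv_ratio:.0f}x")      # 10x
```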
3. Manifold-Constrained Hyper-Connections (mHC)
As models scale to trillions of parameters across hundreds of layers, a common failure mode is gradient degradation — signals becoming too weak or too noisy to propagate effectively through deep networks.
DeepSeek's solution is mHC (Manifold-Constrained Hyper-Connections), which generalizes the conventional residual connection into several parallel streams combined by learnable mixing matrices, and constrains those matrices to a well-behaved manifold so the combination stays close to identity-preserving. In plain terms, mHC:
- Strengthens the residual pathway between transformer layers
- Stabilizes how signals flow through the network's depth
- Preserves model expressivity while preventing exploding or vanishing gradients
The practical effect: the 1.6T-parameter V4-Pro can be trained reliably at a scale that would destabilize most other architectures.
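The exact formulation is DeepSeek's, but the mechanism can be sketched. The version below widens the residual stream into n parallel copies mixed by a learnable matrix, and projects that matrix toward the doubly stochastic manifold with Sinkhorn-style normalization so the mixing stays close to mass-preserving. The stream count, the specific projection, and the toy inner block are illustrative assumptions:

```python
import torch
import torch.nn as nn

def sinkhorn(m, iters=5):
    """Project a matrix toward the doubly stochastic manifold by
    alternating row/column normalization (Sinkhorn-style)."""
    m = m.exp()                                  # ensure positivity
    for _ in range(iters):
        m = m / m.sum(dim=1, keepdim=True)       # rows sum to ~1
        m = m / m.sum(dim=0, keepdim=True)       # columns sum to ~1
    return m

class ManifoldHyperConnection(nn.Module):
    """Residual stream widened to n copies, mixed by a constrained matrix,
    so signals neither blow up nor vanish with depth (a sketch of the idea,
    not DeepSeek's exact formulation)."""
    def __init__(self, d_model, n=4):
        super().__init__()
        self.mix = nn.Parameter(torch.zeros(n, n))   # exp(0)=1 -> uniform start
        self.block = nn.Linear(d_model, d_model)     # stand-in for a sublayer

    def forward(self, streams):                      # streams: (n, batch, d)
        mixed = torch.einsum("ij,jbd->ibd", sinkhorn(self.mix), streams)
        update = self.block(mixed.mean(dim=0))       # the sublayer's output
        return mixed + update.unsqueeze(0)           # broadcast back to all n
```

Because a doubly stochastic mix neither amplifies nor attenuates total signal mass, stacking hundreds of such layers stays numerically tame.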
4. Muon Optimizer
DeepSeek V4 replaces the standard AdamW optimizer with Muon (MomentUm Orthogonalized by Newton-Schulz). Muon:
- Orthogonalizes each weight matrix's momentum-accumulated update via a Newton-Schulz iteration, preventing redundant updates along correlated directions
- Achieves faster convergence — the model learns more from each training step
- Provides greater training stability — particularly important at the 32T+ token pre-training scale
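A condensed sketch of the core update, following the public Muon reference implementation (the quintic coefficients and default learning rate come from that reference; shape handling and distributed details are stripped):

```python
import torch

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2D update with a Newton-Schulz
    iteration (quintic coefficients from the public Muon reference)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)                 # control the spectral scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize, apply."""
    momentum_buf.mul_(beta).add_(grad)                   # momentum accumulation
    update = newton_schulz_orthogonalize(
        grad + beta * momentum_buf)                      # Nesterov-style blend
    weight.add_(update, alpha=-lr)                       # orthogonalized step
```

Reference implementations typically apply Muon only to 2D weight matrices and fall back to AdamW for embeddings, norms, and scalars; that split is omitted here.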
Both V4-Pro and V4-Flash were pre-trained on more than 32 trillion diverse, high-quality tokens using Muon, giving the models exceptional coverage of world knowledge, code, math, and multilingual text.
Three Reasoning Effort Modes: Architecture Meets Inference
The architecture enables a flexible three-mode inference system:
| Mode | Behavior | Use Case |
|---|---|---|
| Non-think | No explicit chain-of-thought | Fast queries, simple tasks |
| Think High | Controlled chain-of-thought | Complex reasoning, planning |
| Think Max | Extended, exhaustive reasoning | Competition math, frontier coding |
Think Max needs at least a 384K-token context window to work well, since the model requires room for its full reasoning trace; that fits comfortably within V4's 1M-token limit.
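In practice, mode selection would surface as a request parameter. The snippet below is purely illustrative: the endpoint, model id, and reasoning_effort field are hypothetical placeholders, not a documented DeepSeek API:

```python
import requests

# Hypothetical request shape -- endpoint, model id, and the
# "reasoning_effort" field are illustrative placeholders.
resp = requests.post(
    "https://api.example.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "deepseek-v4-pro",
        "reasoning_effort": "think_max",   # or "non_think" / "think_high"
        "messages": [{"role": "user",
                      "content": "Prove that sqrt(2) is irrational."}],
    },
    timeout=120,
)
print(resp.json())
```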
How It Compares to DeepSeek V3.2's Architecture
DeepSeek-V3.2 used 671B total / 37B active parameters and a different attention scheme (Multi-head Latent Attention with DeepSeek Sparse Attention). Moving to V4:
- Total params more than doubled (671B → 1.6T for Pro)
- Active params grew from 37B → 49B
- KV cache reduced by 10× for 1M-token context
- Compute per token reduced by ~73%
- New optimizer (Muon vs. AdamW variant)
- New training pipeline (two-stage expert consolidation)
For platforms like Framia.pro that power AI agents at scale, architectural efficiency improvements like these translate directly into lower costs, faster responses, and more capable creative workflows.
Conclusion
DeepSeek V4's architecture is a carefully engineered combination of MoE sparsity, hybrid attention compression, manifold-constrained residual connections, and an advanced optimizer. Together, these innovations make 1-million-token context not just theoretically possible, but practically default — at a cost that makes it accessible to developers, researchers, and enterprises worldwide.