# DeepSeek V4 Model Card: Full Technical Reference for Developers
The DeepSeek V4 model card consolidates everything a developer needs to understand and deploy the V4 series. This reference covers the full technical specifications, access methods, known limitations, and usage guidelines for both V4-Pro and V4-Flash.
## Model Identity

| Field | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
|---|---|---|
| Model ID | deepseek-v4-pro | deepseek-v4-flash |
| Developer | DeepSeek-AI (Hangzhou DeepSeek Artificial Intelligence Co., Ltd.) | DeepSeek-AI (Hangzhou DeepSeek Artificial Intelligence Co., Ltd.) |
| Release Date | April 24, 2026 (Preview) | April 24, 2026 (Preview) |
| License | MIT License | MIT License |
| Model Type | Decoder-only Transformer, MoE | Decoder-only Transformer, MoE |
| Architecture | Hybrid Attention (CSA + HCA) + mHC | Hybrid Attention (CSA + HCA) + mHC |
| Total Parameters | 1.6T | 284B |
| Active Parameters | 49B | 13B |
| Context Length | 1,000,000 tokens | 1,000,000 tokens |
| Precision | FP4 + FP8 Mixed | FP4 + FP8 Mixed |
| Download Size | ~865 GB | ~160 GB |
## HuggingFace Repository Map

| Repository | Type | URL |
|---|---|---|
| DeepSeek-V4-Pro | Instruct (RLHF-tuned) | huggingface.co/deepseek-ai/DeepSeek-V4-Pro |
| DeepSeek-V4-Pro-Base | Pre-trained base | huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base |
| DeepSeek-V4-Flash | Instruct (RLHF-tuned) | huggingface.co/deepseek-ai/DeepSeek-V4-Flash |
| DeepSeek-V4-Flash-Base | Pre-trained base | huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base |
## API Reference

### Endpoints

- Base URL: `https://api.deepseek.com/v1`
- Chat Completions: `POST /chat/completions`
- Compatible formats: OpenAI ChatCompletions API, Anthropic Messages API

### Model Names (API)

- `deepseek-v4-pro`: full-capability flagship
- `deepseek-v4-flash`: fast and cost-efficient
- ⚠️ Deprecated (retiring July 24, 2026): `deepseek-chat`, `deepseek-reasoner`
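Because the endpoint is OpenAI-compatible, the standard OpenAI Python client can be pointed at it directly. A minimal sketch, assuming your key is in a `DEEPSEEK_API_KEY` environment variable:

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at DeepSeek's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize the V4 hybrid attention design."}],
)
print(response.choices[0].message.content)
```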
### Pricing

| Model | Input | Output |
|---|---|---|
| deepseek-v4-flash | $0.14 / 1M tokens | $0.28 / 1M tokens |
| deepseek-v4-pro | $1.74 / 1M tokens | $3.48 / 1M tokens |
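To make the rates concrete, here is a minimal cost estimator built from the table above; the token counts in the example are illustrative placeholders, and any cache or batch discounts are ignored:

```python
# Per-million-token rates from the pricing table above (USD).
RATES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a single request at list price."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Example: a 200K-token context with a 2K-token reply on each model.
print(estimate_cost("deepseek-v4-flash", 200_000, 2_000))  # ~$0.0286
print(estimate_cost("deepseek-v4-pro", 200_000, 2_000))    # ~$0.3550
```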
## Architecture Details

### Hybrid Attention System

| Layer Type | Mechanism | Purpose |
|---|---|---|
| Recent token layers | Standard attention | Full fidelity for nearby context |
| Mid-range token layers | Compressed Sparse Attention (CSA) | Efficient access to moderate-distance context |
| Long-range token layers | Heavily Compressed Attention (HCA) | Compact representation of distant history |

Efficiency vs V3.2 at 1M context:

- FLOPs: 27% of V3.2 (a 73% reduction)
- KV cache: 10% of V3.2 (a 90% reduction)
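The table describes tiering by token distance, but no reference implementation or tier boundaries have been published. A minimal illustrative sketch of the idea, with hypothetical cutoff distances that are not from DeepSeek:

```python
from enum import Enum

class AttentionTier(Enum):
    STANDARD = "standard"  # full attention over recent tokens
    CSA = "csa"            # Compressed Sparse Attention for mid-range tokens
    HCA = "hca"            # Heavily Compressed Attention for distant tokens

# Hypothetical boundaries for illustration only; the real cutoffs are undocumented.
RECENT_WINDOW = 8_192
MID_RANGE_WINDOW = 131_072

def tier_for_distance(distance: int) -> AttentionTier:
    """Select the attention mechanism applied to a key/value token at a
    given distance behind the current query position."""
    if distance <= RECENT_WINDOW:
        return AttentionTier.STANDARD
    if distance <= MID_RANGE_WINDOW:
        return AttentionTier.CSA
    return AttentionTier.HCA
```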
### Training Innovations

| Innovation | Description |
|---|---|
| Optimizer | Muon (replaces AdamW) |
| Residual connections | mHC (Manifold-Constrained Hyper-Connections) |
| Pre-training data | 32T+ diverse tokens |
| Post-training Stage 1 | Expert specialization via SFT + RL (GRPO) |
| Post-training Stage 2 | Unified consolidation via on-policy distillation |
### Inference Modes

| Mode | API Parameter | Thinking Budget | Context Requirement |
|---|---|---|---|
| Non-think | `"thinking": {"type": "disabled"}` | None | Standard |
| Think High | `"thinking": {"type": "enabled", "budget_tokens": N}` | User-defined | Standard |
| Think Max | Special system prompt + `"thinking": {"type": "max"}` | Extended | 384K+ tokens recommended |
### Recommended Sampling Parameters

```json
{
  "temperature": 1.0,
  "top_p": 1.0
}
```
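Putting the mode flag and the sampling defaults together, a sketch of a Think High request against the chat completions endpoint; the budget value is an arbitrary example:

```python
import os
import requests

# Think High request body: mode flag from the Inference Modes table,
# sampling values from the recommended defaults above.
payload = {
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "thinking": {"type": "enabled", "budget_tokens": 8192},  # example budget
    "temperature": 1.0,
    "top_p": 1.0,
}

resp = requests.post(
    "https://api.deepseek.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
    json=payload,
    timeout=600,
)
print(resp.json())
```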
## Benchmark Reference

### V4-Pro-Max vs Frontier Models

| Benchmark | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|
| MMLU-Pro | 87.5% | 89.1% | 87.5% | 91.0% |
| GPQA Diamond | 90.1% | 91.3% | 93.0% | 94.3% |
| HLE | 37.7% | 40.0% | 39.8% | 44.4% |
| LiveCodeBench | 93.5% | 88.8% | N/A | 91.7% |
| Codeforces | 3206 | N/A | 3168 | 3052 |
| SWE-bench Verified | 80.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro | 55.4% | 57.3% | 57.7% | 54.2% |
| Terminal Bench 2.0 | 67.9% | 65.4% | 75.1% | 68.5% |
| MRCR 1M | 83.5% | 92.9% | N/A | 76.3% |
| CorpusQA 1M | 62.0% | 71.7% | N/A | 53.8% |
## Local Deployment Reference

| Configuration | Storage | VRAM | Min GPU Setup |
|---|---|---|---|
| V4-Flash (Full) | 160 GB | ~160 GB | 2× H100 80GB |
| V4-Flash (Q4 quant) | ~80 GB | ~80 GB | RTX 5090 |
| V4-Pro (Full) | 865 GB | ~865 GB | 16× H100 80GB |
| V4-Pro (Q4 quant) | ~200–400 GB | ~200–400 GB | 4–8× H100 80GB |
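Given the download sizes above, it is worth fetching weights ahead of time with the standard `huggingface_hub` client and the repository IDs from the map earlier; the `local_dir` path below is a placeholder:

```python
from huggingface_hub import snapshot_download

# Download the full V4-Flash checkpoint (~160 GB); downloads are resumable.
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="/models/deepseek-v4-flash",  # placeholder path
)
```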
## Chat Template

DeepSeek V4 does not use a standard HuggingFace Jinja chat template. Use the custom encoding scripts in each repository's `encoding/` folder:

```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

# Encode a standard messages list into the model's raw prompt format.
messages = [{"role": "user", "content": "Hello"}]
prompt = encode_messages(messages, thinking_mode="no_think")
# thinking_mode options: "no_think", "thinking", "max_thinking"
```
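For the return path, the same module exports `parse_message_from_completion_text`; its exact signature is not documented in this card, so the call below is an assumption inferred from the import:

```python
# Assumed usage, inferred from the import above: parse the raw completion
# text returned by the model back into a structured message.
completion_text = "..."  # raw model output goes here
message = parse_message_from_completion_text(completion_text)
```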
## Known Limitations

- Text-only at launch: no native image, audio, or video understanding in the April 2026 preview release
- Preview status: edge cases may exist; DeepSeek recommends following its official accounts for updates
- Think Max context requirement: a 384K+ token context window is required for best Think Max performance
- Large download: V4-Pro at 865 GB requires significant bandwidth and storage for local deployment
- Chat template: non-standard encoding requires the repository-provided scripts rather than standard HuggingFace pipeline tools
## Official Resources

- Official Twitter: @deepseek_ai
- GitHub: github.com/deepseek-ai
- HuggingFace: huggingface.co/deepseek-ai
- API Documentation: api-docs.deepseek.com
- Email: service@deepseek.com
- Web Chat: chat.deepseek.com
For developers building on platforms like Framia.pro that integrate DeepSeek V4's capabilities, this model card serves as the authoritative technical reference for all integration decisions.
## Citation

```bibtex
@misc{deepseekai2026deepseekv4,
  title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author={DeepSeek-AI},
  year={2026},
}
```