DeepSeek V4 Model Card: Full Technical Reference for Developers

Complete DeepSeek V4 model card: full specs, API reference, pricing, benchmark table, local deployment guide, and technical notes for V4-Pro and V4-Flash.

by Framia

The DeepSeek V4 model card consolidates everything a developer needs to understand and deploy the V4 series. This reference covers the full technical specifications, access methods, known limitations, and usage guidelines for both V4-Pro and V4-Flash.


Model Identity

| Field | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
|---|---|---|
| Model ID | `deepseek-v4-pro` | `deepseek-v4-flash` |
| Developer | DeepSeek-AI (Hangzhou DeepSeek Artificial Intelligence Co., Ltd.) | DeepSeek-AI (Hangzhou DeepSeek Artificial Intelligence Co., Ltd.) |
| Release Date | April 24, 2026 (Preview) | April 24, 2026 (Preview) |
| License | MIT License | MIT License |
| Model Type | Decoder-only Transformer, MoE | Decoder-only Transformer, MoE |
| Architecture | Hybrid Attention (CSA + HCA) + mHC | Hybrid Attention (CSA + HCA) + mHC |
| Total Parameters | 1.6T | 284B |
| Active Parameters | 49B | 13B |
| Context Length | 1,000,000 tokens | 1,000,000 tokens |
| Precision | FP4 + FP8 Mixed | FP4 + FP8 Mixed |
| Download Size | ~865 GB | ~160 GB |

HuggingFace Repository Map

| Repository | Type | URL |
|---|---|---|
| DeepSeek-V4-Pro | Instruct (RLHF-tuned) | huggingface.co/deepseek-ai/DeepSeek-V4-Pro |
| DeepSeek-V4-Pro-Base | Pre-trained base | huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base |
| DeepSeek-V4-Flash | Instruct (RLHF-tuned) | huggingface.co/deepseek-ai/DeepSeek-V4-Flash |
| DeepSeek-V4-Flash-Base | Pre-trained base | huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base |

API Reference

Endpoints

  • Base URL: https://api.deepseek.com/v1
  • Chat Completions: POST /chat/completions
  • Compatible formats: OpenAI ChatCompletions API, Anthropic Messages API

Model Names (API)

  • deepseek-v4-pro — Full capability flagship
  • deepseek-v4-flash — Fast and cost-efficient

⚠️ Deprecated (retiring July 24, 2026): deepseek-chat, deepseek-reasoner
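Since the endpoint is OpenAI-compatible, a request is a plain HTTP POST. A minimal sketch of the request body, assuming the standard OpenAI ChatCompletions schema; the placeholder key and example messages are illustrative:

```python
import json

# Build a request body for POST https://api.deepseek.com/v1/chat/completions.
# The model name comes from the card; the schema follows the OpenAI
# ChatCompletions format the card says is supported.
payload = {
    "model": "deepseek-v4-flash",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the V4 hybrid attention design."},
    ],
}

headers = {
    "Authorization": "Bearer <DEEPSEEK_API_KEY>",  # placeholder, not a real key
    "Content-Type": "application/json",
}

body = json.dumps(payload)  # send with any HTTP client, e.g. requests.post
```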

Pricing

| Model | Input | Output |
|---|---|---|
| `deepseek-v4-flash` | $0.14 / 1M tokens | $0.28 / 1M tokens |
| `deepseek-v4-pro` | $1.74 / 1M tokens | $3.48 / 1M tokens |
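A tiny helper makes the rate table concrete. `PRICES` and `estimate_cost` are hypothetical names for this sketch; the per-million-token rates are the ones above:

```python
# USD per 1M tokens, from the pricing table above.
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 100K input + 5K output tokens on v4-pro costs roughly $0.19:
cost = estimate_cost("deepseek-v4-pro", 100_000, 5_000)
```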

Architecture Details

Hybrid Attention System

| Layer Type | Mechanism | Purpose |
|---|---|---|
| Recent token layers | Standard attention | Full fidelity for nearby context |
| Mid-range token layers | Compressed Sparse Attention (CSA) | Efficient access to moderate-distance context |
| Long-range token layers | Heavily Compressed Attention (HCA) | Compact representation of distant history |
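As a mental model, the tiering above amounts to a distance-based router. The window thresholds below are invented for illustration only; DeepSeek has not published the actual CSA/HCA boundaries:

```python
def attention_tier(distance: int, recent_window: int = 4_096,
                   mid_window: int = 131_072) -> str:
    """Map a token's distance from the current position to an attention tier.

    Thresholds are illustrative, not the model's real configuration.
    """
    if distance <= recent_window:
        return "standard"  # full-fidelity attention over nearby tokens
    if distance <= mid_window:
        return "CSA"       # compressed sparse attention for mid-range context
    return "HCA"           # heavily compressed attention for distant history

tiers = [attention_tier(d) for d in (100, 50_000, 900_000)]
# → ["standard", "CSA", "HCA"]
```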

Efficiency vs V3.2 at 1M context:

  • FLOPs: 27% of V3.2 (73% reduction)
  • KV Cache: 10% of V3.2 (90% reduction)

Training Innovations

| Innovation | Description |
|---|---|
| Optimizer | Muon (replaces AdamW) |
| Residual connections | mHC (Manifold-Constrained Hyper-Connections) |
| Pre-training data | 32T+ diverse tokens |
| Post-training Stage 1 | Expert specialization via SFT + RL (GRPO) |
| Post-training Stage 2 | Unified consolidation via on-policy distillation |

Inference Modes

| Mode | API Parameter | Thinking Budget | Context Requirement |
|---|---|---|---|
| Non-think | `"thinking": {"type": "disabled"}` | None | Standard |
| Think High | `"thinking": {"type": "enabled", "budget_tokens": N}` | User-defined | Standard |
| Think Max | Special system prompt + `"thinking": {"type": "max"}` | Extended | 384K+ tokens recommended |

Recommended sampling parameters:

```json
{
  "temperature": 1.0,
  "top_p": 1.0
}
```

Benchmark Reference

V4-Pro-Max vs Frontier Models

| Benchmark | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High |
|---|---|---|---|---|
| MMLU-Pro | 87.5% | 89.1% | 87.5% | 91.0% |
| GPQA Diamond | 90.1% | 91.3% | 93.0% | 94.3% |
| HLE | 37.7% | 40.0% | 39.8% | 44.4% |
| LiveCodeBench | 93.5% | 88.8% | N/A | 91.7% |
| Codeforces | 3206 | N/A | 3168 | 3052 |
| SWE-bench Verified | 80.6% | 80.8% | N/A | 80.6% |
| SWE-bench Pro | 55.4% | 57.3% | 57.7% | 54.2% |
| Terminal Bench 2.0 | 67.9% | 65.4% | 75.1% | 68.5% |
| MRCR 1M | 83.5% | 92.9% | N/A | 76.3% |
| CorpusQA 1M | 62.0% | 71.7% | N/A | 53.8% |

Local Deployment Reference

| Configuration | Storage | VRAM | Min GPU Setup |
|---|---|---|---|
| V4-Flash (Full) | 160 GB | ~160 GB | 2× H100 80GB |
| V4-Flash (Q4 quant) | ~80 GB | ~80 GB | RTX 5090 |
| V4-Pro (Full) | 865 GB | ~865 GB | 16× H100 80GB |
| V4-Pro (Q4 quant) | ~200–400 GB | ~200–400 GB | 4–8× H100 80GB |
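A back-of-envelope check on the download sizes: assuming the FP4 + FP8 mix averages roughly 4.5 bits per weight for V4-Flash and about 4.3 for V4-Pro (assumptions for illustration, not published figures), the weights-only sizes land near the card's numbers:

```python
def approx_weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in GB; ignores metadata, tokenizer files, and overhead."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

flash_gb = approx_weight_gb(284, 4.5)   # ≈ 160 GB, matching the card's download size
pro_gb = approx_weight_gb(1600, 4.33)   # ≈ 866 GB, close to the card's ~865 GB
```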

Chat Template

DeepSeek V4 does not use a standard HuggingFace Jinja chat template. Use the custom encoding scripts in each repository's encoding/ folder.

```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

# Example message list; encode_messages produces the model-ready prompt string.
messages = [{"role": "user", "content": "Hello"}]
prompt = encode_messages(messages, thinking_mode="no_think")
# thinking_mode options: "no_think", "thinking", "max_thinking"
```

Known Limitations

  • Text-only at launch: No native image, audio, or video understanding in the April 2026 preview release
  • Preview status: Edge cases may exist; DeepSeek recommends relying on official accounts for updates
  • Think Max context requirement: 384K+ token context window required for best Think Max performance
  • Large download: V4-Pro at 865 GB requires significant bandwidth and storage for local deployment
  • Chat template: Non-standard encoding requires using repository-provided scripts rather than standard HuggingFace pipeline tools

Contact and Support

  • Official Twitter: @deepseek_ai
  • GitHub: github.com/deepseek-ai
  • HuggingFace: huggingface.co/deepseek-ai
  • API Documentation: api-docs.deepseek.com
  • Email: service@deepseek.com
  • Web Chat: chat.deepseek.com

For developers building on platforms like Framia.pro that integrate DeepSeek V4's capabilities, this model card serves as the authoritative technical reference for all integration decisions.


Citation

```bibtex
@misc{deepseekai2026deepseekv4,
  title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author={DeepSeek-AI},
  year={2026},
}
```