How to Run DeepSeek V4 Locally: Hardware Requirements and Setup Guide
Running DeepSeek V4 locally gives you complete privacy, no per-token API costs, and full control over inference settings. Both V4-Pro and V4-Flash are MIT-licensed open-weight models available for free download from HuggingFace. Here's everything you need to know to run them on your own hardware.
Should You Run V4 Locally or Use the API?
Before diving into setup, consider your use case:
| Factor | Local Deployment | API |
|---|---|---|
| Cost (high volume) | ✅ Lower (hardware amortized) | ❌ Per-token fees |
| Privacy | ✅ Complete | ❌ Data sent to DeepSeek |
| Setup complexity | ❌ High | ✅ Zero |
| Latency | ✅ No network round-trip | ❌ Network dependent |
| Hardware needed | ❌ Significant | ✅ None |
| Latest model versions | ❌ Manual updates | ✅ Automatic |
Local deployment is best for: enterprise privacy requirements, high-volume production where GPU costs amortize below API pricing, and research/fine-tuning workflows.
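To see where the break-even point for "high volume" actually sits, run the numbers for your own setup. The sketch below is back-of-the-envelope only; every figure in it (hardware cost, amortization window, power, API rate) is a placeholder assumption, not DeepSeek's actual pricing:
# Break-even sketch: all numbers below are illustrative assumptions.
hardware_cost_usd = 60_000      # e.g. a 2x H100 server (assumption)
amortization_months = 36        # depreciation window (assumption)
power_usd_per_month = 400      # electricity + cooling (assumption)
api_usd_per_m_tokens = 1.0      # API price per million tokens (assumption)
monthly_local = hardware_cost_usd / amortization_months + power_usd_per_month
breakeven_m_tokens = monthly_local / api_usd_per_m_tokens
print(f"Local: ~${monthly_local:,.0f}/month")
print(f"Break-even: ~{breakeven_m_tokens:,.0f}M tokens/month")
If your sustained volume sits well above the break-even figure, local deployment wins on cost; below it, the API is cheaper.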
Hardware Requirements
DeepSeek-V4-Flash (284B / 13B active)
Native precision (mixed FP8/FP4):
- Download size: ~160 GB
- VRAM needed: ~160 GB for the weights, plus headroom for KV cache and activations
- Recommended GPU: 2× NVIDIA H100 80GB, or 2× H200, or 4× A100 40GB
Quantized (community GGUF/GPTQ):
- Size: ~80 GB (4-bit quantized)
- VRAM needed: ~80 GB
- Feasible on: 1× NVIDIA RTX 5090 (32 GB) or 2× RTX 4090 (24 GB each = 48 GB); neither reaches 80 GB on its own, so CPU offload is required (a quick VRAM check is sketched below)
- With CPU offload: RTX 5090 or dual RTX 4090s, plus 64 GB+ of system RAM
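Before committing to an 80 GB download, it's worth confirming what your machine actually has. A minimal PyTorch sketch that totals VRAM across all visible GPUs:
import torch
# Sum VRAM across visible GPUs and compare against the ~80 GB
# a 4-bit V4-Flash build needs (per the estimates above).
total_gb = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3
print(f"Total VRAM: {total_gb:.1f} GB")
if total_gb < 80:
    print("Expect to rely on CPU offload (and enough system RAM).")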
DeepSeek-V4-Pro (1.6T / 49B active)
Native precision (mixed FP8/FP4):
- Download size: ~865 GB
- VRAM needed: ~865 GB for the weights, plus headroom for KV cache and activations
- Recommended cluster: 16× NVIDIA H100 80GB, or equivalent
- Minimum viable: 12× H100 80GB with optimized serving
Quantized (community builds):
- Size: ~200–400 GB (4-bit or 8-bit quantized)
- VRAM needed: ~200–400 GB
- Feasible on: 4–8× H100 80GB, or 8–16× A100 40GB
Honest assessment: V4-Pro local deployment is only practical for organizations with significant GPU infrastructure. V4-Flash is the accessible option for individuals and small teams.
Step 1: Download the Model Weights
Using HuggingFace CLI (Recommended)
# Install the CLI
pip install huggingface_hub
# Download V4-Flash instruct model (~160 GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./models/DeepSeek-V4-Flash \
--resume-download
# Download V4-Flash Base (optional, for fine-tuning)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash-Base \
--local-dir ./models/DeepSeek-V4-Flash-Base \
--resume-download
The --resume-download flag is critical for these large downloads — it allows you to restart interrupted downloads without losing progress.
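If you prefer to script the download rather than shell out to the CLI, huggingface_hub exposes the same functionality programmatically, and interrupted downloads resume when you re-run the call. A minimal sketch using the repo id from the commands above:
from huggingface_hub import snapshot_download
# Downloads the full repo snapshot; safe to re-run after an interruption.
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="./models/DeepSeek-V4-Flash",
)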
From ModelScope (Faster in China)
pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-V4-Flash --local_dir ./models/DeepSeek-V4-Flash
Step 2: Set Up the Inference Environment
DeepSeek V4 requires custom encoding scripts for the chat template. Clone the model's inference tools:
# Shallow-clone the repo for its inference scripts only; GIT_LFS_SKIP_SMUDGE
# keeps the multi-GB weight files as LFS pointers instead of re-downloading them
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash ./DeepSeek-V4-Flash-repo
cd DeepSeek-V4-Flash-repo
Install dependencies:
pip install transformers torch accelerate
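Before pointing transformers at a 160 GB checkpoint, a quick sanity check that the stack sees all of your GPUs can save a failed load:
import torch
import transformers
# Confirm library versions and that every GPU is visible to PyTorch.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")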
Step 3: Run Basic Inference
Use the provided encoding scripts:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
import transformers
import torch
model_path = "./models/DeepSeek-V4-Flash"
# Load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
# Load model (with automatic device mapping for multi-GPU)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto", # Distributes across available GPUs
torch_dtype="auto",  # load in the checkpoint's native precision (mixed FP8/FP4)
trust_remote_code=True
)
# Encode a conversation
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
]
# Non-thinking mode
prompt = encode_messages(messages, thinking_mode="no_think")
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
output = model.generate(
inputs,
max_new_tokens=2048,
temperature=1.0,
top_p=1.0,
do_sample=True
)
response_text = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=False)
print(parse_message_from_completion_text(response_text))
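For interactive use you may prefer to stream tokens as they are generated rather than wait for the full completion. A minimal sketch reusing the model, tokenizer, and inputs from above with transformers' built-in TextStreamer:
from transformers import TextStreamer
# Prints tokens to stdout as they arrive; skip_prompt suppresses the echo
# of the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
with torch.no_grad():
    model.generate(
        inputs,
        max_new_tokens=2048,
        temperature=1.0,
        top_p=1.0,
        do_sample=True,
        streamer=streamer,
    )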
Step 4: Run Community Quantizations (llama.cpp / Ollama)
If your hardware is limited, community-provided quantizations dramatically reduce requirements:
Using Ollama (Easiest)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull community-quantized V4-Flash (check Ollama library for available versions)
ollama pull deepseek-v4-flash:q4_k_m
# Run it
ollama run deepseek-v4-flash:q4_k_m
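Ollama also serves a local REST API on port 11434, so the same model can be called from code. A sketch using Python's requests; the model tag mirrors the pull command above and depends on which quantizations the community actually publishes:
import requests
# "stream": False returns one JSON object instead of chunked lines.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-v4-flash:q4_k_m",  # tag from the pull step above
        "prompt": "Write a Python function to reverse a linked list.",
        "stream": False,
    },
)
print(resp.json()["response"])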
Using llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
# Download GGUF quantized V4-Flash from HuggingFace community repos
# Then run:
./build/bin/llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
-n 2048 \
--ctx-size 8192 \
-p "You are a helpful assistant."
Recommended Sampling Parameters
DeepSeek officially recommends:
temperature = 1.0
top_p = 1.0
For Think Max mode, ensure your context window is set to at least 384K tokens.
Performance Expectations
| Hardware | Model | Throughput (approx.) |
|---|---|---|
| 2× H100 80GB | V4-Flash | ~40–80 tokens/sec |
| 4× A100 40GB | V4-Flash | ~20–40 tokens/sec |
| 8× H100 80GB | V4-Flash | ~100–150 tokens/sec |
| 16× H100 80GB | V4-Pro | ~15–30 tokens/sec |
| RTX 5090 (quantized) | V4-Flash Q4 | ~5–15 tokens/sec |
These are rough estimates — actual throughput depends on context length, batch size, and framework optimizations.
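Rather than trusting the table, you can benchmark your own setup by timing a generation and dividing by the number of new tokens. A minimal sketch reusing the model, tokenizer, and inputs from Step 3:
import time
# Single-request throughput; batched serving frameworks typically do better.
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=256, do_sample=True,
                         temperature=1.0, top_p=1.0)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s = {new_tokens / elapsed:.1f} tok/s")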
Privacy Benefits for Enterprise
For enterprises with sensitive data — healthcare records, legal documents, financial data — local deployment of DeepSeek V4 means zero data leaves your infrastructure. Unlike API-based services, there's no data retention, no logging on third-party servers, and no compliance concerns about sending proprietary information to external APIs.
This is particularly relevant for enterprise customers of platforms like Framia.pro who need AI-powered creative tools without data-sovereignty concerns.
Conclusion
Running DeepSeek V4-Flash locally is feasible on a dual-H100 setup or high-end quantized hardware. V4-Pro requires significant GPU infrastructure but delivers unmatched open-source capability. The MIT license means you own the deployment completely — a key advantage for privacy-sensitive and high-volume use cases.