How to Run DeepSeek V4 Locally: Hardware Requirements and Setup Guide
Running DeepSeek V4 locally gives you complete privacy, no per-token API costs, and full control over inference settings. Both V4-Pro and V4-Flash are MIT-licensed open-weight models available for free download from HuggingFace. Here's everything you need to know to run them on your own hardware.
Should You Run V4 Locally or Use the API?
Before diving into setup, consider your use case:
| Factor | Local Deployment | API |
|---|---|---|
| Cost (high volume) | ✅ Lower (hardware amortized) | ❌ Per-token fees |
| Privacy | ✅ Complete | ❌ Data sent to DeepSeek |
| Setup complexity | ❌ High | ✅ Zero |
| Latency | ✅ No network round-trip | ❌ Network dependent |
| Hardware needed | ❌ Significant | ✅ None |
| Latest model versions | ❌ Manual updates | ✅ Automatic |
Local deployment is best for: enterprise privacy requirements, high-volume production where GPU costs amortize below API pricing, and research/fine-tuning workflows.
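To see where the break-even point for "high volume" actually sits, run the numbers for your own setup. The sketch below is back-of-the-envelope only; every figure in it (hardware cost, amortization window, power, API rate) is a placeholder assumption, not DeepSeek's actual pricing:
# Break-even sketch: all numbers below are illustrative assumptions.
hardware_cost_usd = 60_000      # e.g. a 2x H100 server (assumption)
amortization_months = 36        # depreciation window (assumption)
power_usd_per_month = 400      # electricity + cooling (assumption)
api_usd_per_m_tokens = 1.0      # API price per million tokens (assumption)
monthly_local = hardware_cost_usd / amortization_months + power_usd_per_month
breakeven_m_tokens = monthly_local / api_usd_per_m_tokens
print(f"Local: ~${monthly_local:,.0f}/month")
print(f"Break-even: ~{breakeven_m_tokens:,.0f}M tokens/month")
If your sustained volume sits well above the break-even figure, local deployment wins on cost; below it, the API is cheaper.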
Hardware Requirements
DeepSeek-V4-Flash (284B / 13B active)
Native precision (mixed FP8/FP4):
- Download size: ~160 GB
- VRAM needed: ~160 GB for the weights, plus headroom for KV cache and activations
- Recommended GPU: 2× NVIDIA H100 80GB, or 2× H200, or 4× A100 40GB
Quantized (community GGUF/GPTQ):
- Size: ~80 GB (4-bit quantized)
- VRAM needed: ~80 GB
- Feasible on: 1× NVIDIA RTX 5090 (32 GB) or 2× RTX 4090 (24 GB each = 48 GB); neither reaches 80 GB on its own, so CPU offload is required (a quick VRAM check is sketched below)
- With CPU offload: RTX 5090 or dual RTX 4090s, plus 64 GB+ of system RAM
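Before committing to an 80 GB download, it's worth confirming what your machine actually has. A minimal PyTorch sketch that totals VRAM across all visible GPUs:
import torch
# Sum VRAM across visible GPUs and compare against the ~80 GB
# a 4-bit V4-Flash build needs (per the estimates above).
total_gb = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3
print(f"Total VRAM: {total_gb:.1f} GB")
if total_gb < 80:
    print("Expect to rely on CPU offload (and enough system RAM).")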
DeepSeek-V4-Pro (1.6T / 49B active)
Native precision (mixed FP8/FP4):
- Download size: ~865 GB
- VRAM needed: ~865 GB for the weights, plus headroom for KV cache and activations
- Recommended cluster: 16× NVIDIA H100 80GB, or equivalent
- Minimum viable: 12× H100 80GB with optimized serving
Quantized (community builds):
- Size: ~200–400 GB (4-bit or 8-bit quantized)
- VRAM needed: ~200–400 GB
- Feasible on: 4–8× H100 80GB, or 8–16× A100 40GB
Honest assessment: V4-Pro local deployment is only practical for organizations with significant GPU infrastructure. V4-Flash is the accessible option for individuals and small teams.
Step 1: Download the Model Weights
Using HuggingFace CLI (Recommended)
# Install the CLI
pip install huggingface_hub
# Download V4-Flash instruct model (~160 GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./models/DeepSeek-V4-Flash \
--resume-download
# Download V4-Flash Base (optional, for fine-tuning)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash-Base \
--local-dir ./models/DeepSeek-V4-Flash-Base \
--resume-download
The --resume-download flag is critical for these large downloads — it allows you to restart interrupted downloads without losing progress.
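If you prefer to script the download rather than shell out to the CLI, huggingface_hub exposes the same functionality programmatically, and interrupted downloads resume when you re-run the call. A minimal sketch using the repo id from the commands above:
from huggingface_hub import snapshot_download
# Downloads the full repo snapshot; safe to re-run after an interruption.
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="./models/DeepSeek-V4-Flash",
)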
From ModelScope (Faster in China)
pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-V4-Flash --local_dir ./models/DeepSeek-V4-Flash
Step 2: Set Up the Inference Environment
DeepSeek V4 requires custom encoding scripts for the chat template. Clone the model's inference tools:
# Shallow-clone the repo for its inference scripts only; GIT_LFS_SKIP_SMUDGE
# keeps the multi-GB weight files as LFS pointers instead of re-downloading them
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash ./DeepSeek-V4-Flash-repo
cd DeepSeek-V4-Flash-repo
Install dependencies:
pip install transformers torch accelerate
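Before pointing transformers at a 160 GB checkpoint, a quick sanity check that the stack sees all of your GPUs can save a failed load:
import torch
import transformers
# Confirm library versions and that every GPU is visible to PyTorch.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")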
Step 3: Run Basic Inference
Use the provided encoding scripts:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
import transformers
import torch
model_path = "./models/DeepSeek-V4-Flash"
# Load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
# Load model (with automatic device mapping for multi-GPU)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto", # Distributes across available GPUs
torch_dtype="auto",  # load in the checkpoint's native precision (mixed FP8/FP4)
trust_remote_code=True
)
# Encode a conversation
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
]
# Non-thinking mode
prompt = encode_messages(messages, thinking_mode="no_think")
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
output = model.generate(
inputs,
max_new_tokens=2048,
temperature=1.0,
top_p=1.0,
do_sample=True
)
response_text = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=False)
print(parse_message_from_completion_text(response_text))
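For interactive use you may prefer to stream tokens as they are generated rather than wait for the full completion. A minimal sketch reusing the model, tokenizer, and inputs from above with transformers' built-in TextStreamer:
from transformers import TextStreamer
# Prints tokens to stdout as they arrive; skip_prompt suppresses the echo
# of the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
with torch.no_grad():
    model.generate(
        inputs,
        max_new_tokens=2048,
        temperature=1.0,
        top_p=1.0,
        do_sample=True,
        streamer=streamer,
    )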
Step 4: Run Community Quantizations (llama.cpp / Ollama)
If your hardware is limited, community-provided quantizations dramatically reduce requirements:
Using Ollama (Easiest)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull community-quantized V4-Flash (check Ollama library for available versions)
ollama pull deepseek-v4-flash:q4_k_m
# Run it
ollama run deepseek-v4-flash:q4_k_m
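Ollama also serves a local REST API on port 11434, so the same model can be called from code. A sketch using Python's requests; the model tag mirrors the pull command above and depends on which quantizations the community actually publishes:
import requests
# "stream": False returns one JSON object instead of chunked lines.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-v4-flash:q4_k_m",  # tag from the pull step above
        "prompt": "Write a Python function to reverse a linked list.",
        "stream": False,
    },
)
print(resp.json()["response"])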
Using llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
# Download GGUF quantized V4-Flash from HuggingFace community repos
# Then run:
./build/bin/llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
-n 2048 \
--ctx-size 8192 \
-p "You are a helpful assistant."
Recommended Sampling Parameters
DeepSeek officially recommends:
temperature = 1.0
top_p = 1.0
For Think Max mode, ensure your context window is set to at least 384K tokens.
Performance Expectations
| Hardware | Model | Throughput (approx.) |
|---|---|---|
| 2× H100 80GB | V4-Flash | ~40–80 tokens/sec |
| 4× A100 40GB | V4-Flash | ~20–40 tokens/sec |
| 8× H100 80GB | V4-Flash | ~100–150 tokens/sec |
| 16× H100 80GB | V4-Pro | ~15–30 tokens/sec |
| RTX 5090 (quantized) | V4-Flash Q4 | ~5–15 tokens/sec |
These are rough estimates — actual throughput depends on context length, batch size, and framework optimizations.
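Rather than trusting the table, you can benchmark your own setup by timing a generation and dividing by the number of new tokens. A minimal sketch reusing the model, tokenizer, and inputs from Step 3:
import time
# Single-request throughput; batched serving frameworks typically do better.
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=256, do_sample=True,
                         temperature=1.0, top_p=1.0)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s = {new_tokens / elapsed:.1f} tok/s")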
Privacy Benefits for Enterprise
For enterprises with sensitive data — healthcare records, legal documents, financial data — local deployment of DeepSeek V4 means zero data leaves your infrastructure. Unlike API-based services, there's no data retention, no logging on third-party servers, and no compliance concerns about sending proprietary information to external APIs.
This is particularly relevant for enterprise customers of platforms like Framia.pro who need AI-powered creative tools without data-sovereignty concerns.
Conclusion
Running DeepSeek V4-Flash locally is feasible on a dual-H100 setup or high-end quantized hardware. V4-Pro requires significant GPU infrastructure but delivers unmatched open-source capability. The MIT license means you own the deployment completely — a key advantage for privacy-sensitive and high-volume use cases.