# Can I Run a Local LLM on My PC?
To run an LLM locally, the key factor is having enough VRAM (GPU memory) for the model. Even for the same model, memory requirements vary significantly with the quantization level. This post organizes system requirements for major open-source LLMs by size and compares costs for API-only models.
## What Is Quantization?
Quantization is a technique that reduces memory and disk usage by lowering the precision of model weights. There is some quality loss, but Q4 (4-bit) quantization keeps quality close to FP16 (16-bit) while using roughly 1/4 the memory.
| Quantization | Bit Width | Memory Ratio | Quality | When to Use |
|---|---|---|---|---|
| FP16 | 16-bit | 100% (baseline) | Best | When GPU headroom is available |
| Q8_0 | 8-bit | ~50% | Very good | High quality + memory savings |
| Q4_K_M | 4-bit | ~25% | Good | Standard for local execution |
| Q2_K | 2-bit | ~12.5% | Noticeable degradation | Extreme memory constraints |
For most local users, Q4_K_M is the optimal balance between quality and memory.
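The memory ratios above follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per weight, plus runtime overhead for the KV cache and buffers. The sketch below applies that rule; the effective bits-per-weight figures for the GGUF quant formats and the 10% overhead factor are approximations, and real usage also grows with context length.

```python
# Rough VRAM estimate for a dense model: parameters x bytes per weight,
# plus ~10% overhead for runtime buffers. This is a rule of thumb, not an
# exact figure -- actual usage varies by runtime and context length.

# Approximate effective bits per weight for common GGUF quant formats
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q2_K": 2.6}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead: float = 1.1) -> float:
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return round(params_billions * bytes_per_weight * overhead, 1)

# Example: an 8B model at Q4_K_M
print(estimate_vram_gb(8, "Q4_K_M"))   # -> 5.3 (GB, under these assumptions)
```

The result lines up with the ballpark figures in the tables below; treat any estimate within a gigabyte or two as "check before you buy".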
## Small Models (1B–3B): Lightweight Tasks
These models can run on laptops or low-spec PCs. Suitable for simple summarization, translation, and chatbots.
| Model | Parameters | Q4 VRAM | FP16 VRAM | Q4 Disk | Recommended GPU |
|---|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~1GB | ~3GB | ~0.7GB | 4GB GPU or above |
| Llama 3.2 3B | 3B | ~2.5GB | ~6GB | ~2GB | RTX 3060 8GB |
| Gemma 4 E2B | 2.3B | ~2GB | ~5GB | ~1.5GB | 4GB GPU or above |
| Gemma 4 E4B | 4.5B | ~4GB | ~15GB | ~3GB | 8GB GPU or above |
| Phi-4 Mini | 3.8B | ~2.1GB | ~7.5GB | ~2.1GB | 8GB GPU or above |
```shell
# Run a small model directly with Ollama
ollama pull gemma4:e4b
ollama run gemma4:e4b "Explain the difference between lists and tuples in Python"
# Q4 quantization applied by default, uses ~4GB VRAM
```
Gemma 4 E2B/E4B are models that maximize parameter efficiency using PLE (Per-Layer Embeddings). They support 128K context and multimodal capabilities (vision + audio) while remaining runnable on mobile and edge devices.
## Mid-Size Models (7B–14B): General-Purpose Local LLM
This is the most popular size range. Runnable on RTX 4060 Ti (16GB) to RTX 4090 (24GB), delivering GPT-3.5 level performance locally.
| Model | Parameters | Q4 VRAM | FP16 VRAM | Q4 Disk | Recommended GPU |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~5GB | ~16GB | ~4.5GB | RTX 4060 Ti 16GB |
| Mistral 7B v0.3 | 7.3B | ~4.5GB | ~14.5GB | ~4GB | RTX 3060 12GB |
| Gemma 2 9B | 9.2B | ~5.5GB | ~18GB | ~5GB | RTX 4060 Ti 16GB |
| Qwen 2.5 7B | 7.6B | ~5GB | ~17GB | ~4.5GB | RTX 4060 Ti 16GB |
| Qwen 2.5 14B | 14.8B | ~9GB | ~30GB | ~8GB | RTX 4090 24GB |
| Phi-4 | 14B | ~8GB | ~28GB | ~8GB | RTX 4090 24GB |
```shell
# Run a mid-size model
ollama pull qwen2.5:14b
ollama run qwen2.5:14b "Analyze the time complexity of this code: def fib(n): ..."
# Q4 quantization, ~9GB VRAM, generates ~40 tokens/sec on an RTX 4090
```
Model strengths: Llama 3.1 8B is a general-purpose all-rounder, Qwen 2.5 excels in multilingual (CJK) tasks, and Phi-4 is strong in reasoning and math.
## Gemma 4: The Latest 2026 Model
Released by Google in April 2026, Gemma 4 shows dramatically improved benchmark performance over previous generations. Notably, the 26B MoE model scored 89.2% on the AIME 2026 math benchmark (Gemma 3 27B: 20.8%) and 80.0% on LiveCodeBench coding (Gemma 3: 29.1%). It’s available under the Apache 2.0 license for free commercial use.
| Feature | Gemma 4 |
|---|---|
| Multimodal | Native vision + audio support |
| Context | E2B/E4B: 128K, 26B/31B: 256K |
| Languages | 140+ languages supported |
| Tool use | Native function calling |
| License | Apache 2.0 (fully commercial) |
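The native function calling in the table means the model can emit structured tool calls. As a hedged sketch of what that looks like in practice: local runtimes such as Ollama accept an OpenAI-style tool definition in the `tools` field of a chat request. The `get_weather` tool and the `gemma4:e4b` tag here are illustrative assumptions, not part of any official example.

```python
import json

# A minimal OpenAI-style tool definition, as accepted by the "tools" field
# of Ollama's /api/chat endpoint. Tool name and model tag are illustrative.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "gemma4:e4b",  # assumed tag, matching the examples above
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [weather_tool],
}
print(json.dumps(request_body, indent=2))
```

A model with native tool support responds with a structured `tool_calls` entry naming the function and its arguments, which your code then executes and feeds back into the conversation.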
## Large Models (27B–72B): High-Performance Local
These require multi-GPU setups or a single high-capacity GPU, and can deliver near-GPT-4 performance locally.
| Model | Parameters | Q4 VRAM | FP16 VRAM | Q4 Disk | Recommended GPU |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B (MoE) | 26B (3.8B active) | ~18GB | ~28GB (int8) | ~16GB | RTX 4070 Ti 16GB |
| Gemma 4 31B | 30.7B | ~20GB | ~34GB (int8) | ~19GB | RTX 4090 24GB |
| Qwen 2.5 32B | 32.5B | ~20GB | ~65GB | ~19GB | RTX 4090 or A6000 48GB |
| Llama 3.3 70B | 70B | ~43GB | ~140GB | ~40GB | 2x RTX 4090 or A100 80GB |
| Qwen 2.5 72B | 72.7B | ~36GB | ~144GB | ~36GB | 2x RTX 4090 or A100 80GB |
| Mixtral 8x7B | 46.7B (12.9B active) | ~26GB | ~93GB | ~26GB | RTX 4090 (Q4) |
```shell
# Run a 70B model across multiple GPUs
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M
# Q4 quantization, ~43GB VRAM -> runnable on 2x RTX 4090 (48GB total)
```
Mixtral 8x7B uses a MoE (Mixture of Experts) architecture that activates only 12.9B parameters per token. However, all 46.7B parameters must be loaded into memory, so VRAM requirements are based on the full model size.
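The distinction matters when budgeting hardware: memory scales with total parameters, while per-token compute scales with active parameters. A quick sanity check using the Mixtral figures from the table above (the ~4.5 effective bits per weight for Q4 is an approximation):

```python
# MoE rule of thumb: VRAM scales with TOTAL parameters (all experts must be
# resident in memory), while per-token compute scales with ACTIVE parameters.

def q4_weights_gb(params_billions: float, bits: float = 4.5) -> float:
    """Approximate Q4 weight size; ~4.5 effective bits/weight is an assumption."""
    return round(params_billions * bits / 8, 1)

total_b, active_b = 46.7, 12.9  # Mixtral 8x7B: total vs active parameters
print("weights to load:", q4_weights_gb(total_b), "GB")  # -> 26.3 GB, full model
print("per-token compute comparable to a", active_b, "B dense model")
```

This is why Mixtral needs roughly the VRAM of a 47B dense model but generates tokens at closer to 13B-model speed.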
## Extra-Large Models (100B+): Data Center Scale
Virtually impossible to run on consumer PCs — these are used through server clusters or APIs.
| Model | Parameters | Q4 VRAM | Disk | Required Hardware |
|---|---|---|---|---|
| Mistral Large 2 | 123B | ~58GB | ~58GB | Multi-GPU or API |
| Llama 3.1 405B | 405B | ~230GB | ~230GB | 8x A100/H100 |
| DeepSeek V3 | 671B (37B active) | ~386GB | ~350GB | 8x H100 80GB |
## API-Only Models: No Hardware Required
When local execution is impractical or inefficient, you can access top-performing models via API.
| Model | Context | Input $/M tokens | Output $/M tokens | Features |
|---|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 | Multimodal (text + image + audio) |
| GPT-4o Mini | 128K | $0.15 | $0.60 | Low cost, vision support |
| Claude Sonnet 4.6 | 1M | $3.00 | $15.00 | Coding specialist, excellent value |
| Claude Opus 4.6 | 1M | $5.00 | $25.00 | Top performance, extended thinking |
| Claude Haiku 4.5 | 200K | $1.00 | $5.00 | Fast responses, lowest price |
| DeepSeek V3 API | 128K | ~$0.27 | ~$1.10 | Best open source, ultra-low cost |
Cost reduction tips: Claude’s prompt caching (up to 90% savings on cached input tokens) and the Batch API (50% discount) can significantly reduce costs.
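To see what those discounts are worth, here is a simplified monthly cost estimate using the per-million-token prices from the table above. The cache-hit rate and batch share are illustrative inputs, and real prompt caching also has cache-write surcharges this sketch ignores; check current provider pricing before relying on these numbers.

```python
# Simplified monthly API cost estimate (prices are $ per million tokens).
# Ignores cache-write surcharges and tiered pricing -- a rough planning tool.

def monthly_cost(input_m, output_m, in_price, out_price,
                 cached_share=0.0, cache_discount=0.9,
                 batch_share=0.0, batch_discount=0.5):
    # Cached input tokens are billed at (1 - cache_discount) of the normal rate
    in_cost = input_m * in_price * (1 - cached_share * cache_discount)
    total = in_cost + output_m * out_price
    # Batch-eligible traffic gets a flat discount on its whole cost
    return total * (1 - batch_share * batch_discount)

# Claude Sonnet 4.6 prices from the table: $3/M input, $15/M output.
# 50M input tokens (60% cache hits), 10M output tokens, no batching:
print(round(monthly_cost(50, 10, 3.00, 15.00, cached_share=0.6), 2))  # -> 219.0
```

Without caching the same workload would cost $300, so a 60% cache-hit rate already cuts the bill by about a quarter.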
## Recommended Models by Use Case
| Use Case | Recommended Model | Reason |
|---|---|---|
| Personal chatbot/learning | Gemma 4 E4B, Phi-4 Mini | Runs on low-spec PCs, multimodal support |
| Coding assistant | Gemma 4 26B MoE, Qwen 2.5 14B | 80% on LiveCodeBench, excellent code analysis |
| CJK language specialty | Qwen 2.5 32B/72B | Best multilingual CJK performance |
| Document analysis/RAG | Claude Sonnet 4.6 | 1M context, accurate extraction |
| Cost minimization | DeepSeek V3 API | Best open source + ultra-low cost API |
| Highest quality needed | Claude Opus 4.6, GPT-4o | Complex reasoning, creative tasks |
## Summary
The key to choosing an LLM is a model size that fits your hardware and model characteristics that match your use case. Here are the key takeaways:
- Q4 quantization is the standard for local execution — maintains practical quality at 1/4 the memory of FP16
- 8B models (~5GB VRAM) run fine on RTX 4060 Ti; 70B models need 2x RTX 4090
- Gemma 4 delivers top coding/math benchmark performance with MoE 26B under Apache 2.0 license
- Qwen 2.5 is the strongest open-source model for multilingual tasks including CJK languages
- If you don’t have a GPU, API usage is realistic — DeepSeek V3 offers the best value
- MoE models (Mixtral, DeepSeek) have fewer active parameters but require VRAM for the full model size