## Why You Need a Local LLM
Cloud APIs are convenient, but they come with constraints around cost, privacy, and internet dependency. A local LLM means running an AI model directly on your own computer. It’s like having a coffee machine at home — no need to visit a cafe, you can make the coffee you want whenever you want.
Key advantages of local LLMs:
| Advantage | Description |
|---|---|
| Cost savings | No API call costs (only electricity) |
| Privacy | Data never leaves your machine |
| Offline use | Works without internet |
| Customization | Free to fine-tune and customize prompts |
| Latency | No network delay |
## Tool-by-Tool Comparison
| Feature | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Difficulty | Very easy | Medium | Medium |
| Installation | One-click install | Requires compilation | pip install |
| GPU required | No | No | Recommended (CUDA) |
| API server | Built-in (OpenAI compatible) | Built-in (llama-server) | Built-in |
| Quantization | GGUF | GGUF (core feature) | AWQ, GPTQ |
| Batch processing | Limited | Limited | Excellent (PagedAttention) |
| Best for | Personal dev, experiments | Embedded, optimization | Production serving |
## Ollama: The Easiest Way to Start
Ollama manages LLMs with pull and run commands, just like Docker. It supports macOS, Linux, and Windows.
### Installation and Basic Usage

```shell
# macOS/Linux installation
curl -fsSL https://ollama.ai/install.sh | sh

# Download and run model
ollama pull llama3.2    # Download model (3B, ~2GB)
ollama run llama3.2     # Run in chat mode

# List models
ollama list
# NAME              SIZE     MODIFIED
# llama3.2:latest   2.0 GB   2 minutes ago
# gemma2:2b         1.6 GB   1 hour ago

# Delete model
ollama rm gemma2:2b
```
### Using as an API Server
Ollama provides an OpenAI-compatible API by default. You can use existing OpenAI SDK code with almost no modifications.
```python
# pip install openai
from openai import OpenAI

# Connect to Ollama local server (default port: 11434)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require auth, any value works
)

# ChatCompletion call (same interface as OpenAI API)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer in English."},
        {"role": "user", "content": "Explain Python list comprehensions."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
# Output: List comprehension is a concise Python syntax for creating
# new lists based on existing lists...

# Streaming response
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Custom Models with Modelfile

```
# Modelfile -- Custom model definition
FROM llama3.2

# Set system prompt
SYSTEM """
You are a Python coding expert.
Always include comments in your code.
Provide concise and practical answers.
"""

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
```

```shell
# Build and run custom model
ollama create python-expert -f Modelfile
ollama run python-expert
# >>> Write a Fibonacci function
# def fibonacci(n):
#     """Returns the nth Fibonacci number."""
#     ...
```
## llama.cpp: The Ultimate in Optimization
llama.cpp is an LLM inference engine written in C/C++ that performs well even in CPU-only environments. Through quantization, model sizes can be dramatically reduced.
### Quantization Level Comparison
| Quantization | Bits | Model Size (7B) | Quality | Speed |
|---|---|---|---|---|
| FP16 | 16-bit | ~14GB | Best | Slow |
| Q8_0 | 8-bit | ~7GB | Very good | Medium |
| Q5_K_M | 5-bit | ~5GB | Good | Fast |
| Q4_K_M | 4-bit | ~4GB | Acceptable | Fast |
| Q3_K_M | 3-bit | ~3GB | Some degradation | Very fast |
| Q2_K | 2-bit | ~2.5GB | Significant degradation | Very fast |
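The sizes in the table follow directly from parameter count times bits per weight. The sketch below shows the arithmetic; it is a deliberate simplification that ignores the extra space real GGUF files spend on higher-precision embedding tensors and K-quant scale factors, and the effective bits-per-weight figures used here are approximations.

```python
def approx_model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters * bits per weight, in gigabytes.

    Real GGUF files run slightly larger, since some tensors (embeddings,
    quantization scale factors) are stored at higher precision.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate effective bits/weight for a 7B model at each level
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{approx_model_size_gb(7, bits):.1f} GB")
```

Running this reproduces the table's ballpark figures (FP16 ≈ 14 GB, Q4_K_M ≈ 4 GB for a 7B model).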
```shell
# Build llama.cpp (CPU only)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# Build with CUDA GPU support
make -j$(nproc) GGML_CUDA=1

# Download GGUF model (from Hugging Face)
# e.g., Q4_K_M version from TheBloke/Llama-2-7B-Chat-GGUF

# Run (text generation)
# -n: max tokens to generate, -t: number of CPU threads
./llama-cli \
  -m models/llama-3.2-3b-q4_k_m.gguf \
  -p "What is a decorator in Python?" \
  -n 256 \
  -t 8 \
  --temp 0.7
# Output: A decorator in Python is a function that wraps another function...

# Run API server
./llama-server \
  -m models/llama-3.2-3b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080
# OpenAI-compatible API runs at http://localhost:8080
```
## vLLM: Production Serving
vLLM is an engine optimized for high-throughput LLM serving. Its PagedAttention algorithm efficiently manages GPU memory, delivering excellent concurrent request processing performance.
```shell
# Install vLLM (requires CUDA GPU)
pip install vllm

# Run API server (OpenAI compatible)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```
```python
# Request to vLLM server (using OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

# Single request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "3 REST API design principles"}],
    max_tokens=300
)
print(response.choices[0].message.content)
```

```python
# Batch processing (vLLM's strength -- high throughput)
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

async def batch_inference(prompts: list[str]):
    """Processes multiple prompts concurrently."""
    tasks = [
        async_client.chat.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            messages=[{"role": "user", "content": p}],
            max_tokens=200
        )
        for p in prompts
    ]
    # Send requests concurrently -- vLLM batches internally
    results = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in results]

prompts = ["What is Python?", "What is JavaScript?", "What is Rust?"]
answers = asyncio.run(batch_inference(prompts))
for p, a in zip(prompts, answers):
    print(f"Q: {p}\nA: {a}\n")
```
## Model Selection Guide by Hardware
| Hardware | RAM/VRAM | Recommended Model | Tool |
|---|---|---|---|
| Laptop (8GB RAM) | 8GB | Llama 3.2 1B (Q4) | Ollama |
| Desktop (16GB RAM) | 16GB | Llama 3.2 3B (Q5) | Ollama |
| Desktop (32GB RAM) | 32GB | Llama 3.1 8B (Q5) | Ollama, llama.cpp |
| RTX 3060 (12GB) | 12GB VRAM | Llama 3.1 8B (Q4) | Ollama (GPU) |
| RTX 4090 (24GB) | 24GB VRAM | Llama 3.1 8B (FP16) | vLLM |
| A100 (80GB) | 80GB VRAM | Llama 3.1 70B (Q4/AWQ) | vLLM |
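The pairings above follow a simple rule of thumb: the quantized weights, plus a couple of gigabytes of headroom for the KV cache and runtime, must fit in RAM or VRAM. A small helper to apply that rule (the function name and the 2 GB headroom default are my own illustrative choices, not from any of these tools):

```python
def fits_in_memory(n_params_billion: float, bits_per_weight: float,
                   memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if quantized weights plus headroom fit in the memory budget."""
    weights_gb = n_params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 bytes
    return weights_gb + headroom_gb <= memory_gb

# Llama 3.1 8B at Q4 (~4 bits/weight) on a 12 GB GPU: 4 + 2 GB -> fits
print(fits_in_memory(8, 4, 12))    # True
# Llama 3.1 70B at Q4 on a 24 GB GPU: 35 GB of weights alone -> does not fit
print(fits_in_memory(70, 4, 24))   # False
```

The same arithmetic explains why a 70B model needs 4-bit quantization even on an 80 GB A100: at FP16 the weights alone would be about 140 GB.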
## Practical Tips
- Start with Ollama: You can have an LLM running in 5 minutes. Switch to llama.cpp or vLLM later as needed.
- Use Q4_K_M quantization as the default: It offers the best balance of quality and size. If quality isn't sufficient, step up to Q5_K_M.
- Watch your context length: Local models have limited context lengths. Adjust with the `num_ctx` setting, but note that longer contexts increase memory usage.
- Always use your GPU if you have one: It's 5-20x faster than CPU. Ollama auto-detects GPUs; llama.cpp needs to be built with `GGML_CUDA=1`.
- Quantization quality matters more than model size: An 8B Q5 often produces better results than a 70B Q2.
- Look for GGUF format on Hugging Face: Users like `TheBloke` and `bartowski` upload various quantized versions.
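The context-length tip above can be made concrete: the KV cache grows linearly with context, at 2 tensors (K and V) x layers x KV heads x head dimension x context length x bytes per element. A sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128), assuming an FP16 cache:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: K and V tensors per layer, per token position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 1e9

# Llama 3.1 8B-style architecture: 32 layers, 8 KV heads, head_dim 128
for ctx in (4096, 32768, 131072):
    print(f"ctx={ctx}: ~{kv_cache_gb(32, 8, 128, ctx):.2f} GB")
```

At 4K context the cache is about half a gigabyte, but at the full 128K it exceeds 17 GB, which is why a model that loads fine can still run out of memory when you raise `num_ctx`.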