## Why You Need a Local LLM
Cloud APIs are convenient, but they come with constraints around cost, privacy, and internet dependency. A local LLM means running an AI model directly on your own computer. It’s like having a coffee machine at home — no need to visit a cafe, you can make the coffee you want whenever you want.
Key advantages of local LLMs:
| Advantage | Description |
|---|---|
| Cost savings | No API call costs (only electricity) |
| Privacy | Data never leaves your machine |
| Offline use | Works without internet |
| Customization | Free to fine-tune and customize prompts |
| Latency | No network delay |
## Tool-by-Tool Comparison
| Feature | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Difficulty | Very easy | Medium | Medium |
| Installation | One-click install | Requires compilation | pip install |
| GPU required | No | No | Recommended (CUDA) |
| API server | Built-in (OpenAI compatible) | Built-in (llama-server) | Built-in |
| Quantization | GGUF | GGUF (core feature) | AWQ, GPTQ |
| Batch processing | Limited | Limited | Excellent (PagedAttention) |
| Best for | Personal dev, experiments | Embedded, optimization | Production serving |
## Ollama: The Easiest Way to Start
Ollama manages LLMs with pull and run commands, just like Docker. It supports macOS, Linux, and Windows.
### Installation and Basic Usage

```shell
# macOS/Linux installation
curl -fsSL https://ollama.ai/install.sh | sh

# Download and run model
ollama pull llama3.2    # Download model (3B, ~2GB)
ollama run llama3.2     # Run in chat mode

# List models
ollama list
# NAME              SIZE     MODIFIED
# llama3.2:latest   2.0 GB   2 minutes ago
# gemma2:2b         1.6 GB   1 hour ago

# Delete model
ollama rm gemma2:2b
```
### Using as an API Server
Ollama provides an OpenAI-compatible API by default. You can use existing OpenAI SDK code with almost no modifications.
```python
# pip install openai
from openai import OpenAI

# Connect to Ollama local server (default port: 11434)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require auth, any value works
)

# ChatCompletion call (same interface as OpenAI API)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer in English."},
        {"role": "user", "content": "Explain Python list comprehensions."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
# Output: List comprehension is a concise Python syntax for creating
# new lists based on existing lists...

# Streaming response
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Custom Models with Modelfile

```
# Modelfile -- Custom model definition
FROM llama3.2

# Set system prompt
SYSTEM """
You are a Python coding expert.
Always include comments in your code.
Provide concise and practical answers.
"""

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
```

```shell
# Build and run custom model
ollama create python-expert -f Modelfile
ollama run python-expert
# >>> Write a Fibonacci function
# def fibonacci(n):
#     """Returns the nth Fibonacci number."""
#     ...
```
## llama.cpp: The Ultimate in Optimization
llama.cpp is an LLM inference engine written in C/C++ that performs well even in CPU-only environments. Through quantization, model sizes can be dramatically reduced.
### Quantization Level Comparison
| Quantization | Bits | Model Size (7B) | Quality | Speed |
|---|---|---|---|---|
| FP16 | 16-bit | ~14GB | Best | Slow |
| Q8_0 | 8-bit | ~7GB | Very good | Medium |
| Q5_K_M | 5-bit | ~5GB | Good | Fast |
| Q4_K_M | 4-bit | ~4GB | Acceptable | Fast |
| Q3_K_M | 3-bit | ~3GB | Some degradation | Very fast |
| Q2_K | 2-bit | ~2.5GB | Significant degradation | Very fast |
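The sizes in the table follow directly from parameter count times bits per weight. The sketch below shows the arithmetic; it is a deliberate simplification that ignores the extra space real GGUF files spend on higher-precision embedding tensors and K-quant scale factors, and the effective bits-per-weight figures used here are approximations.

```python
def approx_model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters * bits per weight, in gigabytes.

    Real GGUF files run slightly larger, since some tensors (embeddings,
    quantization scale factors) are stored at higher precision.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate effective bits/weight for a 7B model at each level
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{approx_model_size_gb(7, bits):.1f} GB")
```

Running this reproduces the table's ballpark figures (FP16 ≈ 14 GB, Q4_K_M ≈ 4 GB for a 7B model).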
```shell
# Build llama.cpp (CPU only)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# Build with CUDA GPU support
make -j$(nproc) GGML_CUDA=1

# Download GGUF model (from Hugging Face)
# e.g., Q4_K_M version from TheBloke/Llama-2-7B-Chat-GGUF

# Run (text generation)
# -n: max tokens to generate, -t: number of CPU threads
./llama-cli \
  -m models/llama-3.2-3b-q4_k_m.gguf \
  -p "What is a decorator in Python?" \
  -n 256 \
  -t 8 \
  --temp 0.7
# Output: A decorator in Python is a function that wraps another function...

# Run API server
./llama-server \
  -m models/llama-3.2-3b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080
# OpenAI-compatible API runs at http://localhost:8080
```
## vLLM: Production Serving
vLLM is an engine optimized for high-throughput LLM serving. Its PagedAttention algorithm efficiently manages GPU memory, delivering excellent concurrent request processing performance.
```shell
# Install vLLM (requires CUDA GPU)
pip install vllm

# Run API server (OpenAI compatible)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```
```python
# Request to vLLM server (using OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

# Single request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "3 REST API design principles"}],
    max_tokens=300
)
print(response.choices[0].message.content)
```

```python
# Batch processing (vLLM's strength -- high throughput)
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

async def batch_inference(prompts: list[str]):
    """Processes multiple prompts concurrently."""
    tasks = [
        async_client.chat.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            messages=[{"role": "user", "content": p}],
            max_tokens=200
        )
        for p in prompts
    ]
    # Send requests concurrently -- vLLM batches internally
    results = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in results]

prompts = ["What is Python?", "What is JavaScript?", "What is Rust?"]
answers = asyncio.run(batch_inference(prompts))
for p, a in zip(prompts, answers):
    print(f"Q: {p}\nA: {a}\n")
```
## Model Selection Guide by Hardware
| Hardware | RAM/VRAM | Recommended Model | Tool |
|---|---|---|---|
| Laptop (8GB RAM) | 8GB | Llama 3.2 1B (Q4) | Ollama |
| Desktop (16GB RAM) | 16GB | Llama 3.2 3B (Q5) | Ollama |
| Desktop (32GB RAM) | 32GB | Llama 3.1 8B (Q5) | Ollama, llama.cpp |
| RTX 3060 (12GB) | 12GB VRAM | Llama 3.1 8B (Q4) | Ollama (GPU) |
| RTX 4090 (24GB) | 24GB VRAM | Llama 3.1 8B (FP16) | vLLM |
| A100 (80GB) | 80GB VRAM | Llama 3.1 70B (Q4/AWQ) | vLLM |
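The pairings above follow a simple rule of thumb: the quantized weights, plus a couple of gigabytes of headroom for the KV cache and runtime, must fit in RAM or VRAM. A small helper to apply that rule (the function name and the 2 GB headroom default are my own illustrative choices, not from any of these tools):

```python
def fits_in_memory(n_params_billion: float, bits_per_weight: float,
                   memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if quantized weights plus headroom fit in the memory budget."""
    weights_gb = n_params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 bytes
    return weights_gb + headroom_gb <= memory_gb

# Llama 3.1 8B at Q4 (~4 bits/weight) on a 12 GB GPU: 4 + 2 GB -> fits
print(fits_in_memory(8, 4, 12))    # True
# Llama 3.1 70B at Q4 on a 24 GB GPU: 35 GB of weights alone -> does not fit
print(fits_in_memory(70, 4, 24))   # False
```

The same arithmetic explains why a 70B model needs 4-bit quantization even on an 80 GB A100: at FP16 the weights alone would be about 140 GB.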
## Practical Tips
- Start with Ollama: You can have an LLM running in 5 minutes. Switch to llama.cpp or vLLM later as needed.
- Use Q4_K_M quantization as the default: It offers the best balance of quality and size. If quality isn't sufficient, step up to Q5_K_M.
- Watch your context length: Local models have limited context lengths. Adjust with the `num_ctx` setting, but note that longer contexts increase memory usage.
- Always use your GPU if you have one: It's 5-20x faster than CPU. Ollama auto-detects GPUs; llama.cpp needs to be built with `GGML_CUDA=1`.
- Quantization quality matters more than model size: An 8B Q5 often produces better results than a 70B Q2.
- Look for GGUF format on Hugging Face: Users like `TheBloke` and `bartowski` upload various quantized versions.
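The context-length tip above can be made concrete: the KV cache grows linearly with context, at 2 tensors (K and V) x layers x KV heads x head dimension x context length x bytes per element. A sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128), assuming an FP16 cache:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: K and V tensors per layer, per token position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 1e9

# Llama 3.1 8B-style architecture: 32 layers, 8 KV heads, head_dim 128
for ctx in (4096, 32768, 131072):
    print(f"ctx={ctx}: ~{kv_cache_gb(32, 8, 128, ctx):.2f} GB")
```

At 4K context the cache is about half a gigabyte, but at the full 128K it exceeds 17 GB, which is why a model that loads fine can still run out of memory when you raise `num_ctx`.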