Local LLM Execution Guide -- Ollama, llama.cpp, vLLM

Why You Need a Local LLM

Cloud APIs are convenient, but they come with constraints around cost, privacy, and internet dependency. A local LLM means running an AI model directly on your own computer. It’s like having a coffee machine at home — no need to visit a cafe, you can make the coffee you want whenever you want.

Key advantages of local LLMs:

| Advantage | Description |
| --- | --- |
| Cost savings | No API call costs (only electricity) |
| Privacy | Data never leaves your machine |
| Offline use | Works without internet |
| Customization | Free to fine-tune and customize prompts |
| Latency | No network delay |

Tool-by-Tool Comparison

| Feature | Ollama | llama.cpp | vLLM |
| --- | --- | --- | --- |
| Difficulty | Very easy | Medium | Medium |
| Installation | One-click install | Requires compilation | pip install |
| GPU required | No | No | Recommended (CUDA) |
| API server | Built-in (OpenAI compatible) | Separate setup | Built-in |
| Quantization | GGUF | GGUF (core feature) | AWQ, GPTQ |
| Batch processing | Limited | Limited | Excellent (PagedAttention) |
| Best for | Personal dev, experiments | Embedded, optimization | Production serving |

Ollama: The Easiest Way to Start

Ollama manages LLMs with pull and run commands, just like Docker. It supports macOS, Linux, and Windows.

Installation and Basic Usage

# macOS/Linux installation
curl -fsSL https://ollama.ai/install.sh | sh

# Download and run model
ollama pull llama3.2        # Download model (3B, ~2GB)
ollama run llama3.2         # Run in chat mode

# List models
ollama list
# NAME              SIZE      MODIFIED
# llama3.2:latest   2.0 GB    2 minutes ago
# gemma2:2b         1.6 GB    1 hour ago

# Delete model
ollama rm gemma2:2b

Using as an API Server

Ollama provides an OpenAI-compatible API by default. You can use existing OpenAI SDK code with almost no modifications.

# pip install openai
from openai import OpenAI

# Connect to Ollama local server (default port: 11434)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require auth, any value works
)

# ChatCompletion call (same interface as OpenAI API)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer in English."},
        {"role": "user", "content": "Explain Python list comprehensions."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
# Output: List comprehension is a concise Python syntax for creating
#          new lists based on existing lists...

# Streaming response
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
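When you need the full text after streaming (for logging, caching, etc.), the deltas have to be accumulated, and the final chunk's `delta.content` is `None`. A small sketch of that pattern, demoed here with stub objects shaped like the SDK's chunks rather than a live Ollama server:

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Accumulate streamed delta chunks into the full response text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk carries delta.content = None
            parts.append(delta)
    return "".join(parts)

# Stub chunks mimicking the OpenAI SDK's streaming objects:
def stub(text):
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

print(collect_stream([stub("Hel"), stub("lo"), stub(None)]))  # Hello
```

In real use, pass the `stream` object from `client.chat.completions.create(..., stream=True)` directly to `collect_stream`.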

Custom Models with Modelfile

# Modelfile -- Custom model definition
FROM llama3.2

# Set system prompt
SYSTEM """
You are a Python coding expert.
Always include comments in your code.
Provide concise and practical answers.
"""

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Build and run custom model
ollama create python-expert -f Modelfile
ollama run python-expert
# >>> Write a Fibonacci function
# def fibonacci(n):
#     """Returns the nth Fibonacci number."""
#     ...

llama.cpp: The Ultimate in Optimization

llama.cpp is an LLM inference engine written in C/C++ that delivers optimal performance even in CPU-only environments. Through quantization, model sizes can be dramatically reduced.

Quantization Level Comparison

| Quantization | Bits | Model Size (7B) | Quality | Speed |
| --- | --- | --- | --- | --- |
| FP16 | 16-bit | ~14GB | Best | Slow |
| Q8_0 | 8-bit | ~7GB | Very good | Medium |
| Q5_K_M | 5-bit | ~5GB | Good | Fast |
| Q4_K_M | 4-bit | ~4GB | Acceptable | Fast |
| Q3_K_M | 3-bit | ~3GB | Some degradation | Very fast |
| Q2_K | 2-bit | ~2.5GB | Significant degradation | Very fast |
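The sizes in the table follow directly from parameter count times bits per weight. A rough back-of-envelope sketch (real K-quant files run slightly larger because some tensors stay at higher precision):

```python
def est_size_gb(params_billion: float, bits: float) -> float:
    """Rough GGUF file size: parameters x bits per weight, in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at common quantization levels:
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name}: ~{est_size_gb(7, bits):.1f} GB")
# FP16: ~14.0 GB, Q8_0: ~7.0 GB, Q4_K_M: ~3.5 GB
```

The same arithmetic is a quick sanity check before downloading: a quantized file much larger than this estimate is probably the wrong variant.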
# Build llama.cpp (CPU only)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# Build with CUDA GPU support
make -j$(nproc) GGML_CUDA=1

# Download GGUF model (from Hugging Face)
# e.g., Q4_K_M version from TheBloke/Llama-2-7B-Chat-GGUF

# Run (text generation)
./llama-cli \
    -m models/llama-3.2-3b-q4_k_m.gguf \
    -p "What is a decorator in Python?" \
    -n 256 \
    -t 8 \
    --temp 0.7
# -n: max tokens to generate, -t: number of CPU threads
# Output: A decorator in Python is a function that wraps another function...

# Run API server
./llama-server \
    -m models/llama-3.2-3b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080
# OpenAI-compatible API runs at http://localhost:8080

vLLM: Production Serving

vLLM is an engine optimized for high-throughput LLM serving. Its PagedAttention algorithm efficiently manages GPU memory, delivering excellent concurrent request processing performance.
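The reason KV-cache management matters: every generated token stores a key and a value vector per layer, and this cache, not the weights, is what limits concurrency. A rough per-token estimate, using illustrative Llama-3.2-3B-style config values (28 layers, 8 KV heads, head dim 128, FP16) that should be treated as assumptions, not exact figures:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_cache_bytes_per_token(layers=28, kv_heads=8, head_dim=128)
print(per_tok)                  # 114688 bytes, ~0.11 MB per token
print(per_tok * 4096 / 2**20)   # ~448 MB for one full 4096-token context
```

With dozens of concurrent requests, naively pre-allocating a full context per request wastes gigabytes; PagedAttention instead allocates the cache in small blocks on demand, which is why vLLM sustains far more parallel requests on the same GPU.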

# Install vLLM (requires CUDA GPU)
pip install vllm

# Run API server (OpenAI compatible)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

# Request to vLLM server (using OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

# Single request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "3 REST API design principles"}],
    max_tokens=300
)
print(response.choices[0].message.content)

# Batch processing (vLLM's strength -- high throughput)
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-placeholder"
)

async def batch_inference(prompts: list[str]):
    """Processes multiple prompts concurrently."""
    tasks = [
        async_client.chat.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            messages=[{"role": "user", "content": p}],
            max_tokens=200
        )
        for p in prompts
    ]
    # Send requests concurrently -- vLLM batches internally
    results = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in results]

prompts = ["What is Python?", "What is JavaScript?", "What is Rust?"]
answers = asyncio.run(batch_inference(prompts))
for p, a in zip(prompts, answers):
    print(f"Q: {p}\nA: {a}\n")
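With hundreds of prompts, firing every request at once can exhaust client-side connections even though vLLM queues work gracefully. A small generic helper (plain asyncio, nothing vLLM-specific) caps the number of in-flight requests; `fake_request` below is a stand-in for the `chat.completions.create` calls above:

```python
import asyncio

async def gather_with_limit(coros, limit: int = 32):
    """Run awaitables concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def _run(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(_run(c) for c in coros))

# Demo with stub coroutines (swap in real API calls in practice):
async def fake_request(i):
    await asyncio.sleep(0.01)
    return f"answer-{i}"

results = asyncio.run(
    gather_with_limit([fake_request(i) for i in range(10)], limit=3)
)
print(results[0], results[-1])  # answer-0 answer-9
```

A limit of 32-64 is a reasonable starting point; raise it until throughput stops improving or the server's queue latency grows.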

Model Selection Guide by Hardware

| Hardware | RAM/VRAM | Recommended Model | Tool |
| --- | --- | --- | --- |
| Laptop (8GB RAM) | 8GB | Llama 3.2 1B (Q4) | Ollama |
| Desktop (16GB RAM) | 16GB | Llama 3.2 3B (Q5) | Ollama |
| Desktop (32GB RAM) | 32GB | Llama 3.1 8B (Q5) | Ollama, llama.cpp |
| RTX 3060 (12GB) | 12GB VRAM | Llama 3.1 8B (Q4) | Ollama (GPU) |
| RTX 4090 (24GB) | 24GB VRAM | Llama 3.1 70B (Q4, partial CPU offload) | llama.cpp |
| A100 (80GB) | 80GB VRAM | Llama 3.1 70B (FP8/AWQ) | vLLM |

Practical Tips

  • Start with Ollama: You can have an LLM running in 5 minutes. Switch to llama.cpp or vLLM later as needed.
  • Use Q4_K_M quantization as the default: It offers the best balance of quality and size. If quality isn’t sufficient, step up to Q5_K_M.
  • Watch your context length: Local models have limited context lengths. Adjust with the num_ctx setting, but note that longer contexts increase memory usage.
  • Always use your GPU if you have one: It’s 5-20x faster than CPU. Ollama auto-detects GPUs; llama.cpp needs to be built with GGML_CUDA=1.
  • Quantization quality matters more than model size: An 8B Q5 often produces better results than a 70B Q2.
  • Look for GGUF format on Hugging Face: Users like TheBloke and bartowski upload various quantized versions.
