Day 24: Local Model Serving
Running LLMs on your own computer instead of relying on cloud APIs enables cost savings, data privacy, and offline usage. Today we compare four representative tools for local serving: Ollama, vLLM, llama.cpp, and Text Generation WebUI.
Ollama: The Easiest Local LLM
Ollama works like Docker — you pull and run models, so you can start running LLMs immediately without configuration.
# After installing Ollama, download models (from the terminal)
# ollama pull llama3.1:8b
# ollama pull gemma2:9b
# ollama run llama3.1:8b # Start chatting
# Call Ollama API from Python
import requests

def ollama_generate(prompt, model="llama3.1:8b"):
    """Generate text via Ollama REST API"""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "num_predict": 200,
            },
        },
    )
    result = response.json()
    return result["response"]
answer = ollama_generate("Tell me 3 advantages of Python.")
print(answer)
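With "stream": True instead, Ollama returns newline-delimited JSON: each line carries a "response" fragment, and the final line has "done": true. A small parser for that stream format (a sketch; the sample lines below are illustrative, and with requests you would pass stream=True and iterate response.iter_lines()):

```python
import json

def parse_ollama_stream(lines):
    """Join the "response" fragments from Ollama's newline-delimited
    JSON stream into the full generated text."""
    text = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Example chunks in the shape Ollama emits with "stream": True
sample = [
    '{"response": "Py", "done": false}',
    '{"response": "thon", "done": false}',
    '{"response": "", "done": true}',
]
print(parse_ollama_stream(sample))  # → Python
```

Streaming lets you display tokens as they arrive rather than waiting for the whole completion.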
# OpenAI-compatible API (use existing code as-is)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
vLLM: High-Performance GPU Serving
vLLM is a serving engine that maximizes throughput using PagedAttention technology. It excels when handling requests from multiple users simultaneously.
# pip install vllm
# Start vLLM server from terminal
# vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Python client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Please answer in English."},
        {"role": "user", "content": "Explain the difference between machine learning and deep learning."},
    ],
    temperature=0.7,
    max_tokens=300,
)
print(response.choices[0].message.content)
# vLLM advantages: continuous batching, PagedAttention, high throughput
# Most recommended for production serving if you have a GPU
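Continuous batching only pays off under concurrency, so to see vLLM's advantage you need to fire requests in parallel and measure aggregate throughput. A minimal benchmark harness (a sketch; the request function is injected, so the timing logic runs without a server — wire it to the OpenAI client above to benchmark a real vLLM endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(request_fn, prompts, max_workers=8):
    """Send prompts concurrently and return (results, elapsed_seconds,
    requests_per_second). request_fn takes a prompt and returns a string."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(request_fn, prompts))
    elapsed = time.perf_counter() - start
    return results, elapsed, len(prompts) / elapsed

# Dummy request function so the harness runs standalone; replace with a
# call to client.chat.completions.create(...) against the vLLM server.
results, elapsed, rps = benchmark(lambda p: p.upper(), ["a", "b", "c", "d"])
print(f"{len(results)} requests in {elapsed:.3f}s ({rps:.1f} req/s)")
```

Against a real server, compare max_workers=1 with max_workers=8 or more: with continuous batching, total throughput should rise noticeably with concurrency.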
llama.cpp: Lightweight CPU-Based Serving
llama.cpp runs LLMs on CPU alone, with no GPU required, using quantized models in the GGUF format.
# Install llama-cpp-python (CPU version)
# pip install llama-cpp-python
# GPU-accelerated version (CUDA)
# CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
from llama_cpp import Llama
# Load GGUF model
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,       # Context length
    n_threads=8,      # CPU thread count
    n_gpu_layers=0,   # GPU layers (0 means CPU only)
)

output = llm(
    "What is Python?",
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)
print(output["choices"][0]["text"])
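A raw completion call like the one above takes plain text, so an instruct model answers best when the prompt follows its chat template. A sketch of applying the Llama 3 Instruct template by hand (the special tokens below match Meta's published Llama 3 format, but verify against your model card; llama-cpp-python's create_chat_completion can also apply the template from GGUF metadata for you):

```python
def format_llama3_chat(messages):
    """Render a list of {"role", "content"} messages into the
    Llama 3 Instruct prompt format (header and end-of-turn tokens)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Leave the assistant header open so the model continues from here
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

Pass the result as the prompt to llm(...), or skip manual templating entirely and call llm.create_chat_completion(messages=...).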
Serving Tool Comparison Table
comparison = {
    "Tool":         ["Ollama",        "vLLM",          "llama.cpp",     "Text Gen WebUI"],
    "Difficulty":   ["Very easy",     "Medium",        "Medium",        "Easy"],
    "GPU required": ["No",            "Yes",           "No",            "No"],
    "CPU support":  ["Yes",           "No",            "Yes (primary)", "Yes"],
    "Throughput":   ["Medium",        "Very high",     "Low",           "Medium"],
    "API compat":   ["OpenAI compat", "OpenAI compat", "Custom API",    "OpenAI compat"],
    "Recommended":  ["Personal use",  "Production",    "CPU server",    "Web UI experimentation"],
}

for key, vals in comparison.items():
    print(f"{key:14} | {' | '.join(vals)}")
# Selection guide:
# - Want to get started quickly -> Ollama
# - Have a GPU and need production serving -> vLLM
# - No GPU, CPU only -> llama.cpp
# - Want a convenient web UI -> Text Generation WebUI
Today’s Exercises
- Install Ollama and download 3 or more models, then compare response quality and speed for the same question.
- Write a simple chatbot script using Ollama’s OpenAI-compatible API. Implement multi-turn conversation that maintains conversation history.
- When loading a GGUF model with llama.cpp, change n_gpu_layers to 0, 10, and the full layer count, measure inference speed (tokens/sec) for each, and compare.
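As a starting point for the second exercise: the core of multi-turn chat is simply keeping the messages list across turns and replaying the whole history on every request. A sketch (the respond function is injected so the history logic runs standalone; swap in the Ollama OpenAI-compatible client shown earlier):

```python
class Chatbot:
    """Keep conversation history and replay it on every turn."""

    def __init__(self, respond, system="You are a helpful assistant."):
        self.respond = respond  # takes a messages list, returns reply text
        self.messages = [{"role": "system", "content": system}]

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        reply = self.respond(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Dummy respond function; replace with a call like
# client.chat.completions.create(model="llama3.1:8b", messages=messages)
bot = Chatbot(lambda msgs: f"(echo) {msgs[-1]['content']}")
print(bot.chat("Hello"))
print(bot.chat("What did I just say?"))
print(len(bot.messages))  # system + 2 user + 2 assistant = 5
```

Because the full history is resent each turn, long conversations eventually exceed the model's context window; truncating or summarizing old turns is a natural extension of the exercise.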