Day 24: Local Model Serving
Running LLMs on your own computer instead of relying on cloud APIs enables cost savings, data privacy, and offline usage. Today we compare four representative tools for local serving: Ollama, vLLM, llama.cpp, and Text Generation WebUI.
Ollama: The Easiest Local LLM
Ollama works like Docker — you pull and run models, so you can start running LLMs immediately without configuration.
# After installing Ollama, download models (from the terminal)
# ollama pull llama3.1:8b
# ollama pull gemma2:9b
# ollama run llama3.1:8b # Start chatting
# Call Ollama API from Python
import requests

def ollama_generate(prompt, model="llama3.1:8b"):
    """Generate text via Ollama REST API"""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "top_p": 0.9,
                "num_predict": 200,
            },
        },
    )
    result = response.json()
    return result["response"]
answer = ollama_generate("Tell me 3 advantages of Python.")
print(answer)
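With "stream": True instead, Ollama returns newline-delimited JSON: each line carries a "response" fragment, and the final line has "done": true. A small parser for that stream format (a sketch; the sample lines below are illustrative, and with requests you would pass stream=True and iterate response.iter_lines()):

```python
import json

def parse_ollama_stream(lines):
    """Join the "response" fragments from Ollama's newline-delimited
    JSON stream into the full generated text."""
    text = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Example chunks in the shape Ollama emits with "stream": True
sample = [
    '{"response": "Py", "done": false}',
    '{"response": "thon", "done": false}',
    '{"response": "", "done": true}',
]
print(parse_ollama_stream(sample))  # → Python
```

Streaming lets you display tokens as they arrive rather than waiting for the whole completion.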
# OpenAI-compatible API (use existing code as-is)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
vLLM: High-Performance GPU Serving
vLLM is a serving engine that maximizes throughput using PagedAttention technology. It excels when handling requests from multiple users simultaneously.
# pip install vllm
# Start vLLM server from terminal
# vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Python client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Please answer in English."},
        {"role": "user", "content": "Explain the difference between machine learning and deep learning."},
    ],
    temperature=0.7,
    max_tokens=300,
)
print(response.choices[0].message.content)
# vLLM advantages: continuous batching, PagedAttention, high throughput
# Most recommended for production serving if you have a GPU
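Continuous batching only pays off under concurrency, so to see vLLM's advantage you need to fire requests in parallel and measure aggregate throughput. A minimal benchmark harness (a sketch; the request function is injected, so the timing logic runs without a server — wire it to the OpenAI client above to benchmark a real vLLM endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(request_fn, prompts, max_workers=8):
    """Send prompts concurrently and return (results, elapsed_seconds,
    requests_per_second). request_fn takes a prompt and returns a string."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(request_fn, prompts))
    elapsed = time.perf_counter() - start
    return results, elapsed, len(prompts) / elapsed

# Dummy request function so the harness runs standalone; replace with a
# call to client.chat.completions.create(...) against the vLLM server.
results, elapsed, rps = benchmark(lambda p: p.upper(), ["a", "b", "c", "d"])
print(f"{len(results)} requests in {elapsed:.3f}s ({rps:.1f} req/s)")
```

Against a real server, compare max_workers=1 with max_workers=8 or more: with continuous batching, total throughput should rise noticeably with concurrency.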
llama.cpp: Lightweight CPU-Based Serving
llama.cpp runs LLMs on CPU alone, with no GPU required, using quantized models in the GGUF format.
# Install llama-cpp-python (CPU version)
# pip install llama-cpp-python
# GPU-accelerated version (CUDA)
# CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
from llama_cpp import Llama
# Load GGUF model
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,       # Context length
    n_threads=8,      # CPU thread count
    n_gpu_layers=0,   # GPU layers (0 means CPU only)
)

output = llm(
    "What is Python?",
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)
print(output["choices"][0]["text"])
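A raw completion call like the one above takes plain text, so an instruct model answers best when the prompt follows its chat template. A sketch of applying the Llama 3 Instruct template by hand (the special tokens below match Meta's published Llama 3 format, but verify against your model card; llama-cpp-python's create_chat_completion can also apply the template from GGUF metadata for you):

```python
def format_llama3_chat(messages):
    """Render a list of {"role", "content"} messages into the
    Llama 3 Instruct prompt format (header and end-of-turn tokens)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Leave the assistant header open so the model continues from here
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

Pass the result as the prompt to llm(...), or skip manual templating entirely and call llm.create_chat_completion(messages=...).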
Serving Tool Comparison Table
comparison = {
    "Tool":         ["Ollama",        "vLLM",          "llama.cpp",     "Text Gen WebUI"],
    "Difficulty":   ["Very easy",     "Medium",        "Medium",        "Easy"],
    "GPU required": ["No",            "Yes",           "No",            "No"],
    "CPU support":  ["Yes",           "No",            "Yes (primary)", "Yes"],
    "Throughput":   ["Medium",        "Very high",     "Low",           "Medium"],
    "API compat":   ["OpenAI compat", "OpenAI compat", "Custom API",    "OpenAI compat"],
    "Recommended":  ["Personal use",  "Production",    "CPU server",    "Web UI experimentation"],
}

for key, vals in comparison.items():
    print(f"{key:14} | {' | '.join(vals)}")
# Selection guide:
# - Want to get started quickly -> Ollama
# - Have a GPU and need production serving -> vLLM
# - No GPU, CPU only -> llama.cpp
# - Want a convenient web UI -> Text Generation WebUI
Today’s Exercises
- Install Ollama and download 3 or more models, then compare response quality and speed for the same question.
- Write a simple chatbot script using Ollama’s OpenAI-compatible API. Implement multi-turn conversation that maintains conversation history.
- When loading a GGUF model with llama.cpp, change n_gpu_layers to 0, 10, and the full layer count, measure inference speed (tokens/sec) for each, and compare.
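As a starting point for the second exercise: the core of multi-turn chat is simply keeping the messages list across turns and replaying the whole history on every request. A sketch (the respond function is injected so the history logic runs standalone; swap in the Ollama OpenAI-compatible client shown earlier):

```python
class Chatbot:
    """Keep conversation history and replay it on every turn."""

    def __init__(self, respond, system="You are a helpful assistant."):
        self.respond = respond  # takes a messages list, returns reply text
        self.messages = [{"role": "system", "content": system}]

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        reply = self.respond(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Dummy respond function; replace with a call like
# client.chat.completions.create(model="llama3.1:8b", messages=messages)
bot = Chatbot(lambda msgs: f"(echo) {msgs[-1]['content']}")
print(bot.chat("Hello"))
print(bot.chat("What did I just say?"))
print(len(bot.messages))  # system + 2 user + 2 assistant = 5
```

Because the full history is resent each turn, long conversations eventually exceed the model's context window; truncating or summarizing old turns is a natural extension of the exercise.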