# Day 9: Meta Llama Series
Meta's (formerly Facebook) Llama series changed the landscape of open LLMs: anyone can download the weights, study and modify them, and, within the license terms, use them commercially.
## Llama Model History

The table below summarizes the major version timeline. Always check the model card and official documentation first for the latest detailed specs (context length, license, recommended usage).
| Model | Size | Training Tokens | Context | License |
|---|---|---|---|---|
| Llama 2 | 7B / 13B / 70B | 2T | 4K | Llama 2 Community |
| Llama 3 | 8B / 70B | 15T | 8K | Llama 3 Community |
| Llama 3.1 | 8B / 70B / 405B | 15T+ | 128K | Llama 3.1 Community |
| Llama 3.2 | 1B / 3B / 11B / 90B | — | 128K | Llama 3.2 Community |
Key point: Llama 3 was trained on roughly 7.5 times more tokens than Llama 2 (15T vs. 2T). Training far beyond the Chinchilla-optimal token budget was a deliberate strategy to boost the performance of smaller models at inference time.
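To put that claim in numbers, here is a back-of-the-envelope sketch using the roughly 20-tokens-per-parameter rule of thumb from the Chinchilla paper (an approximation, not an exact law; the function name is mine):

```python
# Rough sketch: Chinchilla suggests ~20 training tokens per parameter
# as compute-optimal (an approximation, not an exact law).
def chinchilla_optimal_tokens_b(params_b):
    """Compute-optimal training tokens (in billions) for params_b billion parameters."""
    return 20 * params_b

# Llama 3 8B: compute-optimal would be ~160B tokens,
# yet it was actually trained on ~15,000B (15T) tokens.
optimal = chinchilla_optimal_tokens_b(8)  # 160
ratio = 15_000 / optimal                  # ~94x beyond compute-optimal
print(optimal, round(ratio, 1))
```

Overshooting the compute-optimal point costs more at training time but yields a small model that punches above its weight at inference, which is exactly what you want for local deployment.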
## Running Locally with Ollama
Ollama is one of the easiest ways to run LLMs locally.
```python
# Step 1: Install Ollama (https://ollama.ai)
# Step 2: Download models from the terminal:
#   ollama pull llama3.1:8b
#   ollama pull llama3.1:70b   # requires 40 GB+ VRAM
# Step 3: Use in Python
# pip install ollama
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "Please answer in English."},
        {"role": "user", "content": "Explain the quicksort algorithm."},
    ],
)
print(response["message"]["content"])
```
## Downloading from Hugging Face
```python
# pip install transformers torch accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Hugging Face token required (https://huggingface.co/settings/tokens)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 16-bit weights to halve memory use
    device_map="auto",          # automatic GPU placement
)

# Llama 3 chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursive functions."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model replies
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
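Under the hood, `apply_chat_template` expands the message list into Llama 3's special-token prompt format. The hand-rolled sketch below illustrates what that expansion looks like (for illustration only — in practice always use the tokenizer's built-in template; the function name is mine):

```python
def llama3_prompt(messages):
    # Sketch of the Llama 3 chat template: each turn is wrapped in
    # header tokens and terminated with <|eot_id|>.
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    # add_generation_prompt=True corresponds to this trailing assistant header,
    # which cues the model to produce the assistant's turn next
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(llama3_prompt([{"role": "user", "content": "Hi"}]))
```

Seeing the raw format makes it clear why forgetting `add_generation_prompt=True` hurts: without the trailing assistant header, the model has no cue that it is its turn to speak.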
## Ollama REST API Usage
```python
import requests

# Ollama serves a REST API on localhost:11434 by default
def query_ollama(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "num_predict": 256,
            },
        },
    )
    return response.json()["response"]

# An OpenAI-compatible API is also supported
def query_ollama_openai_compat(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    return response.json()["choices"][0]["message"]["content"]

result = query_ollama("What is the difference between lists and tuples in Python?")
print(result)
```
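For long answers you may prefer streaming. With `"stream": true`, `/api/generate` returns newline-delimited JSON, one chunk per line, each carrying a `response` fragment until `done` is true. A sketch of a streaming helper (the function names here are my own, not part of Ollama):

```python
import json

def join_stream_chunks(lines):
    # Each NDJSON line carries a "response" fragment; "done": true marks the end
    parts = []
    for raw in lines:
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_ollama(prompt, model="llama3.1:8b"):
    import requests  # imported here; only needed for the live call
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as r:
        # In a real UI you would print each fragment as it arrives;
        # here we simply join them into the full answer
        return join_stream_chunks(r.iter_lines())
```

Keeping the NDJSON parsing in its own function lets you test it with synthetic chunks without a running server.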
## Hardware Requirements for Local Execution
| Model Size | VRAM Required (FP16) | VRAM Required (4bit) | Recommended GPU |
|---|---|---|---|
| 1B~3B | 2~6 GB | 1~2 GB | Integrated GPU possible |
| 8B | 16 GB | 5 GB | RTX 3060 or higher |
| 70B | 140 GB | 35~40 GB | A100 80GB or multi-GPU |
| 405B | 810 GB | 200+ GB | Cluster required |
Using 4-bit quantization reduces VRAM requirements to roughly a quarter of what FP16 needs.
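The figures in the table follow from simple arithmetic: the weights alone take parameters × bytes-per-parameter (2 bytes at FP16, about 0.5 bytes at 4-bit), and the table adds headroom for activations and the KV cache on top. A rough estimator (my own helper, for illustration):

```python
def weight_vram_gb(params_b, bits=16):
    """Approximate VRAM (GB) for model weights only — no activations or KV cache."""
    bytes_per_param = bits / 8   # FP16 -> 2 bytes, 4-bit -> 0.5 bytes
    return params_b * bytes_per_param  # 1B params at 1 byte/param ~ 1 GB

# 8B model: ~16 GB at FP16, ~4 GB at 4-bit (the table adds overhead on top)
print(weight_vram_gb(8, 16), weight_vram_gb(8, 4))
```

This explains, for example, why the 70B row lists 140 GB at FP16 (70 × 2 bytes) and 35–40 GB at 4-bit (70 × 0.5 bytes plus overhead).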
## Today's Exercises

- Install Ollama and download the `llama3.1:8b` model to have a conversation. Record the response speed and language quality.
- Send the same question to both `llama3.1:8b` and a commercial lightweight API model (latest), then compare response quality. What are the pros and cons of local models?
- Read Llama 3's license and summarize what restrictions exist for commercial use. (The 700 million monthly active users clause)