# Day 9: Meta Llama Series
Meta's (formerly Facebook) Llama series changed the landscape of open LLMs: anyone can download the weights, study and modify them, and, within the license terms, use them commercially.
## Llama Model History

The table below summarizes the major version timeline. Always check the model card and official documentation first for the latest detailed specs (context length, license, recommended usage).
| Model | Size | Training Tokens | Context | License |
|---|---|---|---|---|
| Llama 2 | 7B / 13B / 70B | 2T | 4K | Llama 2 Community |
| Llama 3 | 8B / 70B | 15T | 8K | Llama 3 Community |
| Llama 3.1 | 8B / 70B / 405B | 15T+ | 128K | Llama 3.1 Community |
| Llama 3.2 | 1B / 3B / 11B / 90B | — | 128K | Llama 3.2 Community |
Key point: Llama 3 was trained on roughly 7.5 times more tokens than Llama 2 (15T vs. 2T). Training far beyond the Chinchilla-optimal token budget was a deliberate strategy to boost the performance of smaller models at inference time.
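To put that claim in numbers, here is a back-of-the-envelope sketch using the roughly 20-tokens-per-parameter rule of thumb from the Chinchilla paper (an approximation, not an exact law; the function name is mine):

```python
# Rough sketch: Chinchilla suggests ~20 training tokens per parameter
# as compute-optimal (an approximation, not an exact law).
def chinchilla_optimal_tokens_b(params_b):
    """Compute-optimal training tokens (in billions) for params_b billion parameters."""
    return 20 * params_b

# Llama 3 8B: compute-optimal would be ~160B tokens,
# yet it was actually trained on ~15,000B (15T) tokens.
optimal = chinchilla_optimal_tokens_b(8)  # 160
ratio = 15_000 / optimal                  # ~94x beyond compute-optimal
print(optimal, round(ratio, 1))
```

Overshooting the compute-optimal point costs more at training time but yields a small model that punches above its weight at inference, which is exactly what you want for local deployment.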
## Running Locally with Ollama
Ollama is one of the easiest ways to run LLMs locally.
```python
# Step 1: Install Ollama (https://ollama.ai)
# Step 2: Download models from the terminal:
#   ollama pull llama3.1:8b
#   ollama pull llama3.1:70b   # requires 40 GB+ VRAM
# Step 3: Use in Python
# pip install ollama
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "Please answer in English."},
        {"role": "user", "content": "Explain the quicksort algorithm."},
    ],
)
print(response["message"]["content"])
```
## Downloading from Hugging Face
```python
# pip install transformers torch accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Hugging Face token required (https://huggingface.co/settings/tokens)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 16-bit weights to halve memory use
    device_map="auto",          # automatic GPU placement
)

# Llama 3 chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursive functions."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model replies
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
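Under the hood, `apply_chat_template` expands the message list into Llama 3's special-token prompt format. The hand-rolled sketch below illustrates what that expansion looks like (for illustration only — in practice always use the tokenizer's built-in template; the function name is mine):

```python
def llama3_prompt(messages):
    # Sketch of the Llama 3 chat template: each turn is wrapped in
    # header tokens and terminated with <|eot_id|>.
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    # add_generation_prompt=True corresponds to this trailing assistant header,
    # which cues the model to produce the assistant's turn next
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(llama3_prompt([{"role": "user", "content": "Hi"}]))
```

Seeing the raw format makes it clear why forgetting `add_generation_prompt=True` hurts: without the trailing assistant header, the model has no cue that it is its turn to speak.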
## Ollama REST API Usage
```python
import requests

# Ollama serves a REST API on localhost:11434 by default
def query_ollama(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "num_predict": 256,
            },
        },
    )
    return response.json()["response"]

# An OpenAI-compatible API is also supported
def query_ollama_openai_compat(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    return response.json()["choices"][0]["message"]["content"]

result = query_ollama("What is the difference between lists and tuples in Python?")
print(result)
```
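For long answers you may prefer streaming. With `"stream": true`, `/api/generate` returns newline-delimited JSON, one chunk per line, each carrying a `response` fragment until `done` is true. A sketch of a streaming helper (the function names here are my own, not part of Ollama):

```python
import json

def join_stream_chunks(lines):
    # Each NDJSON line carries a "response" fragment; "done": true marks the end
    parts = []
    for raw in lines:
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_ollama(prompt, model="llama3.1:8b"):
    import requests  # imported here; only needed for the live call
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as r:
        # In a real UI you would print each fragment as it arrives;
        # here we simply join them into the full answer
        return join_stream_chunks(r.iter_lines())
```

Keeping the NDJSON parsing in its own function lets you test it with synthetic chunks without a running server.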
## Hardware Requirements for Local Execution
| Model Size | VRAM Required (FP16) | VRAM Required (4bit) | Recommended GPU |
|---|---|---|---|
| 1B~3B | 2~6 GB | 1~2 GB | Integrated GPU possible |
| 8B | 16 GB | 5 GB | RTX 3060 or higher |
| 70B | 140 GB | 35~40 GB | A100 80GB or multi-GPU |
| 405B | 810 GB | 200+ GB | Cluster required |
Using 4-bit quantization reduces VRAM requirements to roughly a quarter of what FP16 needs.
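The figures in the table follow from simple arithmetic: the weights alone take parameters × bytes-per-parameter (2 bytes at FP16, about 0.5 bytes at 4-bit), and the table adds headroom for activations and the KV cache on top. A rough estimator (my own helper, for illustration):

```python
def weight_vram_gb(params_b, bits=16):
    """Approximate VRAM (GB) for model weights only — no activations or KV cache."""
    bytes_per_param = bits / 8   # FP16 -> 2 bytes, 4-bit -> 0.5 bytes
    return params_b * bytes_per_param  # 1B params at 1 byte/param ~ 1 GB

# 8B model: ~16 GB at FP16, ~4 GB at 4-bit (the table adds overhead on top)
print(weight_vram_gb(8, 16), weight_vram_gb(8, 4))
```

This explains, for example, why the 70B row lists 140 GB at FP16 (70 × 2 bytes) and 35–40 GB at 4-bit (70 × 0.5 bytes plus overhead).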
## Today's Exercises

- Install Ollama and download the `llama3.1:8b` model to have a conversation. Record the response speed and language quality.
- Send the same question to both `llama3.1:8b` and a commercial lightweight API model (latest), then compare response quality. What are the pros and cons of local models?
- Read Llama 3's license and summarize what restrictions exist for commercial use. (The 700 million monthly active users clause)