LLM 25-Day Course - Day 9: Meta Llama Series


Meta's (formerly Facebook) Llama changed the landscape of open-source LLMs. Anyone can freely download, research, modify, and even use it commercially, subject to the terms of the community license.

Llama Model History

The table below is a summary to help understand the major version timeline. Always check the model card and official documentation first for the latest detailed specs (context/license/recommended usage).

| Model | Size | Training Tokens | Context | License |
| --- | --- | --- | --- | --- |
| Llama 2 | 7B / 13B / 70B | 2T | 4K | Llama 2 Community |
| Llama 3 | 8B / 70B | 15T | 8K | Llama 3 Community |
| Llama 3.1 | 8B / 70B / 405B | 15T+ | 128K | Llama 3.1 Community |
| Llama 3.2 | 1B / 3B / 11B / 90B | — | 128K | Llama 3.2 Community |

Key point: Llama 3 was trained on roughly 7.5 times more tokens than Llama 2 (15T vs. 2T). This token budget goes far beyond the Chinchilla compute-optimal ratio (roughly 20 tokens per parameter), a deliberate strategy to boost the performance of smaller models.
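To put that in numbers, here is a quick back-of-the-envelope check. The ~20 tokens-per-parameter rule of thumb comes from the Chinchilla paper; the exact multiplier below is illustrative, not an official figure:

```python
# Rough Chinchilla comparison: compute-optimal tokens ~= 20 x parameter count
CHINCHILLA_TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(params: float) -> float:
    """Compute-optimal training tokens for a model with `params` parameters."""
    return CHINCHILLA_TOKENS_PER_PARAM * params

llama3_8b_params = 8e9
llama3_tokens = 15e12  # 15T tokens, from the table above

optimal = chinchilla_optimal_tokens(llama3_8b_params)  # 160e9 = 0.16T
ratio = llama3_tokens / optimal

print(f"Chinchilla-optimal for 8B: {optimal / 1e12:.2f}T tokens")
print(f"Llama 3 8B used roughly {ratio:.0f}x the Chinchilla-optimal budget")
```

In other words, the 8B model was trained well past the point that is compute-optimal for *training*, which is a good trade when the goal is the strongest possible small model for *inference*.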

Running Locally with Ollama

Ollama is one of the easiest tools for running LLMs locally: a single command handles downloading, quantizing, and serving a model.

# Step 1: Install Ollama (https://ollama.ai)
# Step 2: Download models from the terminal
#   ollama pull llama3.1:8b
#   ollama pull llama3.1:70b  (requires 40GB+ VRAM)

# Step 3: Use in Python
# pip install ollama
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "Please answer in English."},
        {"role": "user", "content": "Explain the quicksort algorithm."},
    ],
)

print(response["message"]["content"])

Downloading from HuggingFace

# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Gated model: accept the license on the model page, then authenticate with a
# HuggingFace token (https://huggingface.co/settings/tokens)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # Memory saving (16-bit)
    device_map="auto",           # Automatic GPU placement
)

# Llama 3 chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursive functions."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts a reply
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7, do_sample=True)
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Ollama REST API Usage

import requests
import json

# Ollama runs on localhost:11434 by default
def query_ollama(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.7,
                "num_predict": 256,
            },
        },
    )
    return response.json()["response"]

# OpenAI-compatible API is also supported
def query_ollama_openai_compat(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    return response.json()["choices"][0]["message"]["content"]

result = query_ollama("What is the difference between lists and tuples in Python?")
print(result)
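The examples above wait for the full reply. With "stream": true, /api/generate instead returns newline-delimited JSON chunks, each carrying a "response" fragment and a "done" flag. A sketch of handling that stream (the helper names are mine; the chunk fields follow the Ollama API):

```python
import json

def parse_ollama_stream(lines):
    """Concatenate the "response" fragments from Ollama's NDJSON stream."""
    parts = []
    for line in lines:
        if not line:  # keep-alive blank lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_ollama(prompt, model="llama3.1:8b"):
    import requests  # local import: the parser above stays dependency-free
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        return parse_ollama_stream(response.iter_lines())

# Usage (requires a running Ollama server):
#   print(stream_ollama("Briefly explain quicksort."))
```

Streaming matters for interactive use: the first tokens appear immediately instead of after the full 256-token generation finishes.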

Hardware Requirements for Local Execution

| Model Size | VRAM Required (FP16) | VRAM Required (4-bit) | Recommended GPU |
| --- | --- | --- | --- |
| 1B~3B | 2~6 GB | 1~2 GB | Integrated GPU possible |
| 8B | 16 GB | 5 GB | RTX 3060 or higher |
| 70B | 140 GB | 35~40 GB | A100 80GB or multi-GPU |
| 405B | 810 GB | 200+ GB | Cluster required |

Using 4-bit quantization cuts VRAM requirements to roughly a quarter of FP16, since each weight shrinks from 16 bits to 4 bits (plus some quantization overhead).
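The FP16 column in the table follows a simple rule of thumb: weight memory is parameter count times bits per weight, divided by 8. KV cache, activations, and framework overhead come on top, which is why real-world requirements (like the 5 GB figure for 4-bit 8B) run a bit higher:

```python
def weight_vram_gb(params_billion: float, bits: int = 16) -> float:
    """Approximate VRAM (GB) for model weights alone: params x bits / 8.
    KV cache, activations, and runtime overhead are extra."""
    return params_billion * bits / 8

# Reproduces the FP16 column above:
print(weight_vram_gb(8))    # -> 16.0
print(weight_vram_gb(70))   # -> 140.0
print(weight_vram_gb(405))  # -> 810.0
# And roughly the 4-bit column (quantization metadata adds a little more):
print(weight_vram_gb(8, bits=4))  # -> 4.0
```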

Today’s Exercises

  1. Install Ollama and download the llama3.1:8b model to have a conversation. Record the response speed and language quality.
  2. Send the same question to both llama3.1:8b and a commercial lightweight API model (latest), then compare response quality. What are the pros and cons of local models?
  3. Read Llama 3’s license and summarize what restrictions exist for commercial use. (The 700 million monthly active users clause)
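For exercise 2, a small harness makes the latency side of the comparison repeatable. This is a sketch (the function name is mine); pass in any callable that takes a prompt and returns a string, such as the query_ollama function defined earlier or your own wrapper around a commercial API:

```python
import time

def time_model(ask, prompt, runs=3):
    """Call `ask(prompt)` `runs` times; return (mean latency in seconds, last answer)."""
    latencies, answer = [], ""
    for _ in range(runs):
        start = time.perf_counter()
        answer = ask(prompt)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies), answer

# Usage sketch (assumes query_ollama from earlier and a commercial-API wrapper of your own):
#   local_s, local_answer = time_model(query_ollama, "Explain quicksort.")
#   api_s, api_answer = time_model(my_commercial_api, "Explain quicksort.")
```

Judging answer quality is still manual, but measured latency plus side-by-side answers gives you concrete material for the pros-and-cons writeup.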
