Day 23: Quantization Guide
Quantization is a technique that reduces model size and memory usage by lowering the precision of model weights. It is the key technology that makes it possible to run a 70B model on consumer GPUs.
Precision Level Comparison
The number of bits used to represent each number affects both the model’s size and performance. FP32 (32-bit) is the most accurate but the largest, while INT4 (4-bit) is the smallest but has slight quality loss.
# Model size and performance comparison by precision level
precision_table = {
    "Precision": ["FP32", "FP16", "INT8", "INT4"],
    "Bits/value": ["32bit", "16bit", "8bit", "4bit"],
    "7B size": ["28GB", "14GB", "7GB", "3.5GB"],
    "13B size": ["52GB", "26GB", "13GB", "6.5GB"],
    "70B size": ["280GB", "140GB", "70GB", "35GB"],
    "Quality loss": ["None", "Nearly none", "Minimal", "Slight"],
    "Inference speed": ["Slow", "Fast", "Very fast", "Fastest"],
}
for key, values in precision_table.items():
    print(f"{key:16} | {' | '.join(values)}")
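The size columns follow directly from parameter count × bits per value ÷ 8; a quick sanity check (pure arithmetic, no model download needed):

```python
# Weight memory in GB: parameters x bits per value / 8 bits per byte.
def weight_size_gb(num_params: float, bits: int) -> float:
    return num_params * bits / 8 / 1e9

for params, label in [(7e9, "7B"), (13e9, "13B"), (70e9, "70B")]:
    row = " | ".join(
        f"{name} {weight_size_gb(params, bits):.1f}GB"
        for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
    )
    print(f"{label:4} | {row}")
```

Note this counts weights only; activations and the KV cache add to the total at inference time.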
Loading 4-bit/8-bit with bitsandbytes
The simplest approach is to let bitsandbytes quantize the model on the fly as it is loaded.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 8-bit quantization (INT8)
# Recent transformers versions expect load_in_8bit via BitsAndBytesConfig
# rather than as a from_pretrained() argument.
bnb_8bit_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_8bit_config,
    device_map="auto",
)
print(f"8-bit memory: {model_8bit.get_memory_footprint() / 1e9:.1f} GB")
# 4-bit quantization (NF4 - the method used in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization for additional savings
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"4-bit memory: {model_4bit.get_memory_footprint() / 1e9:.1f} GB")
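The extra savings from bnb_4bit_use_double_quant can be worked out from the QLoRA paper's defaults (a back-of-the-envelope sketch; the 64-value quantization blocks and 256-scale second-level blocks are the paper's settings):

```python
# Plain NF4: one FP32 scale per 64-value block
# -> 32 / 64 = 0.5 bits of overhead per parameter.
plain_overhead = 32 / 64
# Double quantization re-quantizes those scales to 8 bits and keeps
# one FP32 constant per 256 scales -> 8/64 + 32/(64*256) bits/param.
dq_overhead = 8 / 64 + 32 / (64 * 256)
saving_bits = plain_overhead - dq_overhead  # ~0.37 bits per parameter
print(f"Saving: {saving_bits:.3f} bits/param "
      f"(~{saving_bits * 7e9 / 8 / 1e9:.2f} GB on a 7B model)")
```

The saving is modest per parameter but adds up: a few hundred MB on a 7B model, several GB at 65B scale.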
GGUF Format and llama.cpp
GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp, which enables LLM inference even on a CPU. It supports a range of quantization levels, from Q2_K up to Q8_0.
# GGUF quantization using llama.cpp
# Step 1: Install and build llama.cpp (current versions use CMake;
#         older releases built with plain `make`)
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && cmake -B build && cmake --build build --config Release
# Step 2: Convert HF model to GGUF
# python convert_hf_to_gguf.py ../model_path --outtype f16 --outfile model.gguf
# Step 3: Run quantization (binary lands in build/bin with a CMake build)
# ./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M
# GGUF quantization level comparison
gguf_levels = {
    "Quantization": ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
    "Bits": ["2bit", "3bit", "4bit", "5bit", "6bit", "8bit"],
    "7B size": ["2.7GB", "3.3GB", "4.1GB", "4.8GB", "5.5GB", "7.2GB"],
    "Quality": ["Low", "Fair", "Good", "Very good", "Excellent", "Near lossless"],
    "Recommended": ["Not recommended", "Limited", "General use", "High quality", "High quality", "When memory allows"],
}
for key, vals in gguf_levels.items():
    print(f"{key:14} | {' | '.join(vals)}")
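Notice that the file sizes imply more bits per weight than the nominal figure, because the K-quants mix precisions and store per-block scales. Backing the effective bits out of the 7B sizes in the table (treating 7B as 7e9 weights):

```python
# Effective bits per weight = file size in bits / parameter count.
sizes_gb = {"Q2_K": 2.7, "Q3_K_M": 3.3, "Q4_K_M": 4.1,
            "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2}
for level, gb in sizes_gb.items():
    eff_bits = gb * 1e9 * 8 / 7e9
    print(f"{level:7} ~ {eff_bits:.2f} bits/weight")
```

So "Q4_K_M" really costs closer to 4.7 bits per weight; keep that in mind when estimating whether a model fits in RAM or VRAM.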
AWQ and GPTQ Quantization
AWQ (Activation-aware Weight Quantization) and GPTQ are post-training quantization methods that use a small calibration dataset to choose quantization parameters, minimizing quality loss.
# Using AWQ/GPTQ quantized models (loading pre-quantized models)
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load AWQ model (already quantized on the Hub)
# pip install autoawq
model_awq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-AWQ",
    device_map="auto",
)

# Load GPTQ model
# pip install auto-gptq
model_gptq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map="auto",
)
# Quantization method comparison summary
methods = {
    "Method": ["bitsandbytes", "GGUF", "AWQ", "GPTQ"],
    "When quantized": ["On-load conversion", "Pre-quantized", "Pre-quantized", "Pre-quantized"],
    "CPU support": ["No", "Yes", "No", "No"],
    "Inference speed": ["Medium", "Fast (CPU)", "Fast (GPU)", "Fast (GPU)"],
    "Fine-tuning support": ["Yes (QLoRA)", "No", "Limited", "Limited"],
}
for key, vals in methods.items():
    print(f"{key:20} | {' | '.join(vals)}")
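As a rough rule of thumb, the comparison above can be collapsed into a simple picker. This is a hypothetical helper whose heuristics just restate the table, not an official API:

```python
def pick_quantization(has_gpu: bool, need_finetuning: bool) -> str:
    """Rough method picker based on the comparison table (illustrative only)."""
    if not has_gpu:
        return "GGUF"          # llama.cpp is the main CPU inference path
    if need_finetuning:
        return "bitsandbytes"  # NF4 + QLoRA supports fine-tuning
    return "AWQ"               # pre-quantized, fast GPU inference (GPTQ is similar)

print(pick_quantization(has_gpu=False, need_finetuning=False))  # GGUF
print(pick_quantization(has_gpu=True, need_finetuning=True))    # bitsandbytes
```

In practice the choice also depends on what pre-quantized checkpoints are actually available for your model on the Hub.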
Today’s Exercises
- Load the same model in FP16, INT8, and INT4, compare actual memory usage with get_memory_footprint(), and qualitatively evaluate the output quality for the same prompt.
- Download a GGUF-format model from the Hugging Face Hub and run it with llama.cpp or Ollama to measure response speed.
- Compare the outputs of Q4_K_M and Q8_0 quantized models on 10 prompts and analyze whether there are cases where the quality difference is significant.