Day 23: Quantization Guide
Quantization is a technique that reduces model size and memory usage by lowering the precision of model weights. It is the key technology that makes it possible to run a 70B model on consumer GPUs.
Precision Level Comparison
The number of bits used to represent each number affects both the model’s size and performance. FP32 (32-bit) is the most accurate but the largest, while INT4 (4-bit) is the smallest but has slight quality loss.
# Model size and performance comparison by precision level
precision_table = {
    "Precision": ["FP32", "FP16", "INT8", "INT4"],
    "Bits/value": ["32bit", "16bit", "8bit", "4bit"],
    "7B size": ["28GB", "14GB", "7GB", "3.5GB"],
    "13B size": ["52GB", "26GB", "13GB", "6.5GB"],
    "70B size": ["280GB", "140GB", "70GB", "35GB"],
    "Quality loss": ["None", "Nearly none", "Minimal", "Slight"],
    "Inference speed": ["Slow", "Fast", "Very fast", "Fastest"],
}
for key, values in precision_table.items():
    print(f"{key:16} | {' | '.join(values)}")
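The size columns follow directly from parameter count × bits per value ÷ 8; a quick sanity check (pure arithmetic, no model download needed):

```python
# Weight memory in GB: parameters x bits per value / 8 bits per byte.
def weight_size_gb(num_params: float, bits: int) -> float:
    return num_params * bits / 8 / 1e9

for params, label in [(7e9, "7B"), (13e9, "13B"), (70e9, "70B")]:
    row = " | ".join(
        f"{name} {weight_size_gb(params, bits):.1f}GB"
        for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
    )
    print(f"{label:4} | {row}")
```

Note this counts weights only; activations and the KV cache add to the total at inference time.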
Loading 4-bit/8-bit with bitsandbytes
The simplest approach is to let bitsandbytes quantize the model on the fly as it is loaded.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 8-bit quantization (INT8)
# Recent transformers versions expect load_in_8bit via BitsAndBytesConfig
# rather than as a from_pretrained() argument.
bnb_8bit_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_8bit_config,
    device_map="auto",
)
print(f"8-bit memory: {model_8bit.get_memory_footprint() / 1e9:.1f} GB")
# 4-bit quantization (NF4 - the method used in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization for additional savings
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"4-bit memory: {model_4bit.get_memory_footprint() / 1e9:.1f} GB")
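The extra savings from bnb_4bit_use_double_quant can be worked out from the QLoRA paper's defaults (a back-of-the-envelope sketch; the 64-value quantization blocks and 256-scale second-level blocks are the paper's settings):

```python
# Plain NF4: one FP32 scale per 64-value block
# -> 32 / 64 = 0.5 bits of overhead per parameter.
plain_overhead = 32 / 64
# Double quantization re-quantizes those scales to 8 bits and keeps
# one FP32 constant per 256 scales -> 8/64 + 32/(64*256) bits/param.
dq_overhead = 8 / 64 + 32 / (64 * 256)
saving_bits = plain_overhead - dq_overhead  # ~0.37 bits per parameter
print(f"Saving: {saving_bits:.3f} bits/param "
      f"(~{saving_bits * 7e9 / 8 / 1e9:.2f} GB on a 7B model)")
```

The saving is modest per parameter but adds up: a few hundred MB on a 7B model, several GB at 65B scale.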
GGUF Format and llama.cpp
GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp, which enables LLM inference even on a CPU. It supports a range of quantization levels, from Q2_K up to Q8_0.
# GGUF quantization using llama.cpp
# Step 1: Install and build llama.cpp (current versions use CMake;
#         older releases built with plain `make`)
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && cmake -B build && cmake --build build --config Release
# Step 2: Convert HF model to GGUF
# python convert_hf_to_gguf.py ../model_path --outtype f16 --outfile model.gguf
# Step 3: Run quantization (binary lands in build/bin with a CMake build)
# ./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M
# GGUF quantization level comparison
gguf_levels = {
    "Quantization": ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
    "Bits": ["2bit", "3bit", "4bit", "5bit", "6bit", "8bit"],
    "7B size": ["2.7GB", "3.3GB", "4.1GB", "4.8GB", "5.5GB", "7.2GB"],
    "Quality": ["Low", "Fair", "Good", "Very good", "Excellent", "Near lossless"],
    "Recommended": ["Not recommended", "Limited", "General use", "High quality", "High quality", "When memory allows"],
}
for key, vals in gguf_levels.items():
    print(f"{key:14} | {' | '.join(vals)}")
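Notice that the file sizes imply more bits per weight than the nominal figure, because the K-quants mix precisions and store per-block scales. Backing the effective bits out of the 7B sizes in the table (treating 7B as 7e9 weights):

```python
# Effective bits per weight = file size in bits / parameter count.
sizes_gb = {"Q2_K": 2.7, "Q3_K_M": 3.3, "Q4_K_M": 4.1,
            "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2}
for level, gb in sizes_gb.items():
    eff_bits = gb * 1e9 * 8 / 7e9
    print(f"{level:7} ~ {eff_bits:.2f} bits/weight")
```

So "Q4_K_M" really costs closer to 4.7 bits per weight; keep that in mind when estimating whether a model fits in RAM or VRAM.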
AWQ and GPTQ Quantization
AWQ (Activation-aware Weight Quantization) and GPTQ are post-training quantization methods that use a small calibration dataset to choose quantization parameters, minimizing quality loss.
# Using AWQ/GPTQ quantized models (loading pre-quantized models)
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load AWQ model (already quantized on the Hub)
# pip install autoawq
model_awq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-AWQ",
    device_map="auto",
)

# Load GPTQ model
# pip install auto-gptq
model_gptq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map="auto",
)
# Quantization method comparison summary
methods = {
    "Method": ["bitsandbytes", "GGUF", "AWQ", "GPTQ"],
    "When quantized": ["On-load conversion", "Pre-quantized", "Pre-quantized", "Pre-quantized"],
    "CPU support": ["No", "Yes", "No", "No"],
    "Inference speed": ["Medium", "Fast (CPU)", "Fast (GPU)", "Fast (GPU)"],
    "Fine-tuning support": ["Yes (QLoRA)", "No", "Limited", "Limited"],
}
for key, vals in methods.items():
    print(f"{key:20} | {' | '.join(vals)}")
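As a rough rule of thumb, the comparison above can be collapsed into a simple picker. This is a hypothetical helper whose heuristics just restate the table, not an official API:

```python
def pick_quantization(has_gpu: bool, need_finetuning: bool) -> str:
    """Rough method picker based on the comparison table (illustrative only)."""
    if not has_gpu:
        return "GGUF"          # llama.cpp is the main CPU inference path
    if need_finetuning:
        return "bitsandbytes"  # NF4 + QLoRA supports fine-tuning
    return "AWQ"               # pre-quantized, fast GPU inference (GPTQ is similar)

print(pick_quantization(has_gpu=False, need_finetuning=False))  # GGUF
print(pick_quantization(has_gpu=True, need_finetuning=True))    # bitsandbytes
```

In practice the choice also depends on what pre-quantized checkpoints are actually available for your model on the Hub.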
Today’s Exercises
- Load the same model in FP16, INT8, and INT4, compare actual memory usage with get_memory_footprint(), and qualitatively evaluate the output quality for the same prompt.
- Download a GGUF-format model from the Hugging Face Hub and run it with llama.cpp or Ollama to measure response speed.
- Compare the outputs of Q4_K_M and Q8_0 quantized models on 10 prompts and analyze whether there are cases where the quality difference is significant.