# Day 19: Understanding LoRA and QLoRA
LoRA is a technique that fine-tunes efficiently by adding small pairs of trainable matrices instead of modifying the weights of a large model directly. QLoRA combines this with 4-bit quantization of the base model to save even more memory.
## Core Principles of LoRA
For an original weight matrix W (d × d), full fine-tuning updates all of W. LoRA freezes W and instead trains a low-rank decomposition, replacing W with W + BA (B: d × r, A: r × d, with r much smaller than d). The smaller the rank r, the fewer trainable parameters: 2dr instead of d².
```python
import torch
import torch.nn as nn

# Understanding the core concept of LoRA with simple code
class LoRALayer(nn.Module):
    """LoRA adapter: freeze the original weights, train only the low-rank matrices."""
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.original = original_layer
        self.original.weight.requires_grad = False  # Freeze original weights

        # Low-rank matrices A, B (the trainable parameters)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scaling = alpha / rank  # Scaling factor

        # A gets a normal-distribution init, B starts at zero,
        # so the adapted model is identical to the original at step 0
        nn.init.kaiming_normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        original_output = self.original(x)
        lora_output = self.lora_B(self.lora_A(x)) * self.scaling
        return original_output + lora_output

# Parameter count comparison
d = 4096   # Typical hidden size of an LLM
rank = 8
original_params = d * d                 # 16,777,216
lora_params = (d * rank) + (rank * d)   # 65,536
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters: {lora_params:,} ({lora_params/original_params*100:.2f}%)")
```
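A quick sanity check makes the two key properties concrete: at step 0 the wrapped layer behaves exactly like the base layer (because B is zero-initialized), and only the two small matrices are trainable. This sketch uses a toy 64-dimensional layer and redeclares a minimal `LoRALayer` so it runs on its own:

```python
import torch
import torch.nn as nn

# Minimal redeclaration of the LoRALayer above, so this snippet is self-contained
class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original = original_layer
        self.original.weight.requires_grad = False
        self.lora_A = nn.Linear(original_layer.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, original_layer.out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.kaiming_normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        return self.original(x) + self.lora_B(self.lora_A(x)) * self.scaling

base = nn.Linear(64, 64, bias=False)
wrapped = LoRALayer(base, rank=8, alpha=16)

x = torch.randn(2, 64)
# B starts at zero, so the wrapped layer matches the base layer exactly at step 0
print(torch.allclose(wrapped(x), base(x)))  # True

# Only the two LoRA matrices require gradients: 2 * 64 * 8 = 1024 parameters
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
print(trainable)  # 1024
```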
## Understanding the rank and alpha Parameters
```python
# Relationship between rank and alpha, with recommended values
configs = [
    {"rank": 4,  "alpha": 8,   "use_case": "Light style changes"},
    {"rank": 8,  "alpha": 16,  "use_case": "General fine-tuning (default recommendation)"},
    {"rank": 16, "alpha": 32,  "use_case": "Complex tasks, domain adaptation"},
    {"rank": 64, "alpha": 128, "use_case": "When near full-FT performance is needed"},
]

for cfg in configs:
    scaling = cfg["alpha"] / cfg["rank"]
    d = 4096
    params = 2 * d * cfg["rank"]
    print(f"rank={cfg['rank']:3d}, alpha={cfg['alpha']:3d}, "
          f"scaling={scaling:.1f}, params={params:,}, "
          f"use case: {cfg['use_case']}")

# Key rules:
# - Higher rank = more expressiveness, but more memory and trainable parameters
# - alpha is commonly set to 2x rank (alpha = 2 * rank)
# - scaling = alpha / rank is a fixed multiplier on the LoRA update; it acts
#   much like a learning-rate scale for the adapter path
```
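One useful consequence of the W + (alpha/rank)·BA form is that, after training, the adapter can be folded back into the base weight, so inference pays no extra cost. A minimal sketch with plain tensors (toy sizes, random "trained" factors) shows that the adapter-style forward and the merged weight give identical outputs:

```python
import torch

d, rank, alpha = 64, 8, 16
scaling = alpha / rank

W = torch.randn(d, d)            # frozen base weight
A = torch.randn(rank, d) * 0.01  # stand-ins for trained low-rank factors
B = torch.randn(d, rank) * 0.01

x = torch.randn(2, d)

# Adapter-style forward: base path plus the scaled low-rank path
y_adapter = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged weight: fold BA into W once, then run a single matmul
W_merged = W + (B @ A) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```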
## QLoRA: 4-bit Quantization + LoRA
QLoRA drastically reduces memory by quantizing the base model to 4-bit, then applying LoRA on top. This makes it possible to fine-tune a 7B model on consumer GPUs (RTX 3090/4090).
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization configuration (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 type (recommended)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization (extra memory savings)
)

# Load the model with 4-bit quantized weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
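To actually train, LoRA adapters still need to be attached on top of the quantized model; the usual route is Hugging Face's `peft` library. The sketch below is one common configuration, not the only one: the `target_modules` names assume Llama-style attention projections and would differ for other architectures.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard QLoRA preparation step for a k-bit quantized model
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                # rank
    lora_alpha=16,      # alpha (scaling = alpha / r = 2.0)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable/total parameter breakdown
```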
```python
# Memory usage comparison (weights only)
memory_comparison = {
    "Precision": ["FP32", "FP16", "INT8", "INT4 (QLoRA)"],
    "7B model":  ["28GB", "14GB", "7GB", "3.5GB"],
    "13B model": ["52GB", "26GB", "13GB", "6.5GB"],
    "70B model": ["280GB", "140GB", "70GB", "35GB"],
}
for key, vals in memory_comparison.items():
    print(f"{key:12} | {' | '.join(vals)}")
```
With QLoRA, you can fine-tune a 7B model with roughly 6GB of VRAM (the 3.5GB of quantized weights plus adapters, optimizer state, and activations). This fits comfortably even on an RTX 3060 (12GB).
## Today’s Exercises
- Extend the `LoRALayer` class above: change rank to 4, 8, 16, and 32, then organize the number of trainable parameters and the ratio to total parameters into a table.
- Measure the actual memory usage difference between `load_in_8bit=True` and `load_in_4bit=True` in `BitsAndBytesConfig`. Use `model.get_memory_footprint()`.
- Research how the alpha/rank ratio (scaling) in LoRA affects training, and document what problems occur when scaling is too high or too low.