# Day 19: Understanding LoRA and QLoRA
LoRA is a technique that fine-tunes efficiently by adding small pairs of trainable matrices instead of modifying the weights of a large model directly. QLoRA combines this with 4-bit quantization of the base model to save even more memory.
## Core Principles of LoRA
For an original weight matrix W (d × d), full fine-tuning updates all of W. LoRA freezes W and instead trains a low-rank decomposition, replacing W with W + BA (B: d × r, A: r × d, with r much smaller than d). The smaller the rank r, the fewer trainable parameters: 2dr instead of d².
```python
import torch
import torch.nn as nn

# Understanding the core concept of LoRA with simple code
class LoRALayer(nn.Module):
    """LoRA adapter: freeze the original weights, train only the low-rank matrices."""
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.original = original_layer
        self.original.weight.requires_grad = False  # Freeze original weights

        # Low-rank matrices A, B (the trainable parameters)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scaling = alpha / rank  # Scaling factor

        # A gets a normal-distribution init, B starts at zero,
        # so the adapted model is identical to the original at step 0
        nn.init.kaiming_normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        original_output = self.original(x)
        lora_output = self.lora_B(self.lora_A(x)) * self.scaling
        return original_output + lora_output

# Parameter count comparison
d = 4096   # Typical hidden size of an LLM
rank = 8
original_params = d * d                 # 16,777,216
lora_params = (d * rank) + (rank * d)   # 65,536
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters: {lora_params:,} ({lora_params/original_params*100:.2f}%)")
```
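A quick sanity check makes the two key properties concrete: at step 0 the wrapped layer behaves exactly like the base layer (because B is zero-initialized), and only the two small matrices are trainable. This sketch uses a toy 64-dimensional layer and redeclares a minimal `LoRALayer` so it runs on its own:

```python
import torch
import torch.nn as nn

# Minimal redeclaration of the LoRALayer above, so this snippet is self-contained
class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original = original_layer
        self.original.weight.requires_grad = False
        self.lora_A = nn.Linear(original_layer.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, original_layer.out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.kaiming_normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        return self.original(x) + self.lora_B(self.lora_A(x)) * self.scaling

base = nn.Linear(64, 64, bias=False)
wrapped = LoRALayer(base, rank=8, alpha=16)

x = torch.randn(2, 64)
# B starts at zero, so the wrapped layer matches the base layer exactly at step 0
print(torch.allclose(wrapped(x), base(x)))  # True

# Only the two LoRA matrices require gradients: 2 * 64 * 8 = 1024 parameters
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
print(trainable)  # 1024
```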
## Understanding the rank and alpha Parameters
```python
# Relationship between rank and alpha, with recommended values
configs = [
    {"rank": 4,  "alpha": 8,   "use_case": "Light style changes"},
    {"rank": 8,  "alpha": 16,  "use_case": "General fine-tuning (default recommendation)"},
    {"rank": 16, "alpha": 32,  "use_case": "Complex tasks, domain adaptation"},
    {"rank": 64, "alpha": 128, "use_case": "When near full-FT performance is needed"},
]

for cfg in configs:
    scaling = cfg["alpha"] / cfg["rank"]
    d = 4096
    params = 2 * d * cfg["rank"]
    print(f"rank={cfg['rank']:3d}, alpha={cfg['alpha']:3d}, "
          f"scaling={scaling:.1f}, params={params:,}, "
          f"use case: {cfg['use_case']}")

# Key rules:
# - Higher rank = more expressiveness, but more memory and trainable parameters
# - alpha is commonly set to 2x rank (alpha = 2 * rank)
# - scaling = alpha / rank is a fixed multiplier on the LoRA update; it acts
#   much like a learning-rate scale for the adapter path
```
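One useful consequence of the W + (alpha/rank)·BA form is that, after training, the adapter can be folded back into the base weight, so inference pays no extra cost. A minimal sketch with plain tensors (toy sizes, random "trained" factors) shows that the adapter-style forward and the merged weight give identical outputs:

```python
import torch

d, rank, alpha = 64, 8, 16
scaling = alpha / rank

W = torch.randn(d, d)            # frozen base weight
A = torch.randn(rank, d) * 0.01  # stand-ins for trained low-rank factors
B = torch.randn(d, rank) * 0.01

x = torch.randn(2, d)

# Adapter-style forward: base path plus the scaled low-rank path
y_adapter = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged weight: fold BA into W once, then run a single matmul
W_merged = W + (B @ A) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```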
## QLoRA: 4-bit Quantization + LoRA
QLoRA drastically reduces memory by quantizing the base model to 4-bit, then applying LoRA on top. This makes it possible to fine-tune a 7B model on consumer GPUs (RTX 3090/4090).
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization configuration (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 type (recommended)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization (extra memory savings)
)

# Load the model with 4-bit quantized weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
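To actually train, LoRA adapters still need to be attached on top of the quantized model; the usual route is Hugging Face's `peft` library. The sketch below is one common configuration, not the only one: the `target_modules` names assume Llama-style attention projections and would differ for other architectures.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard QLoRA preparation step for a k-bit quantized model
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                # rank
    lora_alpha=16,      # alpha (scaling = alpha / r = 2.0)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable/total parameter breakdown
```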
```python
# Memory usage comparison (weights only)
memory_comparison = {
    "Precision": ["FP32", "FP16", "INT8", "INT4 (QLoRA)"],
    "7B model":  ["28GB", "14GB", "7GB", "3.5GB"],
    "13B model": ["52GB", "26GB", "13GB", "6.5GB"],
    "70B model": ["280GB", "140GB", "70GB", "35GB"],
}
for key, vals in memory_comparison.items():
    print(f"{key:12} | {' | '.join(vals)}")
```
With QLoRA, you can fine-tune a 7B model with roughly 6GB of VRAM (the 3.5GB of quantized weights plus adapters, optimizer state, and activations). This fits comfortably even on an RTX 3060 (12GB).
## Today’s Exercises
- Extend the `LoRALayer` class above: change rank to 4, 8, 16, and 32, then organize the number of trainable parameters and the ratio to total parameters into a table.
- Measure the actual memory usage difference between `load_in_8bit=True` and `load_in_4bit=True` in `BitsAndBytesConfig`. Use `model.get_memory_footprint()`.
- Research how the alpha/rank ratio (scaling) in LoRA affects training, and document what problems occur when scaling is too high or too low.