LLM 25-Day Course - Day 20: PEFT Library in Practice

PEFT (Parameter-Efficient Fine-Tuning) is Hugging Face’s official library that makes it easy to apply LoRA, QLoRA, and more. Today we cover the entire process of applying LoRA to an actual model.

Setting Up LoraConfig

# pip install peft transformers bitsandbytes accelerate  # accelerate is required for device_map="auto"

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization configuration for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,    # Task type
    r=8,                              # Rank (low-rank dimension)
    lora_alpha=16,                    # Scaling factor
    lora_dropout=0.05,                # Dropout (prevents overfitting)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Layers to apply
    bias="none",                      # Bias handling ("none" = do not train biases)
)
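
The r and lora_alpha values above have a concrete meaning: the adapter adds a low-rank update scaled by lora_alpha / r on top of the frozen weight. A minimal sketch with toy tensor sizes (not the real model's shapes) makes this explicit:

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 64, 8, 16   # toy sizes; r and alpha mirror the config above
W = torch.randn(d_out, d_in)            # frozen pretrained weight (not updated)
A = torch.randn(r, d_in) * 0.01         # LoRA down-projection (trained)
B = torch.zeros(d_out, r)               # LoRA up-projection, zero-initialized

x = torch.randn(d_in)
scaling = alpha / r                     # lora_alpha / r = 2.0 here
y = W @ x + scaling * (B @ (A @ x))     # forward pass with the adapter

# Because B starts at zero, the adapter is a no-op before training
assert torch.allclose(y, W @ x)

# Trainable parameters: r*(d_in + d_out) for LoRA vs d_in*d_out for full fine-tuning
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```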

Creating PEFT Model and Checking Parameters

# Apply LoRA
peft_model = get_peft_model(model, lora_config)

# Check trainable parameters
peft_model.print_trainable_parameters()
# Output example: trainable params: 6,553,600 || all params: 8,036,098,048 || trainable%: 0.0816

# Check which layers have LoRA applied
for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name} | shape: {param.shape}")

# Check LoRA layers in the model structure
print(peft_model)

Target Module Selection Guide

Which layers you apply LoRA to affects performance. The attention projection layers (q/k/v/o) are the usual default targets, and adding the MLP projections can improve quality further at the cost of more trainable parameters.

# Find all Linear layer names in the model
def find_linear_modules(model):
    """Returns the names of Linear layers where LoRA can be applied"""
    linear_modules = set()
    for name, module in model.named_modules():
        # bitsandbytes' quantized Linear4bit subclasses torch.nn.Linear,
        # so this check also covers models loaded in 4-bit
        if isinstance(module, torch.nn.Linear):
            # Extract only the last part (e.g., model.layers.0.self_attn.q_proj -> q_proj)
            layer_name = name.split(".")[-1]
            if layer_name != "lm_head":  # lm_head is usually excluded from LoRA targets
                linear_modules.add(layer_name)
    return list(linear_modules)

available_modules = find_linear_modules(model)
print(f"Available target modules: {available_modules}")

# General selection guide
target_guide = {
    "Minimal (fast training)":  ["q_proj", "v_proj"],
    "Recommended (balanced)":   ["q_proj", "v_proj", "k_proj", "o_proj"],
    "Maximum (high performance)": ["q_proj", "v_proj", "k_proj", "o_proj",
                                   "gate_proj", "up_proj", "down_proj"],
}
for level, modules in target_guide.items():
    print(f"\n[{level}]: {modules}")
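
To see what each level costs, the LoRA parameter count can be estimated by hand: an adapted Linear of shape (d_in, d_out) contributes r*(d_in + d_out) parameters per layer. The sketch below assumes Llama-3.1-8B-style shapes (hidden size 4096, KV projection width 1024 from grouped-query attention, MLP intermediate size 14336, 32 layers); the exact number PEFT prints depends on the checkpoint:

```python
r = 8
hidden, kv_dim, inter, n_layers = 4096, 1024, 14336, 32  # assumed Llama-3.1-8B shapes

shapes = {                                 # (d_in, d_out) per target module
    "q_proj": (hidden, hidden), "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim), "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter), "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

def lora_params(modules):
    """Total LoRA parameters: r*(d_in + d_out) per module, summed over all layers"""
    return n_layers * sum(r * sum(shapes[m]) for m in modules)

levels = {
    "Minimal":     ["q_proj", "v_proj"],
    "Recommended": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "Maximum":     ["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
}
for level, mods in levels.items():
    # The "Recommended" level lands in the same ballpark as the
    # figure reported by print_trainable_parameters() earlier
    print(f"{level:12s} ~{lora_params(mods):>12,} params")
```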

Model Saving and Loading

LoRA adapters are saved separately from the base model. Adapter size is only tens of MBs, making sharing and version control easy.

from peft import PeftModel

# Save only the LoRA adapter (tens of MB)
peft_model.save_pretrained("./my_lora_adapter")

# Saved file structure
# ./my_lora_adapter/
#   adapter_config.json        (LoRA configuration)
#   adapter_model.safetensors  (Trained weights)

# Load again later
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
loaded_model = PeftModel.from_pretrained(base_model, "./my_lora_adapter")

# Switch to inference mode
loaded_model.eval()

# Merge adapter into base model (optional - improves inference speed)
# Note: for best fidelity, reload the base model unquantized (fp16/bf16) before merging
merged_model = loaded_model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
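
Why merging helps: merge_and_unload folds the adapter into the base weights, so inference needs one matmul per layer instead of three. A minimal sketch of the underlying identity with toy tensors (not the real model's shapes):

```python
import torch

torch.manual_seed(0)
d, r, alpha = 64, 4, 8                 # toy sizes, not the actual model dimensions
W = torch.randn(d, d)                  # frozen base weight
A, B = torch.randn(r, d), torch.randn(d, r)  # trained LoRA matrices
scaling = alpha / r

x = torch.randn(d)
y_adapter = W @ x + scaling * (B @ (A @ x))  # base + adapter: two extra matmuls

W_merged = W + scaling * (B @ A)             # what merging computes per layer
y_merged = W_merged @ x                      # single matmul at inference time

# Both paths produce the same output (up to floating-point rounding)
assert torch.allclose(y_adapter, y_merged, atol=1e-5)
```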

Today’s Exercises

  1. Apply LoRA to a small model (e.g., gpt2 or distilgpt2) and check the trainable parameter ratio with print_trainable_parameters(). Compare by changing the target_modules.
  2. Use the find_linear_modules() function to compare the Linear layer structures of 3 models with different architectures (GPT-2, BERT, T5).
  3. Save a LoRA adapter then reload it, and verify that it produces the same output as before saving. Compare logits for the same input.
