Practical Guide to LLM Fine-Tuning: From LoRA to SFT

What Is Fine-Tuning?

Fine-tuning is the process of additionally training a pre-trained LLM for a specific domain or task. Think of it like a medical school graduate (pre-training) going through a dermatology residency (fine-tuning). The key is to enhance expertise in a specific area while retaining general knowledge.

This post covers when to fine-tune, the available approaches, and hands-on code for training with LoRA/QLoRA.

Fine-Tuning vs Prompt Engineering vs RAG

There are three ways to improve LLM responses. You need to choose the right strategy for your situation.

| Method | Cost | Best For | Limitations |
| --- | --- | --- | --- |
| Prompt Engineering | Low | Quick prototyping, general tasks | Difficult to maintain complex formats/tone |
| RAG | Medium | Up-to-date information, internal document-based answers | Depends on retrieval quality |
| Fine-tuning | High | Domain specialization, consistent style | Requires GPU, data preparation cost |

When fine-tuning is needed:

  • Prompts alone can't consistently maintain the desired format or tone
  • Domain-specific terminology or knowledge is required
  • You need to reduce model response latency (solving through training instead of long prompts)

Comparing Fine-Tuning Approaches

Full Fine-Tuning vs PEFT

Full fine-tuning updates all model parameters. For a 7B model, the weights (FP16) are 14GB, and including optimizer states and gradients, approximately 60GB+ of GPU memory is required. In contrast, PEFT (Parameter-Efficient Fine-Tuning) trains only a small number of parameters, significantly reducing costs.
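The memory figure is easy to sanity-check with a rough back-of-envelope estimate (a minimal sketch; it assumes FP16 weights and gradients plus FP32 Adam moment estimates, and ignores activations and framework overhead):

```python
def full_ft_memory_gb(n_params: float) -> dict:
    """Rough full fine-tuning memory estimate in GB (1 GB = 1e9 bytes).

    Assumes FP16 weights (2 bytes/param), FP16 gradients (2 bytes/param),
    and FP32 Adam moment estimates (2 x 4 bytes/param).
    """
    weights = n_params * 2 / 1e9
    gradients = n_params * 2 / 1e9
    optimizer = n_params * 8 / 1e9
    return {"weights": weights, "gradients": gradients,
            "optimizer": optimizer, "total": weights + gradients + optimizer}

est = full_ft_memory_gb(7e9)
print(est["weights"])  # 14.0 -- the 14GB of FP16 weights mentioned above
print(est["total"])    # 84.0 -- comfortably past the ~60GB+ figure
```

Activations and any FP32 master-weight copies push the real number even higher, which is why full fine-tuning of even a 7B model is out of reach for consumer GPUs.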

| Approach | Trainable Parameters | GPU Memory (7B) | Performance | Features |
| --- | --- | --- | --- | --- |
| Full FT | 100% | ~60GB+ | Best | Updates all parameters |
| LoRA | ~0.1-1% | ~16-24GB | Near Full FT | Inserts low-rank matrices |
| QLoRA | ~0.1-1% | ~12-20GB | Equal to LoRA | 4-bit quantization + LoRA |

How LoRA Works

LoRA (Low-Rank Adaptation) doesn’t modify the existing weight matrix W directly. Instead, it adds small low-rank matrices A and B for training. Expressed as a formula: W' = W + BA, where A and B are much smaller than the original matrix.

For example, applying rank=8 LoRA to a 4096x4096 matrix (~16.7 million parameters) means only 4096x8 + 8x4096 = 65,536 parameters are trained. That’s only 0.4% of the original.
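The arithmetic above can be verified in a few lines (a minimal sketch mirroring the 4096x4096 example):

```python
d_out, d_in, r = 4096, 4096, 8

full_params = d_out * d_in            # original weight matrix W
lora_params = d_out * r + r * d_in    # B (d_out x r) plus A (r x d_in)

print(f"{full_params:,}")                  # 16,777,216 (~16.7M)
print(f"{lora_params:,}")                  # 65,536
print(f"{lora_params / full_params:.2%}")  # 0.39%
```

Note that the ratio scales with r: doubling the rank doubles the trainable parameters but is still a tiny fraction of the full matrix.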

Preparing the Dataset

The quality of fine-tuning depends on the dataset. Two main formats are used.

# Alpaca format: single-turn instruction-response
alpaca_data = [
    {
        "instruction": "Rewrite the following sentence in a polite tone.",
        "input": "Handle this quickly.",
        "output": "I would appreciate it if you could process this matter promptly."
    },
    {
        "instruction": "Fix the bug in the following code.",
        "input": "def add(a, b): return a - b",
        "output": "def add(a, b): return a + b  # Fixed subtraction to addition"
    }
]

# ShareGPT format: multi-turn conversation
sharegpt_data = [
    {
        "conversations": [
            {"from": "human", "value": "What's the difference between a list and a tuple in Python?"},
            {"from": "gpt", "value": "Lists are mutable while tuples are immutable..."},
            {"from": "human", "value": "When should I use a tuple then?"},
            {"from": "gpt", "value": "Use tuples when values should not be changed..."}
        ]
    }
]
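If your training stack expects OpenAI-style `messages` records instead (the format consumed by `tokenizer.apply_chat_template`), ShareGPT data is straightforward to convert. A minimal sketch; the `role_map` keys cover the common ShareGPT role names, and the sample record is illustrative:

```python
record = {
    "conversations": [
        {"from": "human", "value": "What's the difference between a list and a tuple?"},
        {"from": "gpt", "value": "Lists are mutable; tuples are immutable."},
    ]
}

def sharegpt_to_messages(record):
    """Map ShareGPT 'from'/'value' turns to 'role'/'content' messages."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return [
        {"role": role_map[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]

messages = sharegpt_to_messages(record)
# [{'role': 'user', 'content': "What's the difference between a list and a tuple?"},
#  {'role': 'assistant', 'content': 'Lists are mutable; tuples are immutable.'}]
```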

Data quality checklist: prepare at least 500-1,000 samples, keep a consistent tone between instruction and output, remove duplicate data, and filter out samples with extreme output lengths.
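Parts of this checklist are easy to automate. A minimal sketch for Alpaca-format samples; the length thresholds are illustrative assumptions, not recommendations:

```python
def clean_dataset(samples, min_chars=10, max_chars=2000):
    """Drop exact duplicates and samples with extreme output lengths."""
    seen, cleaned = set(), []
    for s in samples:
        key = (s["instruction"], s.get("input", ""), s["output"])
        if key in seen:
            continue  # exact duplicate
        if not (min_chars <= len(s["output"]) <= max_chars):
            continue  # output too short or too long
        seen.add(key)
        cleaned.append(s)
    return cleaned

raw = [
    {"instruction": "Summarize.", "input": "", "output": "A reasonable summary of the text."},
    {"instruction": "Summarize.", "input": "", "output": "A reasonable summary of the text."},
    {"instruction": "Explain.", "input": "", "output": "ok"},  # too short
]
print(len(clean_dataset(raw)))  # 1
```

Tone consistency is harder to check mechanically; sampling and manually reviewing a few dozen records remains the most reliable approach.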

Hands-On LoRA Fine-Tuning Code

We’ll use Hugging Face’s peft and trl libraries to perform actual LoRA fine-tuning. The example below applies 4-bit quantization (QLoRA) so it can run on consumer GPUs.

# Install required packages first (run in a shell, not Python):
#   pip install transformers peft trl datasets bitsandbytes accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16 # Use bfloat16 for computation
)

# 2. Load model and tokenizer
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,                  # Rank of low-rank matrices (higher = more expressiveness)
    lora_alpha=32,         # Scaling factor (typically 2x of r)
    lora_dropout=0.05,     # Dropout to prevent overfitting
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,384,832 || trainable%: 0.1695

# 4. Load dataset
dataset = load_dataset("json", data_files="train_data.json", split="train")

# 5. Run SFT training
training_config = SFTConfig(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 x 4 = 16
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_config,
    tokenizer=tokenizer,  # on recent trl versions, pass processing_class=tokenizer instead
)
trainer.train()

# 6. Save LoRA adapter (saves only the adapter, not the full model)
model.save_pretrained("./lora-adapter")

Here’s a summary of the key hyperparameters from the code above.

| Parameter | Typical Value | Description |
| --- | --- | --- |
| r (rank) | 8-64 | Higher values increase expressiveness but use more memory |
| lora_alpha | 1-2x of r | Scaling factor; the alpha/r ratio matters |
| learning_rate | 1e-4 to 3e-4 | Set higher than for full fine-tuning |
| num_train_epochs | 1-5 | Watch for overfitting; 3 is typical |
| gradient_accumulation_steps | 2-8 | Increase when GPU memory is limited |

Using the Model After Training

Apply the saved LoRA adapter back to the base model for inference.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and apply LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Inference
inputs = tokenizer("What is a decorator in Python?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

LoRA adapters are typically only a few dozen MB, making storage/deployment far more efficient compared to the original model (several GB). You can also create multiple adapters for different purposes on a single base model and swap them out.

Summary

Fine-tuning is a powerful way to specialize an LLM for a specific domain. Here are the key takeaways:

  • Try prompt engineering and RAG first; move to fine-tuning when prompts alone can't deliver the required format, tone, or domain knowledge
  • QLoRA (4-bit quantization + LoRA) cuts memory by more than half while matching LoRA performance, enabling 7B-8B models on 16-24GB GPUs (no A100 required)
  • Data quality is key: prepare 500-1,000+ consistent, high-quality samples
  • Trained LoRA adapters are just a few dozen MB, making storage, deployment, and adapter swapping cheap
