What Is Fine-Tuning?
Fine-tuning is the process of additionally training a pre-trained LLM for a specific domain or task. Think of it like a medical school graduate (pre-training) going through a dermatology residency (fine-tuning). The key is to enhance expertise in a specific area while retaining general knowledge.
This post covers when to fine-tune, the available approaches, and hands-on code for training with LoRA/QLoRA.
Fine-Tuning vs Prompt Engineering vs RAG
There are three ways to improve LLM responses. You need to choose the right strategy for your situation.
| Method | Cost | Best For | Limitations |
|---|---|---|---|
| Prompt Engineering | Low | Quick prototyping, general tasks | Difficult to maintain complex formats/tone |
| RAG | Medium | Up-to-date information, internal document-based answers | Depends on retrieval quality |
| Fine-tuning | High | Domain specialization, consistent style | Requires GPU, data preparation cost |
When fine-tuning is needed:
- Prompts alone can't consistently maintain the desired format or tone
- Domain-specific terminology or knowledge is required
- You need to reduce response latency (training replaces long prompts)
Comparing Fine-Tuning Approaches
Full Fine-Tuning vs PEFT
Full fine-tuning updates all model parameters. For a 7B model, the weights (FP16) are 14GB, and including optimizer states and gradients, approximately 60GB+ of GPU memory is required. In contrast, PEFT (Parameter-Efficient Fine-Tuning) trains only a small number of parameters, significantly reducing costs.
| Approach | Trainable Parameters | GPU Memory (7B) | Performance | Features |
|---|---|---|---|---|
| Full FT | 100% | ~60GB+ | Best | Updates all parameters |
| LoRA | ~0.1—1% | ~16—24GB | Near Full FT | Inserts low-rank matrices |
| QLoRA | ~0.1—1% | ~12—20GB | Equal to LoRA | 4-bit quantization + LoRA |
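The "~60GB+" figure for full fine-tuning can be sanity-checked with back-of-envelope arithmetic. The sketch below counts only weights, gradients, and AdamW optimizer states; the byte counts are rough assumptions, and activations plus batch size add more on top:

```python
# Back-of-envelope memory for full fine-tuning of a 7B model.
# Byte counts are approximate assumptions, not exact library behavior.
PARAMS = 7e9

BYTES_WEIGHTS = 2   # FP16 weights
BYTES_GRADS = 2     # FP16 gradients
BYTES_OPTIM = 4     # AdamW momentum + variance (2 bytes each, at minimum)

total_gb = PARAMS * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIM) / 1e9
print(f"~{total_gb:.0f} GB")  # ~56 GB before activations -> "60GB+" in practice
```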
How LoRA Works
LoRA (Low-Rank Adaptation) doesn’t modify the existing weight matrix W directly. Instead, it adds small low-rank matrices A and B for training. Expressed as a formula: W' = W + BA, where A and B are much smaller than the original matrix.
For example, applying rank=8 LoRA to a 4096x4096 matrix (~16.8 million parameters) means only 4096x8 + 8x4096 = 65,536 parameters are trained, only about 0.4% of the original.
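That arithmetic is easy to verify numerically. This numpy sketch also shows the standard LoRA initialization trick: B starts at zero, so the effective weight is unchanged at the beginning of training:

```python
import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight (4096x4096)
A = rng.standard_normal((r, d)) * 0.01  # trainable r x d factor
B = np.zeros((d, r))                    # trainable d x r factor, zero-initialized

W_eff = W + B @ A                       # W' = W + BA

trainable = A.size + B.size
print(trainable)                        # 65536
print(f"{trainable / W.size:.2%}")      # 0.39%
assert np.allclose(W_eff, W)            # BA = 0 at init, so W' starts equal to W
```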
Preparing the Dataset
The quality of fine-tuning depends on the dataset. Two main formats are used.
```python
# Alpaca format: single-turn instruction-response
alpaca_data = [
    {
        "instruction": "Rewrite the following sentence in a polite tone.",
        "input": "Handle this quickly.",
        "output": "I would appreciate it if you could process this matter promptly."
    },
    {
        "instruction": "Fix the bug in the following code.",
        "input": "def add(a, b): return a - b",
        "output": "def add(a, b): return a + b  # Fixed subtraction to addition"
    }
]

# ShareGPT format: multi-turn conversation
sharegpt_data = [
    {
        "conversations": [
            {"from": "human", "value": "What's the difference between a list and a tuple in Python?"},
            {"from": "gpt", "value": "Lists are mutable while tuples are immutable..."},
            {"from": "human", "value": "When should I use a tuple then?"},
            {"from": "gpt", "value": "Use tuples when values should not be changed..."}
        ]
    }
]
```
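Before training, multi-turn ShareGPT records need to be rendered into the role/content schema that `tokenizer.apply_chat_template` expects. A minimal converter sketch (the role names below follow common convention, not a fixed standard; adjust to your data):

```python
# Map ShareGPT "from"/"value" turns to the role/content schema used by
# tokenizer.apply_chat_template.
def sharegpt_to_messages(sample):
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return [
        {"role": role_map[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]

sample = {
    "conversations": [
        {"from": "human", "value": "What's the difference between a list and a tuple?"},
        {"from": "gpt", "value": "Lists are mutable; tuples are immutable."},
    ]
}
messages = sharegpt_to_messages(sample)
# messages is now ready for tokenizer.apply_chat_template(messages, tokenize=False)
```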
Data quality checklist:
- Prepare at least 500—1,000 samples
- Keep the tone consistent between instruction and output
- Remove duplicate data
- Filter out samples with extreme differences in output length
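The deduplication and length-filtering items on that checklist can be automated. A minimal sketch for Alpaca-format data (the thresholds are illustrative, not canonical):

```python
def clean_dataset(samples, max_ratio=20):
    """Drop exact duplicates and extreme output-length outliers."""
    lengths = sorted(len(s["output"]) for s in samples)
    median = lengths[len(lengths) // 2]
    seen, cleaned = set(), []
    for s in samples:
        key = (s["instruction"], s.get("input", ""), s["output"])
        if key in seen or len(s["output"]) > median * max_ratio:
            continue
        seen.add(key)
        cleaned.append(s)
    return cleaned

raw = [
    {"instruction": "a", "input": "", "output": "short answer"},
    {"instruction": "a", "input": "", "output": "short answer"},  # exact duplicate
    {"instruction": "b", "input": "", "output": "x" * 5000},      # length outlier
    {"instruction": "c", "input": "", "output": "another answer"},
]
print(len(clean_dataset(raw)))  # 2
```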
Hands-On LoRA Fine-Tuning Code
We’ll use Hugging Face’s peft and trl libraries to perform actual LoRA fine-tuning. The example below applies 4-bit quantization (QLoRA) so it can run on consumer GPUs.
```bash
# Install required packages
pip install transformers peft trl datasets bitsandbytes accelerate
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # use bfloat16 for computation
)

# 2. Load model and tokenizer
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,           # rank of the low-rank matrices (higher = more expressiveness)
    lora_alpha=32,  # scaling factor (typically 2x of r)
    lora_dropout=0.05,  # dropout to prevent overfitting
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,384,832 || trainable%: 0.1695

# 4. Load dataset
dataset = load_dataset("json", data_files="train_data.json", split="train")

# 5. Run SFT training
training_config = SFTConfig(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 4 x 4 = 16
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_config,
    tokenizer=tokenizer,  # on trl >= 0.12, pass processing_class=tokenizer instead
)
trainer.train()

# 6. Save the LoRA adapter (saves only the adapter, not the full model)
model.save_pretrained("./lora-adapter")
```
Here’s a summary of the key hyperparameters from the code above.
| Parameter | Value | Description |
|---|---|---|
| r (rank) | 8—64 | Higher values increase expressiveness but use more memory |
| lora_alpha | 1—2x of r | Scaling factor; the alpha/r ratio is what matters |
| learning_rate | 1e-4 to 3e-4 | Set higher than for full fine-tuning |
| num_train_epochs | 1—5 | Watch for overfitting; 3 is typical |
| gradient_accumulation_steps | 2—8 | Increase when GPU memory is limited |
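Two of these knobs interact with the rest of the config in fixed ways: peft multiplies the LoRA update BA by alpha/r, and the batch size the optimizer actually sees is the product of per-device batch, accumulation steps, and GPU count. Quick arithmetic with the values used in this post:

```python
# alpha/r is the multiplier applied to the LoRA update BA
r, lora_alpha = 16, 32
scaling = lora_alpha / r  # 2.0 -> raising r without raising alpha shrinks this

# effective batch size seen by the optimizer
per_device_bs, grad_accum, n_gpus = 4, 4, 1
effective_batch = per_device_bs * grad_accum * n_gpus  # 4 x 4 x 1 = 16
```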
Using the Model After Training
Apply the saved LoRA adapter back to the base model for inference.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and apply the LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,  # avoid the FP32 default (~32GB for an 8B model)
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Inference
inputs = tokenizer("What is a decorator in Python?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
LoRA adapters are typically only a few dozen MB, making storage and deployment far more efficient than shipping the full model (~16GB for an 8B model in FP16). You can also keep multiple adapters for different purposes on a single base model and swap them out.
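If you prefer a standalone checkpoint instead, peft's merge_and_unload() folds the adapter into the base weights (per layer: W' = W + (alpha/r)·BA), after which inference needs no peft at all. The equivalence it relies on can be checked in plain numpy with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16
scaling = alpha / r

W = rng.standard_normal((d, d))         # base weight (out x in)
A = rng.standard_normal((r, d)) * 0.1   # trained LoRA factors
B = rng.standard_normal((d, r)) * 0.1
x = rng.standard_normal((1, d))         # one input row

# adapter attached: base path plus the scaled low-rank path
y_adapter = x @ W.T + (x @ A.T @ B.T) * scaling

# merged: the fold that merge_and_unload performs once, per layer
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

assert np.allclose(y_adapter, y_merged)  # identical outputs, no adapter needed
```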
Summary
Fine-tuning is a powerful way to specialize an LLM for a specific domain. Here are the key takeaways:
- Consider fine-tuning when prompts aren’t enough — try prompt engineering and RAG first
- QLoRA enables training 7—8B models on 12—20GB consumer GPUs (no A100 required)
- Data quality is key — prepare 500—1,000+ consistent, high-quality samples
- QLoRA: 4-bit quantization shrinks base-model weight memory to roughly a quarter while matching LoRA performance
- Trained LoRA adapters are just a few dozen MB — efficient for storage, deployment, and swapping