Adapt large language models by training only a small fraction of their parameters — LoRA, QLoRA, prefix tuning, and adapters compared.
Full fine-tuning requires updating all weights: for a 7B model in FP32 that's ~28GB of weights, another ~28GB of gradients, and ~56GB of optimizer state (Adam keeps two FP32 moment tensors per parameter), pushing total training memory above 110GB before activations. PEFT (parameter-efficient fine-tuning) methods freeze the base model and train only a small number of additional parameters. The base model weights never change — you train an adapter on top.
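The arithmetic behind those numbers is easy to sketch. A back-of-envelope estimate (ignoring activations and framework overhead, and assuming pure FP32 training with Adam's two moment tensors):

```python
# Rough training-memory estimate: weights + gradients + Adam state, all FP32.
def full_ft_memory_gb(n_params: float) -> float:
    bytes_per_param = 4 + 4 + 8   # weights + grads + Adam m and v
    return n_params * bytes_per_param / 1e9

# With PEFT, only the trainable fraction carries gradients and optimizer state;
# the frozen base contributes weights only.
def peft_memory_gb(n_params: float, trainable_frac: float) -> float:
    frozen = n_params * 4
    trainable = n_params * trainable_frac * (4 + 4 + 8)
    return (frozen + trainable) / 1e9

print(f"full FT, 7B: {full_ft_memory_gb(7e9):.0f} GB")        # ~112 GB
print(f"PEFT, 7B at 0.2% trainable: {peft_memory_gb(7e9, 0.002):.0f} GB")  # ~28 GB
```

In practice PEFT setups also quantize or cast the frozen base (see QLoRA below), which shrinks the dominant frozen-weights term further.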
| Method | Params trained | Memory overhead | Quality vs full FT | Inference overhead | Best for |
|---|---|---|---|---|---|
| LoRA | <1% | Low | ~90% | None (merged) | General task adaptation |
| QLoRA | <1% | Very low (4-bit base) | ~85–90% | None (merged) | Consumer GPU fine-tuning |
| Prefix tuning | 0.1% | Very low | ~80% | Small (prefix tokens) | Style/tone adaptation |
| Adapter layers | 1–4% | Low–medium | ~90% | Small (extra layers) | Multi-task serving |
PEFT methods differ in how much full fine-tuning quality they recover, and they trade parameter efficiency against serving flexibility. LoRA dominates for single-task deployment because its merged adapter adds no inference overhead; adapter layers excel in multi-task scenarios where you swap specializations at serve time.
LoRA (Low-Rank Adaptation) decomposes the weight update ΔW into two low-rank matrices: ΔW = B × A, where A ∈ ℝ^(r×d), B ∈ ℝ^(d×r), and the rank r ≪ d. Only A and B are trained, and the update is scaled by α/r (the lora_alpha hyperparameter). At inference, BA is merged back into the original weight (W′ = W + BA) — zero overhead.
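The merge identity is easy to verify numerically. A toy NumPy sketch (dimensions chosen for illustration; real LoRA initializes B to zero so ΔW starts at zero, here B is random to stand in for a trained adapter):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16            # toy sizes; r << d is the point of LoRA

W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d))        # trained low-rank factor
B = rng.normal(size=(d, r))        # trained factor (zero-init in real LoRA)
x = rng.normal(size=d)

# During training, the adapter path runs alongside the frozen weight:
h_train = W @ x + (alpha / r) * (B @ (A @ x))

# At inference, fold the update into W once; no extra matmuls remain:
W_merged = W + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_train, h_merged))  # True
```

Note the parameter saving: B and A together hold 2rd values versus d² for a full update, here 1,024 versus 4,096.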
Low ranks (e.g. r = 4–8): style and tone adaptation, simple instruction following. Minimal storage. For lightweight tasks where you're not reshaping the model's reasoning.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import torch

# QLoRA: load base in 4-bit, train LoRA adapters in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA configuration: target attention projection matrices
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — higher = more capacity, more params
    lora_alpha=32,      # scaling factor (typically 2×r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,254,540,288 || trainable%: 0.19

# Training data in chat format
train_data = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "Gradient descent is an iterative optimization..."},
    ]}
])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        max_seq_length=512,
    ),
    tokenizer=tokenizer,
)
trainer.train()
model.save_pretrained("./lora-adapter")  # saves only the ~80MB adapter
```
- r = 16: Default starting point. Good for most task adaptation use cases. Balances expressiveness against storage. Start here unless you have a reason not to.
- Higher ranks (r = 32–64): Complex reasoning, domain adaptation, knowledge injection. Better quality at modest cost. Use if rank 16 plateaus.
- Very high ranks (r ≥ 128): Rarely needed. Diminishing returns; approaching full fine-tune cost. Only if you've validated that quality improves and memory is unconstrained.
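As a sanity check on how rank drives parameter count, the trainable-parameter figure printed by print_trainable_parameters can be reproduced by hand. This sketch assumes Mistral-7B's published shapes (hidden size 4096, grouped-query K/V projections of width 1024, 32 layers) and counts only the four targeted projections:

```python
# LoRA adds A (r x d_in) and B (d_out x r) per targeted matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

hidden, kv_dim, layers, r = 4096, 1024, 32, 16

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj: 4096 -> 4096
    + lora_params(hidden, kv_dim, r)  # k_proj: 4096 -> 1024 (grouped-query attn)
    + lora_params(hidden, kv_dim, r)  # v_proj: 4096 -> 1024
    + lora_params(hidden, hidden, r)  # o_proj: 4096 -> 4096
)
print(per_layer * layers)  # 13631488, matching the printed count above
```

Doubling r doubles this count, which is why rank is the main capacity/cost knob.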
QLoRA quantizes the frozen base model to NF4 (4-bit NormalFloat) and trains LoRA adapters on top in BF16. A 65B model whose full 16-bit fine-tuning needs over 780GB of GPU memory fits on a single 48GB GPU. Three key innovations: NF4 quantization, double quantization (quantizing the quantization constants themselves), and paged optimizers (spilling optimizer state to CPU RAM during memory spikes).
QLoRA trades a small loss in convergence speed and final quality (~5%) for the ability to fine-tune billion-parameter models on consumer hardware. It's the most practical PEFT method for researchers without enterprise budgets.
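The training script above already covers NF4 and the compute dtype; the other two innovations map onto configuration flags. A sketch (the double-quant flag is from transformers' BitsAndBytesConfig; the paged optimizer is selected by the trainer's optim string, of which several paged variants exist):

```python
from transformers import BitsAndBytesConfig
import torch

# QLoRA-style quantization config with all three pieces visible
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and matmuls in BF16
)

# Paged optimizers are chosen through the trainer arguments, e.g.:
#   SFTConfig(..., optim="paged_adamw_32bit")
```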
Once training finishes, the adapter can be merged back into the base weights to produce a standalone model for deployment:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path = "./lora-adapter"

# Load base model in half precision for merging
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid OOM on smaller GPUs
)

# Load PEFT model (base + adapter)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge adapter weights into base model
print("Merging adapter into base model...")
merged_model = peft_model.merge_and_unload()
print("Merge complete.")

# Save merged model — now a standalone model without PEFT dependency
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer.save_pretrained("./merged-model")
print("Saved merged model to ./merged-model")

# The merged model:
# - Has ZERO runtime overhead vs base model
# - Can be served with any standard inference framework
# - Is indistinguishable from a full fine-tuned model
```
Start with LoRA rank 16. If you hit GPU memory limits, use QLoRA. If you need to serve multiple specialized models simultaneously without reloading, use adapter layers. For cosmetic changes only (style, tone, instruction format), try prefix tuning first.
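For the prefix-tuning route, here is a minimal sketch with the peft library (num_virtual_tokens is an illustrative choice, and base_model stands for any causal LM loaded as in the earlier examples):

```python
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Prefix tuning: learn `num_virtual_tokens` continuous vectors prepended to
# the keys/values of every attention layer; the base model stays frozen.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # illustrative prefix length
)

# model = get_peft_model(base_model, prefix_config)
# model.print_trainable_parameters()
```

Because only the prefix vectors are trained, the parameter count is tiny, which is why the method tops out at style and format changes rather than new capabilities.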
PEFT adapters are small — a LoRA adapter for a 7B model might be 40–80MB while the base model is 14GB. This creates a deployment opportunity: host one base model and hot-swap adapters per tenant or task. The PEFT library and vLLM both support multi-LoRA serving with dynamic adapter loading.
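With vLLM, multi-LoRA serving can be sketched roughly as follows (the adapter name, integer ID, and path are placeholders; check the vLLM documentation for the exact arguments your version expects):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model resident in GPU memory; adapters attached per request.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_lora=True)

outputs = llm.generate(
    ["Summarize this support ticket: ..."],
    SamplingParams(max_tokens=128),
    # (name, integer ID, local path) identify which adapter to apply
    lora_request=LoRARequest("support-adapter", 1, "./lora-adapter"),
)
```

A request with a different LoRARequest gets a different specialization without reloading the 14GB base.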
Key considerations: adapter compatibility (adapters are version-locked to the base model checkpoint), merging vs dynamic loading (merge for single-purpose deployment to eliminate serving overhead, dynamic loading for multi-task scenarios), and adapter versioning (treat adapters like model artifacts — version control, evaluation gates before promotion).
| Method | Trainable Params | Storage Overhead | Quality vs Full FT | Best For |
|---|---|---|---|---|
| LoRA | 0.1–1% | ~40–80MB adapter | ~90% | Domain adaptation, style |
| QLoRA | 0.1–1% | ~40MB adapter + 4-bit base | ~85–90% | Fine-tuning on a consumer GPU |
| Prefix Tuning | <0.1% | Minimal | ~80% | Prompt-style tasks |
| IA³ | <0.05% | Minimal | ~88% | Few-shot task specialization |
| Full Fine-Tuning | 100% | Full model copy | 100% (ceiling) | Maximum quality, large budget |