SECTION 01
The Fine-tuning Paradox
Modern LLMs are massive. A 70B-parameter model requires ~140GB of VRAM just to store its weights in float16. Training adds gradients (~140GB) and Adam optimizer states (~280GB). Total: ~560GB. Only multi-GPU setups such as 8× H100 (80GB each, 640GB total) or TPU v4 pods can handle it.
The Problem
- GPU requirements: 8× H100 = ~$160K upfront, or $12K+/month on cloud
- Access: Most researchers/companies can't afford this
- Iteration speed: 8-GPU training takes hours. Slow feedback loops.
- Parameter efficiency: Full fine-tuning updates ALL 70B params. Overkill for most tasks.
Key Insight Most of a pre-trained model's knowledge can stay frozen. You're fine-tuning to adapt to a specific task (e.g., "answer like a doctor" or "write like Shakespeare"). You don't need to update all parameters, just a small adapter.
The LoRA Solution Instead of updating W directly, train two small matrices B and A such that ΔW = BA^T. For a d × k weight matrix, B is (d × r) and A is (k × r); with r = 8, each adapted matrix adds only r × (d + k) trainable parameters. Across a 70B model this comes to tens of millions of trainable parameters, well under 0.1% of the total, and the gradient/optimizer memory shrinks from hundreds of GB to well under 1GB.
Game Changer: LoRA brought fine-tuning from big-lab-exclusive to consumer-GPU territory. A 7B-13B model can now be fine-tuned on a single RTX 4090 (24GB) instead of a multi-H100 node.
SECTION 02
LoRA Math
The elegance of LoRA is simple linear algebra. No architectural changes—just parameter efficiency.
Low-Rank Decomposition
Instead of updating weight matrix W (d × k dimensions) directly, decompose the update:
ΔW = BA^T
Where:
- B ∈ R^(d × r) [d = output dim, r = rank (small, e.g., 8)]
- A ∈ R^(k × r) [k = input dim]
- Rank r << min(d, k)
During forward pass:
output = W @ x + (B @ A^T) @ x
= W @ x + B @ (A^T @ x)
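This equivalence is easy to verify. A minimal NumPy sketch with toy dimensions (64, 48, and rank 8 are arbitrary choices for illustration), also showing why LoRA initializes B to zero so the adapter starts as a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8                  # output dim, input dim, rank (toy sizes)

W = rng.standard_normal((d, k))      # frozen pre-trained weight
B = np.zeros((d, r))                 # LoRA B: initialized to zero
A = rng.standard_normal((k, r))      # LoRA A: random init
x = rng.standard_normal(k)

# At init B = 0, so the adapter contributes nothing: output == W @ x
assert np.allclose(W @ x + B @ (A.T @ x), W @ x)

# After training fills B, both formulations agree:
B = rng.standard_normal((d, r))
out_factored = W @ x + B @ (A.T @ x)   # never materializes the d×k ΔW
out_merged = (W + B @ A.T) @ x         # explicit ΔW = BA^T
assert np.allclose(out_factored, out_merged)
```

The factored form is what runs during training: two thin matmuls instead of storing a full d × k update.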
Parameter Count Comparison
Example: Llama 70B, adapting the attention projections:
- Full fine-tuning: ~70B parameters to train
- LoRA with rank=8 on q_proj and v_proj: each adapted d × k matrix adds r × (d + k) parameters. For 70B-scale dimensions (d = k = 8192, 80 layers), that's roughly 21M trainable params
- Savings: 70B / 21M ≈ 3,300× fewer trainable parameters!
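The arithmetic behind these counts, assuming 70B-class dimensions (hidden dim 8192, 80 layers, q_proj and v_proj adapted); exact totals depend on the model's dimensions and which projections you adapt:

```python
# LoRA trainable-parameter arithmetic (assumed 70B-class sizes)
d = k = 8192        # hidden dimension
r = 8               # LoRA rank
n_layers = 80       # transformer layers
n_proj = 2          # q_proj and v_proj per layer

per_proj = r * (d + k)                 # A (k × r) plus B (d × r)
total = per_proj * n_proj * n_layers
print(f"{per_proj:,} params per adapted projection")   # 131,072
print(f"{total / 1e6:.1f}M trainable params")          # 21.0M
print(f"{70e9 / total:,.0f}x fewer than full FT")      # 3,338x
```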
Why Low-Rank?
The paper hypothesizes that weight changes during adaptation are low-rank—the model's behavior shifts lie in a low-dimensional subspace. Empirically, rank 8–16 captures 90%+ of the benefit of full fine-tuning.
# Intuition: Why low-rank works
# Pre-trained model: "Knows" language fundamentals
# Fine-tuning task: "Adjust tone from neutral to sarcastic"
# This adjustment is a simple projection: rotate some dimensions
# Doesn't require rewriting all 70B params
# LoRA captures this rotation with ~1M params instead of 70B
Rank Selection
- r=8: Fast training, <1M params, usually sufficient
- r=16: Slightly better quality, ~2M params, still fast
- r=32: Diminishing returns, ~4M params, slower
- r=64: Rarely needed, approaches full fine-tuning
Practical Rule: Start with r=8. If quality isn't good enough, try r=16. Higher ranks are rarely needed; the LoRA paper found that very low ranks often match much higher ones on its benchmarks.
SECTION 03
Which Layers to Adapt
You don't need to add LoRA to every layer. Strategic layer selection saves computation and improves results.
Standard Strategy: Attention Layers Only
- Query/Key/Value projections: These control attention patterns. Most task-specific adaptation happens here.
- Output projection: Optional, sometimes helps but adds overhead
- MLP layers: Often frozen; they're task-agnostic
# Typical LoRA config: Adapt only Q, K, V in attention
# Example: 32-layer model, hidden dim 4096 (Llama-7B-like), r = 8
# Per projection: A (4096 × 8) + B (4096 × 8) = ~65K params
# Per layer: 3 projections (Q, K, V) × 65K ≈ 197K params
# Total: 32 layers × 197K ≈ 6.3M trainable params
# vs full fine-tuning of the 7B base: ~1,100× fewer trainable parameters!
When to Adapt More Layers
- Complex domain adaptation: Fine-tuning to specialized domain (legal, medical). Adapt Q, K, V, output projection, maybe MLP
- Language generation: Generate in new style (poetry, technical writing). Attention layers usually sufficient
- Classification: Categorize domain-specific content. Q, K, V usually enough
Empirical Results
Ablation studies (in the LoRA paper and follow-ups) suggest roughly:
- LoRA on Q only: ~70% of full fine-tuning quality
- LoRA on Q, V: ~90% of quality
- LoRA on Q, K, V: ~95% of quality
- LoRA on all (Q, K, V, output, MLP): ~99% of quality, but 10× more params
Configuration Recommendation: For most tasks, adapt Q and V. If quality isn't enough, add K. Rarely need output or MLP. This gets you ~90-95% of full fine-tuning quality with a fraction of the params.
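The trade-off behind this recommendation can be made concrete with a quick count, assuming illustrative Llama-7B-like sizes (hidden dim 4096, FFN width 11008, 32 layers, rank 8):

```python
# Trainable params per target-module strategy (assumed 7B-like sizes)
d, r, n_layers = 4096, 8, 32
mlp_dim = 11008                      # Llama-7B FFN width

def lora_params(shapes):
    # each (d_out, d_in) projection gets A (d_in × r) + B (d_out × r)
    return n_layers * sum(r * (do + di) for do, di in shapes)

attn = (d, d)                                      # q/k/v/o projections
mlp = [(mlp_dim, d), (mlp_dim, d), (d, mlp_dim)]   # gate, up, down

strategies = {
    "Q only":    lora_params([attn]),
    "Q, V":      lora_params([attn] * 2),
    "Q, K, V":   lora_params([attn] * 3),
    "all + MLP": lora_params([attn] * 4 + mlp),
}
for name, n in strategies.items():
    print(f"{name:10s} {n / 1e6:.1f}M trainable params")
```

Adapting everything including the MLP roughly quintuples the count versus Q, V alone, which is the cost side of the quality numbers above.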
SECTION 04
QLoRA
QLoRA (Quantization + LoRA) is a breakthrough: fine-tune 70B models on a single GPU with 4-bit quantization.
The Technique
- Quantize base model to 4-bit NF4: 70B of float16 weights (~140GB) → ~35GB (4× compression)
- Add LoRA adapters in float32: ~1M params = ~4MB
- Train using gradient checkpointing: Save memory during backprop
- Result: a 70B model fine-tunes on a single 48GB GPU; models up to ~13B fit an RTX 4090 (24GB)
NF4 Quantization
Instead of 16- or 32-bit floats (2-4 bytes per weight), store each weight in 4 bits:
# NF4 (Normalized Float 4-bit): Custom quantization for LLMs
# - 16 possible values per weight (2^4)
# - Optimized for LLM weight distributions (not uniform)
# - Empirically, minimal quality loss vs full precision
# Memory savings:
# float32: 70B × 4 bytes = 280GB; float16: 140GB
# NF4: 70B × 0.5 bytes = 35GB
# (double quantization of the per-block scales saves a few more GB)
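A simplified sketch of blockwise 4-bit quantization in NumPy. Real NF4 uses 16 non-uniform levels tuned to normally distributed weights; this toy version uses 16 uniform levels, but it shows the blockwise absmax-scaling idea:

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Blockwise absmax quantization to 16 uniform levels (toy stand-in for NF4)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one float scale per block
    # map each weight to the nearest of 16 levels spanning [-1, 1]
    q = np.round((blocks / scales) * 7.5 + 7.5).astype(np.uint8)   # codes 0..15
    return q, scales

def dequantize_4bit(q, scales):
    return ((q.astype(np.float32) - 7.5) / 7.5) * scales

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)    # LLM-like weight stats
q, scales = quantize_4bit(w)
w_hat = dequantize_4bit(q, scales).reshape(-1)

print(q.max())                       # ≤ 15: each code fits in 4 bits
print(np.abs(w - w_hat).max())       # small per-weight reconstruction error
```

Storing one scale per 64-weight block is what "double quantization" then compresses further, by quantizing the scales themselves.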
QLoRA Recipe
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
# Quantize to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # Double quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# Add LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
# Now fine-tune normally
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(...)
)
trainer.train()
Memory Usage: Before & After
- Full 70B fine-tuning: 560GB+ (impossible on consumer hardware)
- LoRA on an fp16 base: ~150GB+ (the frozen weights alone are 140GB; still multi-GPU)
- QLoRA: ~40GB for 70B (a single 48GB GPU); 7B-13B models fit an RTX 4090 (24GB)
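The back-of-envelope math behind these figures, counting weights and training state only (activations and CUDA overhead add several more GB in practice):

```python
# Rough training-memory budgets for a 70B model (GB = 1e9 bytes)
params = 70e9

fp16_weights = params * 2 / 1e9      # 140 GB
grads_fp16 = params * 2 / 1e9        # 140 GB
adam_states = params * 4 / 1e9       # ~280 GB (rough accounting; fp32 moments would double this)
full_ft = fp16_weights + grads_fp16 + adam_states

nf4_weights = params * 0.5 / 1e9     # 35 GB at 4 bits per weight
lora_adapter = 21e6 * 4 * 3 / 1e9    # ~0.25 GB: fp32 adapter params + grads + optimizer states

print(full_ft)                        # 560.0 GB: full fine-tuning
print(fp16_weights + lora_adapter)    # ~140 GB: LoRA on an fp16 base
print(nf4_weights + lora_adapter)     # ~35 GB: QLoRA
```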
QLoRA Impact: Democratized fine-tuning. Before: $10K+ cloud cost. After: $2-5K to buy a GPU. Enabled thousands of custom models.
SECTION 05
Training with PEFT + TRL
Complete end-to-end example using Hugging Face libraries.
Setup: Load Model + Add LoRA
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # Rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Adapt Q, V in attention
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (approx.): trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12
# Only ~0.12% trainable!
Training with SFTTrainer
from trl import SFTTrainer, SFTConfig
# Load your dataset
dataset = load_dataset("your-dataset", split="train")
# Training config
training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_8bit",
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=512
)
# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    formatting_func=format_dataset,  # Your format function
    packing=True                     # Pack multiple examples per sequence
)
trainer.train()
Inference with LoRA
from peft import AutoPeftModelForCausalLM
# Load fine-tuned model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Generate (move inputs to the model's device)
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
Training Best Practices: Use gradient_accumulation_steps to simulate larger batch sizes on small GPUs, and packing=True to fit more data per step. For example, per_device_train_batch_size=4 with gradient_accumulation_steps=4 gives an effective batch size of 16.
SECTION 06
LoRA Variants
Since the original 2021 LoRA paper, variants have improved upon the technique.
| Variant | Key Innovation | Trade-off | When to Use |
|---------|----------------|-----------|-------------|
| LoRA (original) | BA^T decomposition | Baseline | Start here; works for 95% of cases |
| LoRA+ | Different learning rates for A vs B | Slightly better quality, tuning overhead | High-quality fine-tuning, extra care needed |
| AdaLoRA | Adaptive rank: learns importance of each rank dimension | Automatically prunes low-value params | Parameter-constrained scenarios |
| DoRA | Decomposes update into magnitude + direction | Better stability, slightly more params | When training is unstable |
| LoftQ | QLoRA with better initialization from quantization | Better-quality QLoRA, no extra cost | 4-bit quantization use cases |
AdaLoRA: Adaptive Rank
Instead of fixed rank r, learn which dimensions matter:
from peft import AdaLoraConfig, get_peft_model
config = AdaLoraConfig(
    init_r=16,      # Start with rank 16
    target_r=8,     # Prune toward rank 8 during training
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1
)
model = get_peft_model(model, config)
# Training will automatically prune unimportant rank dimensions
# Result: fewer params with comparable quality
DoRA: Direction + Magnitude
from peft import LoraConfig
# DoRA reparameterizes the adapted weight as magnitude × direction:
# W' = m * (W0 + B @ A^T) / ||W0 + B @ A^T||  (per-column norms, m trainable)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # Enable DoRA
    lora_dropout=0.1
)
model = get_peft_model(model, config)
# Then train as usual; often more stable, especially with large learning rates
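The magnitude/direction split can be sketched numerically. A NumPy sketch of the DoRA reparameterization with hypothetical toy sizes, showing that the trained magnitude vector fully controls each column's norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 24, 4                       # toy dimensions and rank

W0 = rng.standard_normal((d, k))          # frozen pre-trained weight
B = rng.standard_normal((d, r)) * 0.01    # trained LoRA factors
A = rng.standard_normal((k, r)) * 0.01
m = np.linalg.norm(W0, axis=0)            # trainable magnitudes, init = column norms of W0

V = W0 + B @ A.T                          # direction component (base + low-rank update)
W_dora = m * V / np.linalg.norm(V, axis=0)  # normalize columns, rescale by m

# Each column of W_dora has norm m[j], regardless of the LoRA update's scale;
# the LoRA factors only steer direction, which is what stabilizes training.
assert np.allclose(np.linalg.norm(W_dora, axis=0), m)
```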
When to Use Variants: Vanilla LoRA is rock-solid. Use AdaLoRA if you want automatic rank optimization. Use DoRA if training is unstable. Most projects: stick with original LoRA.
SECTION 07
Merging & Deployment
After training, you have two choices: deploy LoRA adapters separately, or merge into base model.
Option 1: Merge and Unload
Fuse LoRA weights back into base model. Single file, standard inference:
from peft import AutoPeftModelForCausalLM
# Load LoRA model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto"
)
# Merge LoRA into base weights
merged_model = model.merge_and_unload()
# Save as standard HF model
merged_model.save_pretrained("./merged-model")
merged_model.push_to_hub("username/my-finetuned-llama")
# Now standard inference (no PEFT needed!)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./merged-model")
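Under the hood, merging just folds the scaled low-rank product into the base weight, so inference needs no extra matmul. A NumPy sketch with toy sizes, using PEFT's scaling = lora_alpha / r:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8
alpha = 16
scaling = alpha / r                       # PEFT scales the update by lora_alpha / r

W = rng.standard_normal((d, k))           # frozen base weight
B = rng.standard_normal((d, r)) * 0.01    # trained LoRA factors
A = rng.standard_normal((k, r)) * 0.01
x = rng.standard_normal(k)

# Adapter kept separate: one extra low-rank matmul per forward pass
out_separate = W @ x + scaling * (B @ (A.T @ x))

# Merged: a single standard weight matrix, no adapter code at inference
W_merged = W + scaling * (B @ A.T)
out_merged = W_merged @ x

assert np.allclose(out_separate, out_merged)
```

Merging is lossless in exact arithmetic; the only cost is that the adapter can no longer be swapped out without reloading the base weights.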
Option 2: Keep LoRA Separate
Store base model + small LoRA adapters. Useful for multiple task-specific adapters:
# Inference with separate LoRA
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output",  # adapter_config.json + adapter weights; base model loaded from path in config
    device_map="auto"
)
# Generate with LoRA
outputs = model.generate(...)
# Can also switch between adapters at runtime
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./adapter-task1", adapter_name="task1")
model.load_adapter("./adapter-task2", adapter_name="task2")
model.set_adapter("task2")  # Switch active adapter; one base model stays in memory
Multi-LoRA Serving
Serve multiple task-specific adapters from one base model:
# vLLM + LoRA support
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_lora=True,
    max_lora_rank=16,
    max_loras=4  # adapters resident in GPU memory at once
)
# Route requests by adapter: pass a LoRARequest(name, id, path) per call
response = llm.generate(
    "prompt",
    SamplingParams(max_tokens=100),
    lora_request=LoRARequest("task1", 1, "./adapter-task1")
)
Comparison: Merge vs Separate
- Merge: Single file, standard inference speed (no extra adapter matmul), no PEFT dependency
- Separate: Smaller files (~40MB per adapter vs ~14GB base), multi-adapter support, slight per-token overhead from the adapter matmul, requires PEFT at inference
Deployment Choice: For a single task → merge (simpler). For multiple tasks or rapid iteration → keep separate. Merging is one line of code and removes the PEFT dependency.