SECTION 01
The Fine-tuning Paradox
Modern LLMs are massive. A 70B-parameter model requires ~140GB of VRAM just to store its weights in float16. Training adds gradients (~140GB) and Adam optimizer states (~280GB). Total: ~560GB. Only multi-GPU setups such as 8× H100 (80GB each, 640GB total) or TPU v4 pods can handle it.
The Problem
- GPU requirements: 8× H100 = ~$160K upfront, or $12K+/month on cloud
- Access: Most researchers/companies can't afford this
- Iteration speed: 8-GPU training takes hours. Slow feedback loops.
- Parameter efficiency: Full fine-tuning updates ALL 70B params. Overkill for most tasks.
Key Insight Most of a pre-trained model's knowledge can stay frozen. You're fine-tuning to adapt to a specific task (e.g., "answer like a doctor" or "write like Shakespeare"). You don't need to update all parameters, just a small adapter.
The LoRA Solution Instead of updating W directly, train two small matrices B and A such that ΔW = BA^T. For a d × k weight matrix, B is (d × r) and A is (k × r); with r = 8, each adapted matrix adds only r × (d + k) trainable parameters. Across a 70B model this comes to tens of millions of trainable parameters, well under 0.1% of the total, and the gradient/optimizer memory shrinks from hundreds of GB to well under 1GB.
Game Changer: LoRA brought fine-tuning from big-lab-exclusive to consumer-GPU territory. A 7B-13B model can now be fine-tuned on a single RTX 4090 (24GB) instead of a multi-H100 node.
SECTION 02
LoRA Math
The elegance of LoRA is simple linear algebra. No architectural changes—just parameter efficiency.
Low-Rank Decomposition
Instead of updating weight matrix W (d × k dimensions) directly, decompose the update:
ΔW = BA^T
Where:
- B ∈ R^(d × r) [d = output dim, r = rank (small, e.g., 8)]
- A ∈ R^(k × r) [k = input dim]
- Rank r << min(d, k)
During forward pass:
output = W @ x + (B @ A^T) @ x
= W @ x + B @ (A^T @ x)
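This equivalence is easy to verify. A minimal NumPy sketch with toy dimensions (64, 48, and rank 8 are arbitrary choices for illustration), also showing why LoRA initializes B to zero so the adapter starts as a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8                  # output dim, input dim, rank (toy sizes)

W = rng.standard_normal((d, k))      # frozen pre-trained weight
B = np.zeros((d, r))                 # LoRA B: initialized to zero
A = rng.standard_normal((k, r))      # LoRA A: random init
x = rng.standard_normal(k)

# At init B = 0, so the adapter contributes nothing: output == W @ x
assert np.allclose(W @ x + B @ (A.T @ x), W @ x)

# After training fills B, both formulations agree:
B = rng.standard_normal((d, r))
out_factored = W @ x + B @ (A.T @ x)   # never materializes the d×k ΔW
out_merged = (W + B @ A.T) @ x         # explicit ΔW = BA^T
assert np.allclose(out_factored, out_merged)
```

The factored form is what runs during training: two thin matmuls instead of storing a full d × k update.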
Parameter Count Comparison
Example: Llama 70B, adapting the attention projections:
- Full fine-tuning: ~70B parameters to train
- LoRA with rank=8 on q_proj and v_proj: each adapted d × k matrix adds r × (d + k) parameters. For 70B-scale dimensions (d = k = 8192, 80 layers), that's roughly 21M trainable params
- Savings: 70B / 21M ≈ 3,300× fewer trainable parameters!
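The arithmetic behind these counts, assuming 70B-class dimensions (hidden dim 8192, 80 layers, q_proj and v_proj adapted); exact totals depend on the model's dimensions and which projections you adapt:

```python
# LoRA trainable-parameter arithmetic (assumed 70B-class sizes)
d = k = 8192        # hidden dimension
r = 8               # LoRA rank
n_layers = 80       # transformer layers
n_proj = 2          # q_proj and v_proj per layer

per_proj = r * (d + k)                 # A (k × r) plus B (d × r)
total = per_proj * n_proj * n_layers
print(f"{per_proj:,} params per adapted projection")   # 131,072
print(f"{total / 1e6:.1f}M trainable params")          # 21.0M
print(f"{70e9 / total:,.0f}x fewer than full FT")      # 3,338x
```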
Why Low-Rank?
The paper hypothesizes that weight changes during adaptation are low-rank—the model's behavior shifts lie in a low-dimensional subspace. Empirically, rank 8–16 captures 90%+ of the benefit of full fine-tuning.
# Intuition: Why low-rank works
# Pre-trained model: "Knows" language fundamentals
# Fine-tuning task: "Adjust tone from neutral to sarcastic"
# This adjustment is a simple projection: rotate some dimensions
# Doesn't require rewriting all 70B params
# LoRA captures this rotation with ~1M params instead of 70B
Rank Selection
- r=8: Fast training, <1M params, usually sufficient
- r=16: Slightly better quality, ~2M params, still fast
- r=32: Diminishing returns, ~4M params, slower
- r=64: Rarely needed, approaches full fine-tuning
Practical Rule: Start with r=8. If quality isn't good enough, try r=16. Higher ranks are rarely needed; the LoRA paper found that very low ranks often match much higher ones on its benchmarks.
SECTION 03
Which Layers to Adapt
You don't need to add LoRA to every layer. Strategic layer selection saves computation and improves results.
Standard Strategy: Attention Layers Only
- Query/Key/Value projections: These control attention patterns. Most task-specific adaptation happens here.
- Output projection: Optional, sometimes helps but adds overhead
- MLP layers: Often frozen; they're task-agnostic
# Typical LoRA config: Adapt only Q, K, V in attention
# Example: 32-layer model, hidden dim 4096 (Llama-7B-like), r = 8
# Per projection: A (4096 × 8) + B (4096 × 8) = ~65K params
# Per layer: 3 projections (Q, K, V) × 65K ≈ 197K params
# Total: 32 layers × 197K ≈ 6.3M trainable params
# vs full fine-tuning of the 7B base: ~1,100× fewer trainable parameters!
When to Adapt More Layers
- Complex domain adaptation: Fine-tuning to specialized domain (legal, medical). Adapt Q, K, V, output projection, maybe MLP
- Language generation: Generate in new style (poetry, technical writing). Attention layers usually sufficient
- Classification: Categorize domain-specific content. Q, K, V usually enough
Empirical Results
Ablation studies (in the LoRA paper and follow-ups) suggest roughly:
- LoRA on Q only: ~70% of full fine-tuning quality
- LoRA on Q, V: ~90% of quality
- LoRA on Q, K, V: ~95% of quality
- LoRA on all (Q, K, V, output, MLP): ~99% of quality, but 10× more params
Configuration Recommendation: For most tasks, adapt Q and V. If quality isn't enough, add K. Rarely need output or MLP. This gets you ~90-95% of full fine-tuning quality with a fraction of the params.
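The trade-off behind this recommendation can be made concrete with a quick count, assuming illustrative Llama-7B-like sizes (hidden dim 4096, FFN width 11008, 32 layers, rank 8):

```python
# Trainable params per target-module strategy (assumed 7B-like sizes)
d, r, n_layers = 4096, 8, 32
mlp_dim = 11008                      # Llama-7B FFN width

def lora_params(shapes):
    # each (d_out, d_in) projection gets A (d_in × r) + B (d_out × r)
    return n_layers * sum(r * (do + di) for do, di in shapes)

attn = (d, d)                                      # q/k/v/o projections
mlp = [(mlp_dim, d), (mlp_dim, d), (d, mlp_dim)]   # gate, up, down

strategies = {
    "Q only":    lora_params([attn]),
    "Q, V":      lora_params([attn] * 2),
    "Q, K, V":   lora_params([attn] * 3),
    "all + MLP": lora_params([attn] * 4 + mlp),
}
for name, n in strategies.items():
    print(f"{name:10s} {n / 1e6:.1f}M trainable params")
```

Adapting everything including the MLP roughly quintuples the count versus Q, V alone, which is the cost side of the quality numbers above.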
SECTION 04
QLoRA
QLoRA (Quantization + LoRA) is a breakthrough: fine-tune 70B models on a single GPU with 4-bit quantization.
The Technique
- Quantize base model to 4-bit NF4: 70B of float16 weights (~140GB) → ~35GB (4× compression)
- Add LoRA adapters in float32: ~1M params = ~4MB
- Train using gradient checkpointing: Save memory during backprop
- Result: a 70B model fine-tunes on a single 48GB GPU; models up to ~13B fit an RTX 4090 (24GB)
NF4 Quantization
Instead of 16- or 32-bit floats (2-4 bytes per weight), store each weight in 4 bits:
# NF4 (Normalized Float 4-bit): Custom quantization for LLMs
# - 16 possible values per weight (2^4)
# - Optimized for LLM weight distributions (not uniform)
# - Empirically, minimal quality loss vs full precision
# Memory savings:
# float32: 70B × 4 bytes = 280GB; float16: 140GB
# NF4: 70B × 0.5 bytes = 35GB
# (double quantization of the per-block scales saves a few more GB)
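A simplified sketch of blockwise 4-bit quantization in NumPy. Real NF4 uses 16 non-uniform levels tuned to normally distributed weights; this toy version uses 16 uniform levels, but it shows the blockwise absmax-scaling idea:

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Blockwise absmax quantization to 16 uniform levels (toy stand-in for NF4)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one float scale per block
    # map each weight to the nearest of 16 levels spanning [-1, 1]
    q = np.round((blocks / scales) * 7.5 + 7.5).astype(np.uint8)   # codes 0..15
    return q, scales

def dequantize_4bit(q, scales):
    return ((q.astype(np.float32) - 7.5) / 7.5) * scales

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)    # LLM-like weight stats
q, scales = quantize_4bit(w)
w_hat = dequantize_4bit(q, scales).reshape(-1)

print(q.max())                       # ≤ 15: each code fits in 4 bits
print(np.abs(w - w_hat).max())       # small per-weight reconstruction error
```

Storing one scale per 64-weight block is what "double quantization" then compresses further, by quantizing the scales themselves.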
QLoRA Recipe
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
# Quantize to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # Double quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# Add LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
# Now fine-tune normally
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(...)
)
trainer.train()
Memory Usage: Before & After
- Full 70B fine-tuning: 560GB+ (impossible on consumer hardware)
- LoRA on an fp16 base: ~150GB+ (the frozen weights alone are 140GB; still multi-GPU)
- QLoRA: ~40GB for 70B (a single 48GB GPU); 7B-13B models fit an RTX 4090 (24GB)
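The back-of-envelope math behind these figures, counting weights and training state only (activations and CUDA overhead add several more GB in practice):

```python
# Rough training-memory budgets for a 70B model (GB = 1e9 bytes)
params = 70e9

fp16_weights = params * 2 / 1e9      # 140 GB
grads_fp16 = params * 2 / 1e9        # 140 GB
adam_states = params * 4 / 1e9       # ~280 GB (rough accounting; fp32 moments would double this)
full_ft = fp16_weights + grads_fp16 + adam_states

nf4_weights = params * 0.5 / 1e9     # 35 GB at 4 bits per weight
lora_adapter = 21e6 * 4 * 3 / 1e9    # ~0.25 GB: fp32 adapter params + grads + optimizer states

print(full_ft)                        # 560.0 GB: full fine-tuning
print(fp16_weights + lora_adapter)    # ~140 GB: LoRA on an fp16 base
print(nf4_weights + lora_adapter)     # ~35 GB: QLoRA
```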
QLoRA Impact: Democratized fine-tuning. Before: $10K+ cloud cost. After: $2-5K to buy a GPU. Enabled thousands of custom models.
SECTION 05
Training with PEFT + TRL
Complete end-to-end example using Hugging Face libraries.
Setup: Load Model + Add LoRA
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # Rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Adapt Q, V in attention
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (approx.): trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12
# Only ~0.12% trainable!
Training with SFTTrainer
from trl import SFTTrainer, SFTConfig
# Load your dataset
dataset = load_dataset("your-dataset", split="train")
# Training config
training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_8bit",
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=512
)
# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    formatting_func=format_dataset,  # Your format function
    packing=True                     # Pack multiple examples per sequence
)
trainer.train()
Inference with LoRA
from peft import AutoPeftModelForCausalLM
# Load fine-tuned model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Generate (move inputs to the model's device)
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
Training Best Practices: Use gradient_accumulation_steps to simulate larger batch sizes on small GPUs, and packing=True to fit more data per step. For example, per_device_train_batch_size=4 with gradient_accumulation_steps=4 gives an effective batch size of 16.
SECTION 06
LoRA Variants
Since the original 2021 LoRA paper, variants have improved upon the technique.
| Variant | Key Innovation | Trade-off | When to Use |
|---------|----------------|-----------|-------------|
| LoRA (original) | BA^T decomposition | Baseline | Start here; works for 95% of cases |
| LoRA+ | Different learning rates for A vs B | Slightly better quality, tuning overhead | High-quality fine-tuning, extra care needed |
| AdaLoRA | Adaptive rank: learns importance of each rank dimension | Automatically prunes low-value params | Parameter-constrained scenarios |
| DoRA | Decomposes update into magnitude + direction | Better stability, slightly more params | When training is unstable |
| LoftQ | QLoRA with better initialization from quantization | Better-quality QLoRA, no extra cost | 4-bit quantization use cases |
AdaLoRA: Adaptive Rank
Instead of fixed rank r, learn which dimensions matter:
from peft import AdaLoraConfig, get_peft_model
config = AdaLoraConfig(
    init_r=16,      # Start with rank 16
    target_r=8,     # Prune toward rank 8 during training
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1
)
model = get_peft_model(model, config)
# Training will automatically prune unimportant rank dimensions
# Result: fewer params with comparable quality
DoRA: Direction + Magnitude
from peft import LoraConfig
# DoRA reparameterizes the adapted weight as magnitude × direction:
# W' = m * (W0 + B @ A^T) / ||W0 + B @ A^T||  (per-column norms, m trainable)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # Enable DoRA
    lora_dropout=0.1
)
model = get_peft_model(model, config)
# Then train as usual; often more stable, especially with large learning rates
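The magnitude/direction split can be sketched numerically. A NumPy sketch of the DoRA reparameterization with hypothetical toy sizes, showing that the trained magnitude vector fully controls each column's norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 24, 4                       # toy dimensions and rank

W0 = rng.standard_normal((d, k))          # frozen pre-trained weight
B = rng.standard_normal((d, r)) * 0.01    # trained LoRA factors
A = rng.standard_normal((k, r)) * 0.01
m = np.linalg.norm(W0, axis=0)            # trainable magnitudes, init = column norms of W0

V = W0 + B @ A.T                          # direction component (base + low-rank update)
W_dora = m * V / np.linalg.norm(V, axis=0)  # normalize columns, rescale by m

# Each column of W_dora has norm m[j], regardless of the LoRA update's scale;
# the LoRA factors only steer direction, which is what stabilizes training.
assert np.allclose(np.linalg.norm(W_dora, axis=0), m)
```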
When to Use Variants: Vanilla LoRA is rock-solid. Use AdaLoRA if you want automatic rank optimization. Use DoRA if training is unstable. Most projects: stick with original LoRA.
SECTION 07
Merging & Deployment
After training, you have two choices: deploy LoRA adapters separately, or merge into base model.
Option 1: Merge and Unload
Fuse LoRA weights back into base model. Single file, standard inference:
from peft import AutoPeftModelForCausalLM
# Load LoRA model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto"
)
# Merge LoRA into base weights
merged_model = model.merge_and_unload()
# Save as standard HF model
merged_model.save_pretrained("./merged-model")
merged_model.push_to_hub("username/my-finetuned-llama")
# Now standard inference (no PEFT needed!)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./merged-model")
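Under the hood, merging just folds the scaled low-rank product into the base weight, so inference needs no extra matmul. A NumPy sketch with toy sizes, using PEFT's scaling = lora_alpha / r:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8
alpha = 16
scaling = alpha / r                       # PEFT scales the update by lora_alpha / r

W = rng.standard_normal((d, k))           # frozen base weight
B = rng.standard_normal((d, r)) * 0.01    # trained LoRA factors
A = rng.standard_normal((k, r)) * 0.01
x = rng.standard_normal(k)

# Adapter kept separate: one extra low-rank matmul per forward pass
out_separate = W @ x + scaling * (B @ (A.T @ x))

# Merged: a single standard weight matrix, no adapter code at inference
W_merged = W + scaling * (B @ A.T)
out_merged = W_merged @ x

assert np.allclose(out_separate, out_merged)
```

Merging is lossless in exact arithmetic; the only cost is that the adapter can no longer be swapped out without reloading the base weights.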
Option 2: Keep LoRA Separate
Store base model + small LoRA adapters. Useful for multiple task-specific adapters:
# Inference with separate LoRA
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output",  # adapter_config.json + adapter weights; base model loaded from path in config
    device_map="auto"
)
# Generate with LoRA
outputs = model.generate(...)
# Can also switch between adapters at runtime
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./adapter-task1", adapter_name="task1")
model.load_adapter("./adapter-task2", adapter_name="task2")
model.set_adapter("task2")  # Switch active adapter; one base model stays in memory
Multi-LoRA Serving
Serve multiple task-specific adapters from one base model:
# vLLM + LoRA support
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_lora=True,
    max_lora_rank=16,
    max_loras=4  # adapters resident in GPU memory at once
)
# Route requests by adapter: pass a LoRARequest(name, id, path) per call
response = llm.generate(
    "prompt",
    SamplingParams(max_tokens=100),
    lora_request=LoRARequest("task1", 1, "./adapter-task1")
)
Comparison: Merge vs Separate
- Merge: Single file, standard inference speed (no extra adapter matmul), no PEFT dependency
- Separate: Smaller files (~40MB per adapter vs ~14GB base), multi-adapter support, slight per-token overhead from the adapter matmul, requires PEFT at inference
Deployment Choice: For a single task → merge (simpler). For multiple tasks or rapid iteration → keep separate. Merging is one line of code and removes the PEFT dependency.