Fine-tuning

LoRA

Low-Rank Adaptation. Fine-tune 70B models on consumer GPUs: train under 1% of the parameters and cut GPU memory roughly 3×.

2021
Paper

<1%
Trainable Params

3×
Memory Reduction

Table of Contents

SECTION 01

The Fine-tuning Paradox

Modern LLMs are massive. A 70B-parameter model requires ~140GB of VRAM just to store its weights in float16. Full fine-tuning adds gradients (~140GB) and optimizer states (~280GB), for a total of ~560GB. Only multi-GPU setups such as 8× H100 (80GB each) or TPU v4 pods can handle it.
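These totals follow from simple per-parameter accounting; a quick sketch of the arithmetic above (assuming 2-byte float16 weights and gradients and ~4 bytes/param of optimizer state):

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough memory budget for full fine-tuning.

    Assumes float16 weights and gradients (2 bytes each) and
    ~4 bytes/param of optimizer state, matching the text above.
    Activations are ignored, so this is a lower bound.
    """
    weights = n_params * 2 / 1e9      # float16 weights
    grads = n_params * 2 / 1e9        # float16 gradients
    optimizer = n_params * 4 / 1e9    # optimizer states (e.g., Adam moments)
    return {
        "weights_gb": weights,
        "grads_gb": grads,
        "optimizer_gb": optimizer,
        "total_gb": weights + grads + optimizer,
    }

budget = full_finetune_memory_gb(70e9)
print(budget["total_gb"])  # 560.0
```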

The Problem

Key Insight: most of a pre-trained model's knowledge can stay frozen. Fine-tuning adapts it to a specific task (e.g., "answer like a doctor" or "write like Shakespeare"). You don't need to update all parameters, just a small adapter.

The LoRA Solution: instead of updating W (billions of parameters), train two small matrices B and A such that ΔW = BA^T. For an illustrative 70K × 70K weight matrix with rank r = 8, B and A are each 70K × 8, so you train only ~1.1M parameters (0.0016% of 70B). Gradient and optimizer memory shrinks proportionally, from hundreds of GB to a few GB; the frozen base weights still need to fit, which QLoRA (SECTION 04) addresses.

Game Changer: LoRA brought fine-tuning from datacenter-exclusive to consumer-GPU territory. Now you can fine-tune on a single RTX 4090 (24GB) instead of 8× H100.
SECTION 02

LoRA Math

The elegance of LoRA is simple linear algebra. No architectural changes—just parameter efficiency.

Low-Rank Decomposition

Instead of updating weight matrix W (d × k dimensions) directly, decompose the update:

ΔW = BA^T

Where:
- B ∈ R^(d × r), d = output dim
- A ∈ R^(k × r), k = input dim
- rank r << min(d, k) (small, e.g., r = 8)

During the forward pass:

output = W @ x + (B @ A^T) @ x
       = W @ x + B @ (A^T @ x)   # compute right-to-left; never materialize the d × k matrix ΔW
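The reparameterization can be checked numerically; a minimal NumPy sketch (dimensions are illustrative, not model-sized):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 8               # output dim, input dim, rank (illustrative)

W = rng.standard_normal((d, k))   # frozen pre-trained weight
B = rng.standard_normal((d, r))   # trainable low-rank factor
A = rng.standard_normal((k, r))   # trainable low-rank factor
x = rng.standard_normal(k)

# Two equivalent forward passes:
full = (W + B @ A.T) @ x          # materialize ΔW = BA^T (d × k)
lora = W @ x + B @ (A.T @ x)      # never form ΔW; only an r-dim intermediate

assert np.allclose(full, lora)

# Trainable params: d*r + k*r for LoRA vs d*k for a full update
print(d * r + k * r, "vs", d * k)  # 768 vs 2048
```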

Parameter Count Comparison

Example: Llama 70B, adapting only the attention projections.
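The comparison can be sketched as a quick calculation. The shapes below are assumptions for illustration (roughly Llama-2-70B-like: 80 layers, hidden dim 8192, square Q/V projections, ignoring grouped-query attention):

```python
def lora_param_count(n_layers: int, d_model: int, r: int, n_modules: int) -> int:
    """Trainable LoRA params: each adapted d×d projection adds A (d×r) + B (d×r)."""
    return n_layers * n_modules * 2 * d_model * r

full = 70e9  # full fine-tuning touches every parameter
lora = lora_param_count(n_layers=80, d_model=8192, r=8, n_modules=2)  # Q and V

print(f"{lora:,} trainable params")              # 20,971,520
print(f"{lora / full:.4%} of full fine-tuning")  # 0.0300%
```

Even with generous shape assumptions, adapting only Q and V trains on the order of tens of millions of parameters instead of 70 billion.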

Why Low-Rank?

The paper hypothesizes that weight changes during adaptation are low-rank—the model's behavior shifts lie in a low-dimensional subspace. Empirically, rank 8–16 captures 90%+ of the benefit of full fine-tuning.

# Intuition: why low-rank works
# Pre-trained model: already "knows" language fundamentals
# Fine-tuning task: e.g., "adjust tone from neutral to sarcastic"
# That adjustment is a simple projection: rotate a few directions
#   in activation space; no need to rewrite all 70B params
# LoRA captures this rotation with ~1M params instead of 70B

Rank Selection

Practical Rule: start with r=8. If quality isn't good enough, try r=16; you rarely need more. The LoRA paper reports that very small ranks capture most of the benefit of full fine-tuning on its benchmarks.
SECTION 03

Which Layers to Adapt

You don't need to add LoRA to every layer. Strategic layer selection saves computation and improves results.

Standard Strategy: Attention Layers Only

# Typical LoRA config: adapt only Q, K, V in attention
# Example: 32-layer model, hidden dim 4096 (7B-scale), rank r = 8
# Per projection: A (4096 × 8) + B (4096 × 8) = 65,536 params
# Per layer: 3 projections (Q, K, V) × 65,536 ≈ 197K params
# Total: 32 layers × 197K ≈ 6.3M trainable params
# vs full fine-tuning of the same 7B model: ~1,100× fewer parameters

When to Adapt More Layers

Empirical Results

Ablation studies in the LoRA paper show that, at a fixed parameter budget, adapting Q and V together beats putting the entire budget into a single projection: spreading a small rank across more weight matrices outperforms a large rank on one.

Configuration Recommendation: for most tasks, adapt Q and V. If quality isn't enough, add K (and the output projection). You rarely need the MLP layers. This typically reaches ~95% of full-adaptation quality with a fraction of the parameters.
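These tiers map directly onto PEFT's `target_modules` setting. A sketch of the three configurations described above (module names assume a Llama-style architecture; adjust for other model families):

```python
from peft import LoraConfig

# Tier 1: Q and V only. The default recommendation for most tasks.
minimal = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

# Tier 2: add K and the output projection if quality falls short.
attention_full = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Tier 3: include MLP layers for knowledge-intensive tasks
# (roughly 2-3x more trainable parameters).
with_mlp = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```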
SECTION 04

QLoRA

QLoRA (Quantization + LoRA) is a breakthrough: fine-tune 70B models on a single GPU with 4-bit quantization.

The Technique

NF4 Quantization

Instead of 16-bit or 32-bit floats per weight, use 4-bit quantization:

# NF4 (NormalFloat 4-bit): a quantization data type designed for LLMs
# - 16 possible values per weight (2^4)
# - Quantile-based: optimized for the roughly normal distribution of
#   LLM weights, not a uniform grid
# - Empirically, minimal quality loss vs 16-bit precision
#
# Memory for 70B weights:
#   float32: 70B × 4 bytes   = 280GB
#   float16: 70B × 2 bytes   = 140GB
#   NF4:     70B × 0.5 bytes = 35GB (double quantization shaves off a few more GB)

QLoRA Recipe

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Quantize the base model to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # double quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

# Now fine-tune normally
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(...),
)
trainer.train()

Memory Usage: Before & After
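The before/after numbers can be reconstructed from bytes-per-parameter. A rough sketch (the 4-bit figure ignores quantization-constant overhead, and activations are excluded; the ~21M trainable-parameter count is an assumption in line with the earlier examples):

```python
def finetune_memory_gb(n_params: float, bytes_per_weight: float,
                       trainable_params: float) -> float:
    """Rough training footprint: frozen base weights, plus fp16 gradients
    and ~4 bytes/param of optimizer state for trainable params only."""
    base = n_params * bytes_per_weight       # frozen (possibly quantized) weights
    train = trainable_params * (2 + 4)       # grads + optimizer states
    return (base + train) / 1e9

full = finetune_memory_gb(70e9, 2, 70e9)     # fp16 full fine-tune
qlora = finetune_memory_gb(70e9, 0.5, 21e6)  # NF4 base + ~21M LoRA params

print(round(full), "GB vs", round(qlora), "GB")  # 560 GB vs 35 GB
```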

QLoRA Impact: Democratized fine-tuning. Before: $10K+ cloud cost. After: $2-5K to buy a GPU. Enabled thousands of custom models.
SECTION 05

Training with PEFT + TRL

Complete end-to-end example using Hugging Face libraries.

Setup: Load Model + Add LoRA

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                                  # rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # adapt Q, V in attention
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 8,388,608 || all params: ~6.7B
# Only ~0.12% trainable!

Training with SFTTrainer

from trl import SFTTrainer, SFTConfig

# Load your dataset
dataset = load_dataset("your-dataset", split="train")

# Training config
training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_8bit",
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=512,
)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    formatting_func=format_dataset,  # your formatting function
    packing=True,                    # pack multiple short examples per sequence
)
trainer.train()

Inference with LoRA

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the fine-tuned model (base + adapter)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("./output/checkpoint-100")

# Generate
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
Training Best Practices: use gradient_accumulation_steps to simulate larger batch sizes on small GPUs (e.g., per_device_train_batch_size=4 × gradient_accumulation_steps=4 → effective batch size 16), and use packing=True to fit more data per step.
SECTION 06

LoRA Variants

Since the original 2021 LoRA paper, variants have improved upon the technique.

| Variant | Key Innovation | Trade-off | When to Use |
|---|---|---|---|
| LoRA (original) | BA^T decomposition | Baseline | Start here; works for 95% of cases |
| LoRA+ | Different learning rates for A vs B | Slightly better quality, tuning overhead | High-quality fine-tuning where extra care is warranted |
| AdaLoRA | Adaptive rank: learns the importance of each rank dimension | Automatically prunes low-value params; extra training logic | Parameter-constrained scenarios |
| DoRA | Decomposes update into magnitude + direction | Better stability, slightly more params | When training is unstable |
| LoftQ | QLoRA with quantization-aware initialization | Better quality at no extra cost | 4-bit quantization use cases |

AdaLoRA: Adaptive Rank

Instead of fixed rank r, learn which dimensions matter:

from peft import AdaLoraConfig, get_peft_model

config = AdaLoraConfig(
    init_r=16,        # start with rank 16
    target_r=8,       # prune toward rank 8 during training
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(model, config)

# Training automatically prunes unimportant rank dimensions
# Result: fewer params at similar quality

DoRA: Direction + Magnitude

from peft import LoraConfig, get_peft_model

# DoRA decomposes the adapted weight into a magnitude vector and a
# low-rank direction update: W' = m * (W + B @ A^T) / ||W + B @ A^T||
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,    # enable DoRA
    lora_dropout=0.1,
)
model = get_peft_model(model, config)

# Train as before; typically more stable, especially at large learning rates
When to Use Variants: Vanilla LoRA is rock-solid. Use AdaLoRA if you want automatic rank optimization. Use DoRA if training is unstable. Most projects: stick with original LoRA.
SECTION 07

Merging & Deployment

After training, you have two choices: deploy LoRA adapters separately, or merge into base model.

Option 1: Merge and Unload

Fuse LoRA weights back into base model. Single file, standard inference:

from peft import AutoPeftModelForCausalLM

# Load the LoRA model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./output/checkpoint-100",
    device_map="auto",
)

# Merge LoRA into the base weights
merged_model = model.merge_and_unload()

# Save as a standard HF model
merged_model.save_pretrained("./merged-model")
merged_model.push_to_hub("username/my-finetuned-llama")

# Now standard inference (no PEFT needed!)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./merged-model")

Option 2: Keep LoRA Separate

Store base model + small LoRA adapters. Useful for multiple task-specific adapters:

# Inference with a separate LoRA adapter
from peft import AutoPeftModelForCausalLM, PeftModel
from transformers import AutoModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "./output",   # contains adapter weights + adapter_config.json
    device_map="auto",
)

# Generate with the LoRA applied
outputs = model.generate(...)

# You can also load different adapters onto the same base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora1 = PeftModel.from_pretrained(base_model, "./adapter-task1")
lora2 = PeftModel.from_pretrained(base_model, "./adapter-task2")
# Use lora1, switch to lora2, etc. Same base model

Multi-LoRA Serving

Serve multiple task-specific adapters from one base model:

# Multi-LoRA serving with vLLM (adapter names and paths are illustrative)
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_lora=True,
    max_lora_rank=16,
)

# Route each request to a task-specific adapter
sampling_params = SamplingParams(max_tokens=100)
response = llm.generate(
    "prompt",
    sampling_params,
    lora_request=LoRARequest("task1", 1, "./adapter-task1"),
)

Comparison: Merge vs Separate

Deployment Choice: for a single task, merge (simpler). For multiple tasks or rapid iteration, keep adapters separate. Merging is a one-line operation and removes the PEFT dependency at inference.
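The storage side of the trade-off is easy to quantify. A sketch assuming float16 checkpoints and a ~21M-parameter adapter (both figures are illustrative assumptions):

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """Disk size of a float16 checkpoint."""
    return n_params * bytes_per_param / 1e9

base = checkpoint_size_gb(70e9)      # full merged model: 140 GB per task
adapter = checkpoint_size_gb(21e6)   # LoRA adapter: ~0.04 GB per task

# Serving N tasks: N merged copies vs one base model + N adapters
n_tasks = 5
merged_total = n_tasks * base
separate_total = base + n_tasks * adapter
print(round(merged_total), "GB vs", round(separate_total, 1), "GB")  # 700 GB vs 140.2 GB
```

Separate adapters pay the base-model cost once, which is why multi-task serving strongly favors keeping them unmerged.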
SECTION 08

LoRA Hyperparameter Guide

LoRA's four key hyperparameters (rank r, alpha α, dropout, and target modules) each have different sensitivity profiles. Rank is the most impactful: r=8 covers most instruction-following and style transfer tasks; r=16 is a safe default for domain adaptation; r=64 or higher is rarely worth the extra memory unless you are injecting genuinely new knowledge. Alpha scales the adapter's contribution: the LoRA output is multiplied by α/r, so alpha acts like a learning-rate multiplier for the adapter. Keep α = 2r as a starting point.
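The α = 2r heuristic keeps the adapter's output scale (α/r) constant as you sweep rank; a small sketch:

```python
def lora_scale(r: int, alpha: int) -> float:
    """LoRA scales its update by alpha / r before adding it to the frozen path:
    h = W @ x + (alpha / r) * B @ (A.T @ x)"""
    return alpha / r

# Following the alpha = 2r rule, the update scale stays fixed across ranks
for r in (8, 16, 64):
    assert lora_scale(r, alpha=2 * r) == 2.0

# Doubling alpha at fixed rank doubles the update's contribution,
# acting like a larger learning rate for the adapter
print(lora_scale(8, 32))  # 4.0
```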

Target modules matter as much as rank. For most transformer models, targeting q_proj and v_proj is sufficient for instruction following. For code generation or reasoning improvements, also add k_proj and o_proj. Adding MLP layers (gate_proj, up_proj, down_proj) yields further gains on knowledge-intensive tasks at the cost of 2–3× more trainable parameters.

Dropout (typically 0.05–0.1) provides regularization but slows convergence slightly. Use it when your fine-tuning dataset is small (< 5,000 examples) or when you observe overfitting on the validation loss curve. For larger datasets, set dropout=0 and rely on early stopping instead. Always run a baseline with the default PEFT settings before sweeping hyperparameters; LoRA is robust enough that defaults work well for 80% of use cases.