Fine-tuning · Concept guide

Parameter-Efficient Fine-Tuning

Adapt large language models by training less than 1% of parameters — LoRA, QLoRA, prefix tuning, and adapters compared.

<1% params trained
4-bit with QLoRA
~90% of full fine-tune quality
4 main methods
Contents
  1. Why PEFT exists
  2. Method comparison
  3. LoRA in depth
  4. QLoRA: 4-bit + LoRA
  5. Choosing your method
  6. Tools & libraries
  7. Production deployment
  8. References
01 — The Problem

Why PEFT Exists

Full fine-tuning requires updating all weights — for a 7B model that's ~28GB of gradients in FP32, plus optimizer states (Adam stores 2 copies), pushing total training memory above 100GB. PEFT methods freeze the base model and train only a small number of additional parameters. The base model weights never change — you train an adapter on top.
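The arithmetic behind those numbers can be checked directly. A back-of-envelope sketch that ignores activations and framework overhead; the 0.2% trainable fraction is an assumed LoRA figure, not a fixed constant:

```python
# Rough training-memory estimates for full fine-tuning vs LoRA.
# Ignores activations, CUDA context, and framework overhead.

def full_ft_memory_gb(n_params: float) -> float:
    bytes_per_param = (
        4      # FP32 weights
        + 4    # FP32 gradients
        + 8    # Adam optimizer: momentum + variance, FP32 each
    )
    return n_params * bytes_per_param / 1e9

def lora_memory_gb(n_params: float, trainable_fraction: float = 0.002) -> float:
    base = n_params * 2 / 1e9  # frozen base model in FP16/BF16
    # Adapter weights, gradients, and Adam states in FP32 (16 bytes/param)
    trained = n_params * trainable_fraction * 16 / 1e9
    return base + trained

print(f"7B full FT: ~{full_ft_memory_gb(7e9):.0f} GB")  # ~112 GB
print(f"7B LoRA:    ~{lora_memory_gb(7e9):.0f} GB")     # ~14 GB
```

The frozen base dominates the LoRA footprint, which is exactly why quantizing it (QLoRA, below) yields such large savings.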

💡 Key insight: A LoRA adapter for a 7B model can be as small as 30MB. You can swap adapters at serve time without reloading the base model — one GPU, many specialized behaviours.
02 — Overview

Method Comparison

| Method | Params trained | Memory overhead | Quality vs full FT | Inference overhead | Best for |
|---|---|---|---|---|---|
| LoRA | <1% | Low | ~90% | None (merged) | General task adaptation |
| QLoRA | <1% | Very low (4-bit base) | ~85–90% | None (merged) | Consumer GPU fine-tuning |
| Prefix tuning | 0.1% | Very low | ~80% | Small (prefix tokens) | Style/tone adaptation |
| Adapter layers | 1–4% | Low–medium | ~90% | Small (extra layers) | Multi-task serving |

The methods trade quality and parameter count against serving flexibility: a merged LoRA adapter adds zero inference overhead, while prefix tuning and adapter layers carry a small runtime cost in exchange for cheaper training or runtime swappability. LoRA dominates for single-task deployment; adapters excel in multi-task scenarios.

03 — Core Technique

LoRA in Depth

LoRA (Low-Rank Adaptation) freezes a pretrained weight W ∈ ℝ^(d×d) and learns the update ΔW as a product of two low-rank matrices: ΔW = B × A, where A ∈ ℝ^(r×d), B ∈ ℝ^(d×r), and r ≪ d. Only A and B are trained: A is initialized randomly, B to zero (so training starts from ΔW = 0), and the update is scaled by α/r. At inference, BA is merged back into the original weight — zero overhead.
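The decomposition and the merge can be verified in a few lines of NumPy (the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 16, 32                # hidden size, LoRA rank, scaling

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trained; initialized small and random
B = np.zeros((d, r))                     # trained; initialized to zero so ΔW starts at 0

# After training B is no longer zero; simulate trained values here
B = rng.standard_normal((d, r)) * 0.01

# Forward pass with the adapter active: base path plus scaled low-rank update
x = rng.standard_normal(d)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Merge for deployment: fold BA into W; identical output, zero extra latency
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, y_adapter)

# Trained parameters: 2*d*r instead of d*d, i.e. 6.25% at d=512
# and well under 1% at transformer-scale widths like d=4096
print(f"trained fraction: {2 * d * r / (d * d):.2%}")
```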

Rank Selection Guide

1. Rank 4–8 — minimal

Style and tone adaptation, simple instruction following. Minimal storage. For lightweight tasks where you're not reshaping the model's reasoning.

Python · LoRA fine-tuning with PEFT + TRL
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import torch

# QLoRA: load base in 4-bit, train LoRA adapters in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Stabilize 4-bit training: casts norm layers to fp32, enables input gradients
model = prepare_model_for_kbit_training(model)

# LoRA configuration: target attention projection matrices
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # rank — higher = more capacity, more params
    lora_alpha=32,    # scaling factor (typically 2×r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,254,540,288 || trainable%: 0.19

# Training data in chat format
train_data = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "Gradient descent is an iterative optimization..."}
    ]}
])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        max_seq_length=512
    ),
    tokenizer=tokenizer
)
trainer.train()
model.save_pretrained("./lora-adapter")  # saves only the adapter (tens of MB), not the base model
2. Rank 16 — default

Default starting point. Good for most task adaptation use cases. Balances expressiveness against storage. Start here unless you have a reason not to.

3. Rank 32–64 — complex tasks

Complex reasoning, domain adaptation, knowledge injection. Gives better quality at modest cost. Use if rank 16 plateaus.

4. Rank 128+ — rare

Rarely needed. Diminishing returns; approaching full fine-tune cost. Only if you've validated that quality improves and memory is unconstrained.

⚠️ Target modules matter as much as rank. Apply LoRA to q_proj and v_proj at minimum. Adding k_proj, o_proj, and the MLP layers consistently improves quality at modest cost.
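Extending coverage to the MLP might look like this (a sketch; the module names follow the Llama/Mistral naming convention and differ on other architectures):

```python
from peft import LoraConfig, TaskType

# LoRA over attention AND MLP projections (Llama/Mistral-style module names)
lora_full_coverage = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
)
```

Roughly triples the adapter size relative to attention-only targeting, but in practice the quality gain usually justifies it.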
04 — Extreme Efficiency

QLoRA: 4-bit + LoRA

QLoRA quantizes the frozen base model to NF4 (4-bit NormalFloat) and trains LoRA adapters on top in BF16. Full fine-tuning of a 65B model requires more than 780GB of GPU memory; with QLoRA the same model fits on a single 48GB GPU. Three key innovations: NF4 quantization, double quantization (quantizing the quantization constants themselves), and paged optimizers that spill optimizer state to CPU RAM during memory spikes.
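The three innovations map onto configuration flags in the transformers + bitsandbytes stack. A minimal sketch (the output directory is illustrative):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # innovation 1: NF4 quantization
    bnb_4bit_use_double_quant=True,         # innovation 2: quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapter math runs in BF16
)

training_args = TrainingArguments(
    output_dir="./qlora-out",
    optim="paged_adamw_8bit",               # innovation 3: paged optimizer states
)
```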

Memory Comparison

| Model | Full FT (FP32) | LoRA (BF16) | QLoRA (NF4) |
|---|---|---|---|
| 7B | 112 GB | ~28 GB | ~6 GB |
| 13B | 208 GB | ~52 GB | ~10 GB |
| 70B | 1,120 GB | ~280 GB | ~48 GB |

QLoRA trades a small loss in convergence speed and final quality (~5%) for the ability to fine-tune billion-parameter models on consumer hardware. It's the most practical PEFT method for researchers without enterprise budgets.

💡 When to use QLoRA: You have a 13B+ model and a single GPU under 48GB VRAM. Quality loss is minimal; feasibility gain is massive.
Python · Merge LoRA adapter into base model for deployment
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path   = "./lora-adapter"

# Load base model in full precision for merging
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="cpu"   # merge on CPU to avoid OOM on smaller GPUs
)

# Load PEFT model (base + adapter)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge adapter weights into base model
print("Merging adapter into base model...")
merged_model = peft_model.merge_and_unload()
print("Merge complete.")

# Save merged model — now a standalone model without PEFT dependency
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer.save_pretrained("./merged-model")
print("Saved merged model to ./merged-model")

# The merged model:
# - Has ZERO runtime overhead vs base model
# - Can be served with any standard inference framework
# - Is indistinguishable from a full fine-tuned model
05 — Selection

Choosing Your Method

💻 Consumer GPU (< 24GB)

  • QLoRA — only practical option for large models
  • Enables fine-tuning 13B+ on a 3090/4090
  • Minimal quality penalty

🚀 Production Serving

  • LoRA merged — zero inference overhead after merge
  • Clearest performance characteristics
  • Simplest deployment pipeline

🔀 Multi-Task (One Base, Many Tasks)

  • Adapter layers — swap at runtime without reloading
  • Each adapter is orthogonal
  • Modest runtime overhead

✏️ Style/Tone Only

  • Prefix tuning — fewest params, fastest to train
  • Good for instruction following tweaks
  • Skip if you need knowledge injection

Decision Tree

Start with LoRA rank 16. If you hit GPU memory limits, use QLoRA. If you need to serve multiple specialized models simultaneously without reloading, use adapter layers. For cosmetic changes only (style, tone, instruction format), try prefix tuning first.
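The decision tree above can be encoded as a small helper. A rough heuristic under an assumed memory model (BF16 base weights plus 25% headroom), not a hard rule; the function name and thresholds are illustrative:

```python
def choose_peft_method(gpu_vram_gb: float, model_params_b: float,
                       multi_task: bool = False, style_only: bool = False) -> str:
    """Pick a PEFT method from GPU VRAM, model size, and serving needs."""
    if style_only:
        return "prefix tuning"          # cosmetic changes: fewest params
    if multi_task:
        return "adapter layers"         # swap at runtime without reloading
    # Approximate LoRA footprint: BF16 base weights + 25% activation headroom
    lora_needs_gb = model_params_b * 2 * 1.25
    if lora_needs_gb > gpu_vram_gb:
        return "QLoRA"                  # base won't fit at 16-bit: quantize it
    return "LoRA (rank 16)"             # the default starting point

print(choose_peft_method(24, 7))   # 7B BF16 base (~17.5 GB) fits on 24 GB
print(choose_peft_method(24, 13))  # 13B BF16 base (~32.5 GB) overflows
```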

06 — Tooling

Tools & Libraries

HuggingFace PEFT
Core library
Canonical implementation. Supports all methods, integrates with Trainer.
Unsloth
Speed
2× faster LoRA training. Optimized CUDA kernels. Free tier available.
Axolotl
Config-driven
YAML-based fine-tuning orchestration. Supports QLoRA, multi-GPU distributed training.
LLaMA-Factory
Web UI + CLI
Web dashboard + command-line interface. Broad model support, includes quantization.
TRL (HuggingFace)
Advanced
SFT, DPO, PPO trainers built on PEFT. Full reinforcement learning pipeline.
DeepSpeed
Scale
Distributed training framework. PEFT-compatible for multi-GPU fine-tuning.
07 — Production

Production Deployment of PEFT Models

PEFT adapters are small — a LoRA adapter for a 7B model might be 40–80MB while the base model is 14GB. This creates a deployment opportunity: host one base model and hot-swap adapters per tenant or task. The PEFT library and vLLM both support multi-LoRA serving with dynamic adapter loading.
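As a sketch of the hot-swap pattern with vLLM's OpenAI-compatible server (the adapter names and paths here are hypothetical):

```shell
# Serve one base model with multiple named LoRA adapters
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --enable-lora \
    --lora-modules support-bot=/adapters/support-bot legal-bot=/adapters/legal-bot

# Route a request to a specific adapter by using its name as the model id
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "support-bot", "prompt": "Hello", "max_tokens": 32}'
```

Each request pays only the adapter's small compute overhead; the 14GB base model is loaded once.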

Key considerations: adapter compatibility (adapters are version-locked to the base model checkpoint), merging vs dynamic loading (merge for single-purpose deployment to eliminate serving overhead, dynamic loading for multi-task scenarios), and adapter versioning (treat adapters like model artifacts — version control, evaluation gates before promotion).

| Method | Trainable Params | Memory Overhead | Quality vs Full FT | Best For |
|---|---|---|---|---|
| LoRA | 0.1–1% | ~40–80MB adapter | ~95% | Domain adaptation, style |
| QLoRA | 0.1–1% | ~40MB + 4-bit base | ~92% | Fine-tuning on consumer GPU |
| Prefix Tuning | <0.1% | Minimal | ~85% | Prompt-style tasks |
| IA³ | <0.05% | Minimal | ~88% | Few-shot task specialization |
| Full Fine-Tuning | 100% | Full model copy | 100% (ceiling) | Maximum quality, large budget |
08 — Further Reading

References

Academic Papers
Documentation & Guides
Practitioner Writing