Adapt large language models by training only a small fraction of their parameters — LoRA, QLoRA, prefix tuning, and adapters compared.
Full fine-tuning requires updating all weights: for a 7B model in FP32 that's ~28GB of weights, another ~28GB of gradients, and ~56GB of optimizer state (Adam keeps two FP32 moment tensors per parameter), pushing total training memory above 110GB before activations. PEFT (parameter-efficient fine-tuning) methods freeze the base model and train only a small number of additional parameters. The base model weights never change — you train an adapter on top.
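The arithmetic behind those numbers is easy to sketch. A back-of-envelope estimate (ignoring activations and framework overhead, and assuming pure FP32 training with Adam's two moment tensors):

```python
# Rough training-memory estimate: weights + gradients + Adam state, all FP32.
def full_ft_memory_gb(n_params: float) -> float:
    bytes_per_param = 4 + 4 + 8   # weights + grads + Adam m and v
    return n_params * bytes_per_param / 1e9

# With PEFT, only the trainable fraction carries gradients and optimizer state;
# the frozen base contributes weights only.
def peft_memory_gb(n_params: float, trainable_frac: float) -> float:
    frozen = n_params * 4
    trainable = n_params * trainable_frac * (4 + 4 + 8)
    return (frozen + trainable) / 1e9

print(f"full FT, 7B: {full_ft_memory_gb(7e9):.0f} GB")        # ~112 GB
print(f"PEFT, 7B at 0.2% trainable: {peft_memory_gb(7e9, 0.002):.0f} GB")  # ~28 GB
```

In practice PEFT setups also quantize or cast the frozen base (see QLoRA below), which shrinks the dominant frozen-weights term further.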
| Method | Params trained | Memory overhead | Quality vs full FT | Inference overhead | Best for |
|---|---|---|---|---|---|
| LoRA | <1% | Low | ~90% | None (merged) | General task adaptation |
| QLoRA | <1% | Very low (4-bit base) | ~85–90% | None (merged) | Consumer GPU fine-tuning |
| Prefix tuning | 0.1% | Very low | ~80% | Small (prefix tokens) | Style/tone adaptation |
| Adapter layers | 1–4% | Low–medium | ~90% | Small (extra layers) | Multi-task serving |
PEFT methods differ in how much full fine-tuning quality they recover, and they trade parameter efficiency against serving flexibility. LoRA dominates for single-task deployment because its merged adapter adds no inference overhead; adapter layers excel in multi-task scenarios where you swap specializations at serve time.
LoRA (Low-Rank Adaptation) decomposes the weight update ΔW into two low-rank matrices: ΔW = B × A, where A ∈ ℝ^(r×d), B ∈ ℝ^(d×r), and the rank r ≪ d. Only A and B are trained, and the update is scaled by α/r (the lora_alpha hyperparameter). At inference, BA is merged back into the original weight (W′ = W + BA) — zero overhead.
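The merge identity is easy to verify numerically. A toy NumPy sketch (dimensions chosen for illustration; real LoRA initializes B to zero so ΔW starts at zero, here B is random to stand in for a trained adapter):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16            # toy sizes; r << d is the point of LoRA

W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d))        # trained low-rank factor
B = rng.normal(size=(d, r))        # trained factor (zero-init in real LoRA)
x = rng.normal(size=d)

# During training, the adapter path runs alongside the frozen weight:
h_train = W @ x + (alpha / r) * (B @ (A @ x))

# At inference, fold the update into W once; no extra matmuls remain:
W_merged = W + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_train, h_merged))  # True
```

Note the parameter saving: B and A together hold 2rd values versus d² for a full update, here 1,024 versus 4,096.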
Low ranks (e.g. r = 4–8): style and tone adaptation, simple instruction following. Minimal storage. For lightweight tasks where you're not reshaping the model's reasoning.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import torch

# QLoRA: load base in 4-bit, train LoRA adapters in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA configuration: target attention projection matrices
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — higher = more capacity, more params
    lora_alpha=32,      # scaling factor (typically 2×r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,254,540,288 || trainable%: 0.19

# Training data in chat format
train_data = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "Gradient descent is an iterative optimization..."},
    ]}
])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        max_seq_length=512,
    ),
    tokenizer=tokenizer,
)
trainer.train()
model.save_pretrained("./lora-adapter")  # saves only the ~80MB adapter
```
- r = 16: Default starting point. Good for most task adaptation use cases. Balances expressiveness against storage. Start here unless you have a reason not to.
- Higher ranks (r = 32–64): Complex reasoning, domain adaptation, knowledge injection. Better quality at modest cost. Use if rank 16 plateaus.
- Very high ranks (r ≥ 128): Rarely needed. Diminishing returns; approaching full fine-tune cost. Only if you've validated that quality improves and memory is unconstrained.
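As a sanity check on how rank drives parameter count, the trainable-parameter figure printed by print_trainable_parameters can be reproduced by hand. This sketch assumes Mistral-7B's published shapes (hidden size 4096, grouped-query K/V projections of width 1024, 32 layers) and counts only the four targeted projections:

```python
# LoRA adds A (r x d_in) and B (d_out x r) per targeted matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

hidden, kv_dim, layers, r = 4096, 1024, 32, 16

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj: 4096 -> 4096
    + lora_params(hidden, kv_dim, r)  # k_proj: 4096 -> 1024 (grouped-query attn)
    + lora_params(hidden, kv_dim, r)  # v_proj: 4096 -> 1024
    + lora_params(hidden, hidden, r)  # o_proj: 4096 -> 4096
)
print(per_layer * layers)  # 13631488, matching the printed count above
```

Doubling r doubles this count, which is why rank is the main capacity/cost knob.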
QLoRA quantizes the frozen base model to NF4 (4-bit NormalFloat) and trains LoRA adapters on top in BF16. A 65B model whose full 16-bit fine-tuning needs over 780GB of GPU memory fits on a single 48GB GPU. Three key innovations: NF4 quantization, double quantization (quantizing the quantization constants themselves), and paged optimizers (spilling optimizer state to CPU RAM during memory spikes).
QLoRA trades a small loss in convergence speed and final quality (~5%) for the ability to fine-tune billion-parameter models on consumer hardware. It's the most practical PEFT method for researchers without enterprise budgets.
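The training script above already covers NF4 and the compute dtype; the other two innovations map onto configuration flags. A sketch (the double-quant flag is from transformers' BitsAndBytesConfig; the paged optimizer is selected by the trainer's optim string, of which several paged variants exist):

```python
from transformers import BitsAndBytesConfig
import torch

# QLoRA-style quantization config with all three pieces visible
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and matmuls in BF16
)

# Paged optimizers are chosen through the trainer arguments, e.g.:
#   SFTConfig(..., optim="paged_adamw_32bit")
```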
Once training finishes, the adapter can be merged back into the base weights to produce a standalone model for deployment:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path = "./lora-adapter"

# Load base model in half precision for merging
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid OOM on smaller GPUs
)

# Load PEFT model (base + adapter)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge adapter weights into base model
print("Merging adapter into base model...")
merged_model = peft_model.merge_and_unload()
print("Merge complete.")

# Save merged model — now a standalone model without PEFT dependency
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer.save_pretrained("./merged-model")
print("Saved merged model to ./merged-model")

# The merged model:
# - Has ZERO runtime overhead vs base model
# - Can be served with any standard inference framework
# - Is indistinguishable from a full fine-tuned model
```
Start with LoRA rank 16. If you hit GPU memory limits, use QLoRA. If you need to serve multiple specialized models simultaneously without reloading, use adapter layers. For cosmetic changes only (style, tone, instruction format), try prefix tuning first.
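For the prefix-tuning route, here is a minimal sketch with the peft library (num_virtual_tokens is an illustrative choice, and base_model stands for any causal LM loaded as in the earlier examples):

```python
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Prefix tuning: learn `num_virtual_tokens` continuous vectors prepended to
# the keys/values of every attention layer; the base model stays frozen.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # illustrative prefix length
)

# model = get_peft_model(base_model, prefix_config)
# model.print_trainable_parameters()
```

Because only the prefix vectors are trained, the parameter count is tiny, which is why the method tops out at style and format changes rather than new capabilities.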
PEFT adapters are small — a LoRA adapter for a 7B model might be 40–80MB while the base model is 14GB. This creates a deployment opportunity: host one base model and hot-swap adapters per tenant or task. The PEFT library and vLLM both support multi-LoRA serving with dynamic adapter loading.
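With vLLM, multi-LoRA serving can be sketched roughly as follows (the adapter name, integer ID, and path are placeholders; check the vLLM documentation for the exact arguments your version expects):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model resident in GPU memory; adapters attached per request.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_lora=True)

outputs = llm.generate(
    ["Summarize this support ticket: ..."],
    SamplingParams(max_tokens=128),
    # (name, integer ID, local path) identify which adapter to apply
    lora_request=LoRARequest("support-adapter", 1, "./lora-adapter"),
)
```

A request with a different LoRARequest gets a different specialization without reloading the 14GB base.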
Key considerations: adapter compatibility (adapters are version-locked to the base model checkpoint), merging vs dynamic loading (merge for single-purpose deployment to eliminate serving overhead, dynamic loading for multi-task scenarios), and adapter versioning (treat adapters like model artifacts — version control, evaluation gates before promotion).
| Method | Trainable Params | Storage Overhead | Quality vs Full FT | Best For |
|---|---|---|---|---|
| LoRA | 0.1–1% | ~40–80MB adapter | ~90% | Domain adaptation, style |
| QLoRA | 0.1–1% | ~40MB adapter + 4-bit base | ~85–90% | Fine-tuning on a consumer GPU |
| Prefix Tuning | <0.1% | Minimal | ~80% | Prompt-style tasks |
| IA³ | <0.05% | Minimal | ~88% | Few-shot task specialization |
| Full Fine-Tuning | 100% | Full model copy | 100% (ceiling) | Maximum quality, large budget |