4-bit NF4 quantisation + LoRA adapters + double quantisation: fine-tune a 65B model on a single 48GB GPU. The technique that made LLM fine-tuning accessible to researchers without data-centre hardware.
Full fine-tuning: update all model weights. Best quality. Requires storing the full model in float16 for training (2 bytes/param), plus gradients and optimiser states (~16 bytes/param in total). A 7B model needs ~112GB of GPU memory – impractical without multiple A100s.
LoRA: freeze all original weights; inject small trainable rank-r matrices into attention layers. Only ~0.1–1% of parameters are trained. Memory: original model (float16) + small adapters + gradients. A 7B model needs ~16GB. Quality: within 1–2% of full fine-tuning for most tasks.
QLoRA (Dettmers et al. 2023): quantise the base model to 4-bit NF4 before adding LoRA adapters. This halves the memory of the base model again. A 7B model fits in ~6GB VRAM. A 65B model fits in 48GB. Quality loss from 4-bit quantisation is small when combined with LoRA, since the adapters learn to compensate for quantisation error. QLoRA is now the default approach for fine-tuning large open models on limited hardware.
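The three regimes can be compared with a back-of-envelope estimate. A minimal sketch, using the per-parameter byte counts from this section (fp16 weights + fp16 grads + fp32 AdamW states for full fine-tuning; frozen fp16 or 4-bit base otherwise) and deliberately ignoring activations and the small adapter overhead:

```python
def finetune_memory_gb(params_b: float, mode: str) -> float:
    """Rough weight/optimiser memory in GB for a model with params_b billion params."""
    bytes_per_param = {
        # fp16 weights (2) + fp16 grads (2) + fp32 master weights, m, v (12)
        "full": 16,
        # fp16 frozen base; adapter/optimiser overhead is negligible at this scale
        "lora": 2,
        # 4-bit frozen base: 0.5 bytes per param
        "qlora": 0.5,
    }[mode]
    return params_b * bytes_per_param  # billions of params x bytes = GB

for mode in ["full", "lora", "qlora"]:
    print(f"7B {mode:>5}: ~{finetune_memory_gb(7, mode):.1f} GB")
# full ~112 GB, lora ~14 GB (+adapters), qlora ~3.5 GB (+adapters)
```

The gap between the "lora" and "qlora" rows is exactly the halving (and then some) that 4-bit quantisation of the frozen base buys.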
NF4 (NormalFloat 4) is a 4-bit data type optimised for normally-distributed weights. Pretrained LLM weights are approximately normally distributed (zero mean, small standard deviation), so a quantisation scheme designed for this distribution uses its 16 levels more efficiently than a uniform grid.
NF4 maps each weight to the nearest of 16 quantisation levels, spaced to minimise quantisation error for a unit normal distribution. Contrast this with INT4 (uniform 4-bit integers), which wastes levels on rare extreme values.
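The shape of such a codebook can be sketched in plain Python: take equally spaced quantiles of a standard normal and rescale so the extreme levels sit at ±1. This is an approximation for illustration – the actual NF4 table in bitsandbytes is constructed slightly differently – but it shows the key property: dense levels near zero, sparse levels at the extremes.

```python
from statistics import NormalDist

nd = NormalDist()
# NF4-style codebook: equally spaced quantiles of a standard normal,
# rescaled so the extreme levels sit at +/-1 (illustrative approximation)
q = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
nf4_like = [round(v / max(q), 4) for v in q]

# INT4-style grid: 16 uniform steps over [-1, 1]
int4_like = [round(-1 + 2 * i / 15, 4) for i in range(16)]

print("NF4-like:", nf4_like)
print("uniform :", int4_like)

# Gaps between adjacent levels: the NF4-style grid is fine-grained near 0,
# where most weights live, and coarse at the edges; the uniform grid is not.
nf4_gaps = [nf4_like[i + 1] - nf4_like[i] for i in range(15)]
print(f"NF4-like gap near 0: {min(nf4_gaps):.3f}, at the edges: {max(nf4_gaps):.3f}")
```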
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NF4 quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 vs fp4 – NF4 is better for LLMs
    bnb_4bit_use_double_quant=True,         # double quantisation (see next section)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Check memory usage: 4-bit weights are stored packed into uint8 tensors
# (two weights per byte), so numel() here already counts bytes
for name, param in model.named_parameters():
    if "weight" in name and param.dtype == torch.uint8:
        print(f"{name}: 4-bit ({param.numel() / 1e6:.1f} MB)")
        break
```
4-bit quantisation requires storing quantisation constants (one per 64-weight block). These constants are float32, adding ~0.5 bits/param of overhead (32 bits per 64 weights). Double quantisation quantises these constants too (to 8-bit), reducing the overhead to ~0.127 bits/param.
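The overhead arithmetic as a quick sanity check, using the block sizes from the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block):

```python
# Without double quantisation: one fp32 absmax constant per 64-weight block
plain_bits = 32 / 64                    # 0.5 bits per parameter

# With double quantisation: constants stored as 8-bit values, plus one
# fp32 second-level constant per 256 first-level constants
double_bits = 8 / 64 + 32 / (64 * 256)  # ~0.127 bits per parameter

print(f"plain:  {plain_bits:.3f} bits/param")
print(f"double: {double_bits:.3f} bits/param")
print(f"saved:  {plain_bits - double_bits:.3f} bits/param")  # ~0.373
```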
The net result: a 7B model compressed from ~14GB (float16) to roughly 4GB (NF4 + double quantisation; embeddings and a few other layers stay in higher precision). Memory breakdown:
```python
# Memory calculation for QLoRA fine-tuning (weights + adapters + optimiser
# only; activations, KV cache and CUDA overhead come on top)
def qlora_memory_estimate(num_params_B: float, lora_rank: int = 16) -> dict:
    # Base model in 4-bit: ~0.5 bytes per param
    base_model_gb = num_params_B * 0.5
    # LoRA adapters in float16 (typically on the Q, K, V, O projections)
    # ~2 * rank * d_model * num_layers * 2 bytes;
    # rough heuristic: ~0.01% of params per rank unit
    adapters_gb = num_params_B * lora_rank * 0.0001
    # Gradients + optimiser (AdamW) for adapters only
    optim_gb = adapters_gb * 4  # ~4x adapter size for Adam states
    total = base_model_gb + adapters_gb + optim_gb
    return {"base": base_model_gb, "adapters": adapters_gb,
            "optimiser": optim_gb, "total": total}

for size in [7, 13, 34, 70]:
    m = qlora_memory_estimate(size)
    print(f"{size}B model: {m['total']:.1f} GB total "
          f"(base={m['base']:.1f}, adapters+optim={m['adapters']+m['optimiser']:.1f})")

# 7B:  ~3.6 GB  – fits an RTX 3080 10GB with headroom for activations
# 13B: ~6.6 GB  – RTX 4090 / A5000 class once activations are added
# 34B: ~17.3 GB – A100 40GB, or a 24GB card at small batch sizes
# 70B: ~35.6 GB – A100 80GB, or 2×A100 40GB
# Real-world usage runs a few GB higher: activations, KV cache, CUDA overhead.
```
```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model_name = "meta-llama/Llama-3.1-8B"

# 1. Load tokeniser
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")

# 3. Prepare for k-bit training (casts some layers back to float)
model = prepare_model_for_kbit_training(model)

# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,            # rank: 8 to 64; higher = more capacity, more memory
    lora_alpha=32,   # scaling factor (usually 2*r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # e.g. "0.26% of params are trainable"

# 5. Train
dataset = load_dataset("your-dataset", split="train")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./qlora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        max_seq_length=2048,
    ),
)
trainer.train()
trainer.save_model("./qlora-final")
```
Rank (r): controls the number of trainable parameters. r=8: minimal (good for simple style adaptation). r=16: standard default. r=64: for complex tasks like code generation or domain adaptation with large distribution shift. Higher rank = more capacity but more memory and risk of overfitting.
Alpha (lora_alpha): scaling factor applied to LoRA outputs before adding to the frozen weights. Effective LoRA contribution = (alpha/r) × LoRA_output. Setting alpha = 2r keeps the effective scale stable regardless of r. Common choices: r=16/alpha=32, r=64/alpha=128.
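Both knobs reduce to small closed-form expressions. A sketch, using Llama-3.1-8B's q_proj dimensions (4096 → 4096) as the worked example – substitute your own model's shapes:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorises the update as B @ A: A is (r x d_in), B is (d_out x r)
    return r * d_in + d_out * r

def effective_scale(alpha: int, r: int) -> float:
    # multiplier applied to the B @ A @ x contribution
    return alpha / r

# One q_proj (4096 -> 4096) at r=16: 131,072 trainable params
print(lora_param_count(4096, 4096, 16))

# alpha = 2r keeps the effective scale fixed as r changes
for r in (8, 16, 64):
    print(r, effective_scale(2 * r, r))  # always 2.0
```

This is why doubling r roughly doubles trainable parameters (both factors scale linearly in r), while alpha = 2r decouples the output magnitude from that choice.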
Target modules: which weight matrices to add adapters to. Minimal: q_proj, v_proj. Standard: q_proj, k_proj, v_proj, o_proj. Aggressive: include FFN layers (gate_proj, up_proj, down_proj) for tasks requiring more knowledge update.
```python
# Quick sweep to compare trainable-parameter counts at each rank.
# Note: get_peft_model mutates the model, so reload the base model each
# iteration rather than wrapping the same instance repeatedly.
for rank in [8, 16, 32, 64]:
    base = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto")
    config = LoraConfig(r=rank, lora_alpha=rank * 2,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    m = get_peft_model(base, config)
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"r={rank}: {trainable / 1e6:.1f}M trainable params")

# For Llama-3.1-8B (q_proj + v_proj across 32 layers):
# r=8: 3.4M | r=16: 6.8M | r=32: 13.6M | r=64: 27.3M
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in float16 (not 4-bit: merging requires full precision)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16, device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base, "./qlora-final")

# Merge adapter weights into the base model
# (produces a regular model with no PEFT overhead)
merged = model.merge_and_unload()

# Save as a standard Hugging Face model
merged.save_pretrained("./merged-model")

# Or export to GGUF for local inference with Ollama/llama.cpp:
#   python llama.cpp/convert_hf_to_gguf.py ./merged-model --outtype f16
#   ./llama.cpp/llama-quantize merged-f16.gguf merged-q4km.gguf Q4_K_M
```
4-bit models cannot be fine-tuned directly – you need prepare_model_for_kbit_training. This function casts certain normalisation layers back to float32 (they can't be trained in 4-bit) and enables gradient checkpointing. Skipping this step causes NaN losses or silent failures where gradients don't flow through the base model correctly.
Gradient checkpointing trades compute for memory. model.gradient_checkpointing_enable() recomputes activations during the backward pass instead of storing them. This reduces activation memory by ~60% but increases training time by ~20–30%. Essential for fitting large models in limited VRAM.
The merged model is larger than the 4-bit base. Merging adds LoRA weights back into the base weights at full precision β the merged model is back to float16 size. If you need a small model for deployment, quantise the merged model to GGUF/GPTQ after merging, not before.
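The size arithmetic for that pipeline is straightforward. A sketch, where the ~0.6 bytes/param figure for Q4_K_M is an approximation (K-quant formats mix 4-bit weights with per-block scales and keep some tensors at higher precision):

```python
def model_size_gb(params_b: float, bytes_per_param: float) -> float:
    # billions of params x bytes per param = GB
    return params_b * bytes_per_param

for label, bpp in [("NF4 base (training)", 0.5),
                   ("merged fp16 (after merge_and_unload)", 2.0),
                   ("re-quantised Q4_K_M, approx.", 0.6)]:
    print(f"8B {label}: ~{model_size_gb(8, bpp):.1f} GB")
```

Merging inflates an 8B model back to ~16GB on disk; re-quantising afterwards brings deployment size back down to the ~5GB range.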
| Hyperparameter | Typical Range | Impact | Recommendation |
|---|---|---|---|
| Quantisation bits | 4-bit (NF4) | Memory: saves ~75% vs fp16 | Always use NF4; 8-bit rarely worth the extra memory |
| LoRA rank (r) | 8–64 | Trainable params, expressivity | r=16 for most tasks; r=64 for complex tasks (code, large domain shift) |
| LoRA alpha | 16–128 | Effective learning-rate scale | Set alpha = 2r as a starting point |
| Gradient checkpointing | On/Off | Memory vs compute trade-off | Always enable; adds ~20–30% compute, saves ~60% activation memory |
| Batch size | 1–8 (+ grad accum) | Training stability | Effective batch of 32–64 via gradient accumulation |
QLoRA's memory savings come primarily from keeping the base model weights frozen in 4-bit and only computing gradients through the LoRA adapters in bfloat16. Double quantisation further reduces the quantisation constants themselves from fp32 to 8-bit, saving ~0.4 bits per parameter. For a 7B model this totals roughly 4–5 GB of VRAM for the base model, plus a few hundred MB for LoRA adapters and their optimiser states – making a single 16 GB consumer GPU sufficient for fine-tuning models that previously required 4× A100s.
When evaluating a QLoRA fine-tuned model, always compare against the quantised base model, not the full-precision base. The NF4 quantisation itself introduces a small quality degradation before any fine-tuning – typically 1–3 points on standard benchmarks. Your fine-tuning should recover this and improve beyond it. If your fine-tuned QLoRA model performs below the quantised baseline on held-out tasks, the training data or hyperparameters need adjustment before deployment.
The QLoRA paper demonstrated that 4-bit NF4 quantisation loses less than 1% accuracy vs bfloat16 on most downstream tasks. NF4's non-uniform bin boundaries are optimised for zero-centred normal weight distributions, giving lower quantisation error than a uniform 4-bit grid for the same bit budget. Always prefer NF4 over fp4 when using bitsandbytes.