4-bit NF4 quantisation + LoRA adapters + double quantisation: fine-tune a 65B model on a single 48GB GPU. The technique that made LLM fine-tuning accessible to researchers without data-centre hardware.
Full fine-tuning: update all model weights. Best quality. Requires storing the full model in float16 for training (2 bytes/param), plus gradients and optimiser states (~16 bytes/param in total). A 7B model needs ~112GB of GPU memory – impractical without multiple A100s.
LoRA: freeze all original weights; inject small trainable rank-r matrices into attention layers. Only ~0.1–1% of parameters are trained. Memory: original model (float16) + small adapters + gradients. A 7B model needs ~16GB. Quality: within 1–2% of full fine-tuning for most tasks.
QLoRA (Dettmers et al. 2023): quantise the base model to 4-bit NF4 before adding LoRA adapters. This halves the memory of the base model again. A 7B model fits in ~6GB VRAM. A 65B model fits in 48GB. Quality loss from 4-bit quantisation is small when combined with LoRA, since the adapters learn to compensate for quantisation error. QLoRA is now the default approach for fine-tuning large open models on limited hardware.
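The three regimes can be compared with a back-of-envelope estimate. A minimal sketch, using the per-parameter byte counts from this section (fp16 weights + fp16 grads + fp32 AdamW states for full fine-tuning; frozen fp16 or 4-bit base otherwise) and deliberately ignoring activations and the small adapter overhead:

```python
def finetune_memory_gb(params_b: float, mode: str) -> float:
    """Rough weight/optimiser memory in GB for a model with params_b billion params."""
    bytes_per_param = {
        # fp16 weights (2) + fp16 grads (2) + fp32 master weights, m, v (12)
        "full": 16,
        # fp16 frozen base; adapter/optimiser overhead is negligible at this scale
        "lora": 2,
        # 4-bit frozen base: 0.5 bytes per param
        "qlora": 0.5,
    }[mode]
    return params_b * bytes_per_param  # billions of params x bytes = GB

for mode in ["full", "lora", "qlora"]:
    print(f"7B {mode:>5}: ~{finetune_memory_gb(7, mode):.1f} GB")
# full ~112 GB, lora ~14 GB (+adapters), qlora ~3.5 GB (+adapters)
```

The gap between the "lora" and "qlora" rows is exactly the halving (and then some) that 4-bit quantisation of the frozen base buys.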
NF4 (NormalFloat 4) is a 4-bit data type optimised for normally-distributed weights. Pretrained LLM weights are approximately normally distributed (zero mean, small standard deviation), so a quantisation scheme designed for this distribution uses its 16 levels more efficiently than a uniform grid.
NF4 maps each weight to the nearest of 16 quantisation levels, spaced to minimise quantisation error for a unit normal distribution. Contrast this with INT4 (uniform 4-bit integers), which wastes levels on rare extreme values.
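The shape of such a codebook can be sketched in plain Python: take equally spaced quantiles of a standard normal and rescale so the extreme levels sit at ±1. This is an approximation for illustration – the actual NF4 table in bitsandbytes is constructed slightly differently – but it shows the key property: dense levels near zero, sparse levels at the extremes.

```python
from statistics import NormalDist

nd = NormalDist()
# NF4-style codebook: equally spaced quantiles of a standard normal,
# rescaled so the extreme levels sit at +/-1 (illustrative approximation)
q = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
nf4_like = [round(v / max(q), 4) for v in q]

# INT4-style grid: 16 uniform steps over [-1, 1]
int4_like = [round(-1 + 2 * i / 15, 4) for i in range(16)]

print("NF4-like:", nf4_like)
print("uniform :", int4_like)

# Gaps between adjacent levels: the NF4-style grid is fine-grained near 0,
# where most weights live, and coarse at the edges; the uniform grid is not.
nf4_gaps = [nf4_like[i + 1] - nf4_like[i] for i in range(15)]
print(f"NF4-like gap near 0: {min(nf4_gaps):.3f}, at the edges: {max(nf4_gaps):.3f}")
```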
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NF4 quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 vs fp4 – NF4 is better for LLMs
    bnb_4bit_use_double_quant=True,         # double quantisation (see next section)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Check memory usage: 4-bit weights are stored packed into uint8 tensors
# (two weights per byte), so numel() here already counts bytes
for name, param in model.named_parameters():
    if "weight" in name and param.dtype == torch.uint8:
        print(f"{name}: 4-bit ({param.numel() / 1e6:.1f} MB)")
        break
```
4-bit quantisation requires storing quantisation constants (one per 64-weight block). These constants are float32, adding ~0.5 bits/param of overhead (32 bits per 64 weights). Double quantisation quantises these constants too (to 8-bit), reducing the overhead to ~0.127 bits/param.
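The overhead arithmetic as a quick sanity check, using the block sizes from the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block):

```python
# Without double quantisation: one fp32 absmax constant per 64-weight block
plain_bits = 32 / 64                    # 0.5 bits per parameter

# With double quantisation: constants stored as 8-bit values, plus one
# fp32 second-level constant per 256 first-level constants
double_bits = 8 / 64 + 32 / (64 * 256)  # ~0.127 bits per parameter

print(f"plain:  {plain_bits:.3f} bits/param")
print(f"double: {double_bits:.3f} bits/param")
print(f"saved:  {plain_bits - double_bits:.3f} bits/param")  # ~0.373
```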
The net result: a 7B model compressed from ~14GB (float16) to roughly 4GB (NF4 + double quantisation; embeddings and a few other layers stay in higher precision). Memory breakdown:
```python
# Memory calculation for QLoRA fine-tuning (weights + adapters + optimiser
# only; activations, KV cache and CUDA overhead come on top)
def qlora_memory_estimate(num_params_B: float, lora_rank: int = 16) -> dict:
    # Base model in 4-bit: ~0.5 bytes per param
    base_model_gb = num_params_B * 0.5
    # LoRA adapters in float16 (typically on the Q, K, V, O projections)
    # ~2 * rank * d_model * num_layers * 2 bytes;
    # rough heuristic: ~0.01% of params per rank unit
    adapters_gb = num_params_B * lora_rank * 0.0001
    # Gradients + optimiser (AdamW) for adapters only
    optim_gb = adapters_gb * 4  # ~4x adapter size for Adam states
    total = base_model_gb + adapters_gb + optim_gb
    return {"base": base_model_gb, "adapters": adapters_gb,
            "optimiser": optim_gb, "total": total}

for size in [7, 13, 34, 70]:
    m = qlora_memory_estimate(size)
    print(f"{size}B model: {m['total']:.1f} GB total "
          f"(base={m['base']:.1f}, adapters+optim={m['adapters']+m['optimiser']:.1f})")

# 7B:  ~3.6 GB  – fits an RTX 3080 10GB with headroom for activations
# 13B: ~6.6 GB  – RTX 4090 / A5000 class once activations are added
# 34B: ~17.3 GB – A100 40GB, or a 24GB card at small batch sizes
# 70B: ~35.6 GB – A100 80GB, or 2×A100 40GB
# Real-world usage runs a few GB higher: activations, KV cache, CUDA overhead.
```
```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model_name = "meta-llama/Llama-3.1-8B"

# 1. Load tokeniser
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")

# 3. Prepare for k-bit training (casts some layers back to float)
model = prepare_model_for_kbit_training(model)

# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,            # rank: 8 to 64; higher = more capacity, more memory
    lora_alpha=32,   # scaling factor (usually 2*r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # e.g. "0.26% of params are trainable"

# 5. Train
dataset = load_dataset("your-dataset", split="train")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./qlora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        max_seq_length=2048,
    ),
)
trainer.train()
trainer.save_model("./qlora-final")
```
Rank (r): controls the number of trainable parameters. r=8: minimal (good for simple style adaptation). r=16: standard default. r=64: for complex tasks like code generation or domain adaptation with large distribution shift. Higher rank = more capacity but more memory and risk of overfitting.
Alpha (lora_alpha): scaling factor applied to LoRA outputs before adding to the frozen weights. Effective LoRA contribution = (alpha/r) × LoRA_output. Setting alpha = 2r keeps the effective scale stable regardless of r. Common choices: r=16/alpha=32, r=64/alpha=128.
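Both knobs reduce to small closed-form expressions. A sketch, using Llama-3.1-8B's q_proj dimensions (4096 → 4096) as the worked example – substitute your own model's shapes:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorises the update as B @ A: A is (r x d_in), B is (d_out x r)
    return r * d_in + d_out * r

def effective_scale(alpha: int, r: int) -> float:
    # multiplier applied to the B @ A @ x contribution
    return alpha / r

# One q_proj (4096 -> 4096) at r=16: 131,072 trainable params
print(lora_param_count(4096, 4096, 16))

# alpha = 2r keeps the effective scale fixed as r changes
for r in (8, 16, 64):
    print(r, effective_scale(2 * r, r))  # always 2.0
```

This is why doubling r roughly doubles trainable parameters (both factors scale linearly in r), while alpha = 2r decouples the output magnitude from that choice.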
Target modules: which weight matrices to add adapters to. Minimal: q_proj, v_proj. Standard: q_proj, k_proj, v_proj, o_proj. Aggressive: include FFN layers (gate_proj, up_proj, down_proj) for tasks requiring more knowledge update.
```python
# Quick sweep to compare trainable-parameter counts at each rank.
# Note: get_peft_model mutates the model, so reload the base model each
# iteration rather than wrapping the same instance repeatedly.
for rank in [8, 16, 32, 64]:
    base = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto")
    config = LoraConfig(r=rank, lora_alpha=rank * 2,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    m = get_peft_model(base, config)
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"r={rank}: {trainable / 1e6:.1f}M trainable params")

# For Llama-3.1-8B (q_proj + v_proj across 32 layers):
# r=8: 3.4M | r=16: 6.8M | r=32: 13.6M | r=64: 27.3M
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in float16 (not 4-bit: merging requires full precision)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16, device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base, "./qlora-final")

# Merge adapter weights into the base model
# (produces a regular model with no PEFT overhead)
merged = model.merge_and_unload()

# Save as a standard Hugging Face model
merged.save_pretrained("./merged-model")

# Or export to GGUF for local inference with Ollama/llama.cpp:
#   python llama.cpp/convert_hf_to_gguf.py ./merged-model --outtype f16
#   ./llama.cpp/llama-quantize merged-f16.gguf merged-q4km.gguf Q4_K_M
```
4-bit models cannot be fine-tuned directly – you need prepare_model_for_kbit_training. This function casts certain normalisation layers back to float32 (they can't be trained in 4-bit) and enables gradient checkpointing. Skipping this step causes NaN losses or silent failures where gradients don't flow through the base model correctly.
Gradient checkpointing trades compute for memory. model.gradient_checkpointing_enable() recomputes activations during the backward pass instead of storing them. This reduces activation memory by ~60% but increases training time by ~20–30%. Essential for fitting large models in limited VRAM.
The merged model is larger than the 4-bit base. Merging adds LoRA weights back into the base weights at full precision β the merged model is back to float16 size. If you need a small model for deployment, quantise the merged model to GGUF/GPTQ after merging, not before.
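The size arithmetic for that pipeline is straightforward. A sketch, where the ~0.6 bytes/param figure for Q4_K_M is an approximation (K-quant formats mix 4-bit weights with per-block scales and keep some tensors at higher precision):

```python
def model_size_gb(params_b: float, bytes_per_param: float) -> float:
    # billions of params x bytes per param = GB
    return params_b * bytes_per_param

for label, bpp in [("NF4 base (training)", 0.5),
                   ("merged fp16 (after merge_and_unload)", 2.0),
                   ("re-quantised Q4_K_M, approx.", 0.6)]:
    print(f"8B {label}: ~{model_size_gb(8, bpp):.1f} GB")
```

Merging inflates an 8B model back to ~16GB on disk; re-quantising afterwards brings deployment size back down to the ~5GB range.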
| Hyperparameter | Typical Range | Impact | Recommendation |
|---|---|---|---|
| Quantisation bits | 4-bit (NF4) | Memory: saves ~75% vs fp16 | Always use NF4; 8-bit rarely worth the extra memory |
| LoRA rank (r) | 8–64 | Trainable params, expressivity | r=16 for most tasks; r=64 for complex tasks (code, large domain shift) |
| LoRA alpha | 16–128 | Effective learning-rate scale | Set alpha = 2r as a starting point |
| Gradient checkpointing | On/Off | Memory vs compute trade-off | Always enable; adds ~20–30% compute, saves ~60% activation memory |
| Batch size | 1–8 (+ grad accum) | Training stability | Effective batch of 32–64 via gradient accumulation |
QLoRA's memory savings come primarily from keeping the base model weights frozen in 4-bit and only computing gradients through the LoRA adapters in bfloat16. Double quantisation further reduces the quantisation constants themselves from fp32 to 8-bit, saving ~0.4 bits per parameter. For a 7B model this totals roughly 4–5 GB of VRAM for the base model, plus a few hundred MB for LoRA adapters and their optimiser states – making a single 16 GB consumer GPU sufficient for fine-tuning models that previously required 4× A100s.
When evaluating a QLoRA fine-tuned model, always compare against the quantised base model, not the full-precision base. The NF4 quantisation itself introduces a small quality degradation before any fine-tuning – typically 1–3 points on standard benchmarks. Your fine-tuning should recover this and improve beyond it. If your fine-tuned QLoRA model performs below the quantised baseline on held-out tasks, the training data or hyperparameters need adjustment before deployment.
The QLoRA paper demonstrated that 4-bit NF4 quantisation loses less than 1% accuracy vs bfloat16 on most downstream tasks. NF4's non-uniform bin boundaries are optimised for zero-centred normal weight distributions, giving lower quantisation error than a uniform 4-bit grid for the same bit budget. Always prefer NF4 over fp4 when using bitsandbytes.