Quantisation

BitsAndBytes

bitsandbytes (bnb) provides on-the-fly 8-bit and 4-bit quantisation for PyTorch models, enabling fine-tuning and inference of large models on consumer GPUs. The backbone of QLoRA — no separate quantisation step required.

On-the-fly quantisation | 4-bit NF4 best format | QLoRA fine-tuning support


SECTION 01

What bitsandbytes provides

bitsandbytes is a Python library by Tim Dettmers (now maintained by HuggingFace) that adds 8-bit and 4-bit quantised linear layers to PyTorch. Unlike GPTQ or AWQ, it doesn't require a separate quantisation step — the model is quantised on-the-fly as it's loaded from fp16/bf16 weights. This makes it the easiest way to run large models on limited hardware.

Its main use cases: (1) loading a 7B–70B model on a single consumer GPU for inference, and (2) QLoRA fine-tuning — training LoRA adapters on top of a 4-bit quantised base model. For production serving, GPTQ or AWQ with fused kernels is faster; bnb trades throughput for convenience.

pip install bitsandbytes
# Verify installation
python -c "import bitsandbytes; print(bitsandbytes.__version__)"
SECTION 02

8-bit inference (LLM.int8)

LLM.int8() (Dettmers et al. 2022) was the first method to make 7B+ model inference practical on consumer GPUs. It uses a mixed-precision decomposition: most weights are stored in int8, but outlier feature dimensions (large activation values) are kept in fp16. This maintains model quality while halving memory usage.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load in 8-bit (newer transformers versions prefer passing
# quantization_config=BitsAndBytesConfig(load_in_8bit=True) instead)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Check memory savings
for name, param in model.named_parameters():
    if "weight" in name:
        print(f"{name}: {param.dtype}, {param.numel() * param.element_size() / 1e6:.1f}MB")
        break

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

8-bit is almost lossless (perplexity increase <0.3%) but only gives 2× compression. For most use cases, 4-bit is the better choice.

SECTION 03

4-bit NF4 and double quantisation

from transformers import BitsAndBytesConfig, AutoModelForCausalLM
import torch

# BitsAndBytesConfig encapsulates all quantisation settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 (better for normal-distributed weights) or fp4
    bnb_4bit_use_double_quant=True,     # quantise the scale factors too (~0.4 bits/param saving)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 even though weights stored in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Memory usage comparison:
# fp16: 16GB | int8: 8GB | nf4: ~4.5GB | nf4 + double quant: ~4.3GB
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

Double quantisation quantises the scale factors themselves (which are fp32) using 8-bit values, saving an additional ~0.4 bits per parameter. On a 7B model this saves roughly 350MB.

SECTION 04

QLoRA integration

bitsandbytes is the backbone of QLoRA fine-tuning. The base model is loaded in 4-bit NF4, then LoRA adapters (in fp16) are added on top. During the forward pass, weights are dequantised to bf16 for the matrix multiplication, gradients flow only through the LoRA parameters, and the base model weights stay frozen in 4-bit.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                              quantization_config=bnb_config,
                                              device_map="auto")

# Prepare for training — enables gradient checkpointing, casts norms to fp32
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
                          lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints e.g.: trainable params: 6,815,744 || trainable%: ~0.18
# (with 4-bit packing, the reported all-params count appears roughly halved)
SECTION 05

Mixed-precision vs full quantisation

bitsandbytes uses compute_dtype separate from storage dtype. Weights are stored in 4-bit NF4 but dequantised to bf16 for the actual matrix multiplication. This is crucial: most GPU hardware doesn't have native 4-bit matrix multiply instructions, so the 4-bit format is purely for memory storage. During the forward pass: load 4-bit weights → dequantise to bf16 → multiply → stay in bf16.

This means bitsandbytes quantisation is memory-bound, not compute-bound. It saves memory (and therefore allows larger batch sizes or longer contexts) but doesn't speed up the matrix multiply itself. For compute-bound workloads (large batch sizes), GPTQ or AWQ with int4 tensor core support is faster.
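The storage-vs-compute split can be sketched in a few lines of plain Python. This is a toy simulation with a uniform 4-bit grid, not bnb's actual kernel (NF4 uses non-uniform, normal-quantile levels), and the block values are illustrative:

```python
# Toy simulation of "store in 4-bit, compute in float" (uniform int4 grid
# for simplicity; bnb's NF4 uses non-uniform, normal-quantile levels).

def quantize_block(weights, levels=15):
    """Map a block of floats to 4-bit codes (0..15) plus one scale."""
    scale = max(abs(w) for w in weights) or 1.0
    return [round((w / scale + 1) / 2 * levels) for w in weights], scale

def dequantize_block(codes, scale, levels=15):
    """Recover approximate floats from the 4-bit codes."""
    return [(c / levels * 2 - 1) * scale for c in codes]

w = [0.31, -0.12, 0.88, -0.95]            # one block of weights
codes, scale = quantize_block(w)          # storage: 4 bits/weight + 1 scale
w_hat = dequantize_block(codes, scale)    # dequantised "compute copy"

x = [1.0, 2.0, -1.0, 0.5]                 # activations
y = sum(wi * xi for wi, xi in zip(w_hat, x))  # the matmul runs in float
```

The multiply itself sees ordinary floating-point operands, which is exactly why 4-bit storage saves memory but not FLOPs.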

SECTION 06

Memory vs speed trade-offs

Summary of quantisation options for a 7B model (approximate):

Method           VRAM      Inference speed   Quality
fp16 (no quant)  14 GB     baseline          best
bnb int8         7 GB      ~70% of fp16      ~fp16
bnb nf4          4.5 GB    ~50% of fp16      ~-1% perplexity
GPTQ 4-bit       4 GB      ~80% of fp16      ~-1% perplexity
AWQ 4-bit        4 GB      ~85% of fp16      ~-0.5% perplexity

For inference-only deployments with throughput requirements: AWQ > GPTQ > bnb. For fine-tuning and quick experimentation: bnb wins on convenience.

SECTION 07

Gotchas

CUDA version requirements: bitsandbytes requires CUDA 11.1+ and a sufficiently recent GPU architecture (Kepler+ at minimum; Ampere+ recommended for best performance). If you hit CUDA errors, run python -m bitsandbytes for a diagnostic report.

CPU offloading doesn't work with 4-bit: device_map="auto" can offload some layers to CPU, but 4-bit operations require GPU. Set llm_int8_enable_fp32_cpu_offload=True to allow partial CPU offload with int8 only.
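A hedged sketch of the int8-with-offload configuration described above (the model name and memory limits are illustrative, not recommendations):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: int8 quantisation with partial CPU offload enabled.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,   # offloaded layers run in fp32 on CPU
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",                       # overflow layers land on CPU
    max_memory={0: "10GiB", "cpu": "30GiB"}, # illustrative limits
)
```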

Saving quantised models: historically, bnb-quantised models saved their weights back in fp16 and re-quantised at load time; recent transformers/bitsandbytes versions can serialize 4-bit checkpoints directly via save_pretrained. For broadly compatible pre-quantised artefacts, GPTQ or AWQ remain the standard choice.

Gradient accumulation with QLoRA: Use gradient_accumulation_steps > 1 to simulate larger batches. With 4-bit base + LoRA, effective batch of 64 is achievable on a single 24GB GPU with accumulation=8 and per_device_batch=8.
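The accumulation setup above translates into Trainer settings along these lines (hyperparameters are illustrative, not a tuned recipe):

```python
from transformers import TrainingArguments

# Illustrative settings for the 24GB-GPU scenario described above.
args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,    # effective batch = 8 * 8 = 64
    gradient_checkpointing=True,
    bf16=True,
    learning_rate=2e-4,
)
```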

bitsandbytes Quantization Methods

The bitsandbytes library provides 8-bit and 4-bit quantization for PyTorch model weights, enabling large models to be loaded and run on GPUs with significantly less VRAM than required for full-precision weights. It integrates directly with HuggingFace Transformers via the load_in_8bit and load_in_4bit flags, making quantized inference accessible with minimal code changes.

Method          Precision  VRAM Reduction     Quality Impact          Use Case
FP16 baseline   16-bit     none (reference)   none                    standard inference
LLM.int8()      8-bit      ~50%               minimal                 inference on consumer GPUs
NF4 (QLoRA)     4-bit      ~75%               small                   fine-tuning large models
FP4             4-bit      ~75%               slightly more than NF4  inference only

The LLM.int8() algorithm uses a mixed-precision decomposition: it identifies outlier feature dimensions in the activation tensors that carry disproportionate magnitude and keeps those in FP16, while quantizing the remaining 99.9% of values to INT8. This decomposition approach avoids the catastrophic quality degradation that occurs with naive INT8 quantization of transformer activations, which are characterized by extreme outlier values that emerge in models above a certain scale threshold (approximately 6B parameters).
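The decomposition can be illustrated with a toy matmul in plain Python (shapes and values are illustrative; the real kernel operates on whole columns of batched activations, but the 6.0 threshold matches the library's llm_int8_threshold default):

```python
# Toy LLM.int8() decomposition: activation dimensions above a magnitude
# threshold take the fp16 path; the rest are quantised to int8.
THRESHOLD = 6.0

def quantize_int8(vec):
    """Symmetric int8 quantisation with a single scale."""
    scale = max(abs(v) for v in vec) / 127
    return [round(v / scale) for v in vec], scale

x = [0.5, -0.3, 45.0, 0.8]           # dimension 2 is an outlier
w = [0.1, 0.2, 0.3, 0.4]             # one weight column

outlier = [i for i, v in enumerate(x) if abs(v) > THRESHOLD]
regular = [i for i in range(len(x)) if i not in outlier]

q, s = quantize_int8([x[i] for i in regular])
int8_part = sum(qi * s * w[i] for qi, i in zip(q, regular))   # int8 path
fp16_part = sum(x[i] * w[i] for i in outlier)                  # fp16 path

y = int8_part + fp16_part                  # sum of the two partial results
exact = sum(xi * wi for xi, wi in zip(x, w))
```

Because the outlier dimension dominates the dot product, keeping it in full precision is what preserves output quality.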

NF4 quantization uses a data type specifically designed for neural network weights. Normal Float 4 (NF4) is constructed by finding the 16 quantization levels that are optimal for normally distributed data — which weight tensors approximately follow after normalization. This makes NF4 more information-efficient than INT4 for neural network weights, explaining why QLoRA fine-tuning with NF4 quantization retains higher quality than naive 4-bit quantization with the same bit budget.
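The construction idea can be sketched with the standard library. This reproduces the principle of quantile-spaced levels, not the exact published NF4 codebook (which additionally guarantees an exact zero level and splits negative/positive levels asymmetrically):

```python
from statistics import NormalDist

# Sketch of the idea behind NF4: place the 16 quantisation levels at
# evenly spaced quantiles of a standard normal, then rescale to [-1, 1].
n = 16
offset = 1 / (2 * n)                       # keep quantiles inside (0, 1)
qs = [offset + i * (1 - 2 * offset) / (n - 1) for i in range(n)]
levels = [NormalDist().inv_cdf(q) for q in qs]
scale = max(abs(l) for l in levels)
levels = [l / scale for l in levels]       # normalise to [-1, 1]
# Levels cluster near zero, matching the density of normalised weights.
```

The resulting spacing is tight around zero and wide at the tails, which is what makes NF4 more information-efficient than a uniform INT4 grid for approximately normal weights.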

Double quantization, introduced in QLoRA, applies a second level of quantization to the quantization constants themselves. In block-wise quantization, each block of 64 weights shares a floating-point scale constant. These scale constants consume approximately 0.5 bits per weight at 64-block granularity. Double quantization quantizes these scale constants from FP32 to FP8, reducing the overhead from 0.5 bits/weight to approximately 0.127 bits/weight — a significant reduction that further decreases the total memory footprint of 4-bit quantized models.
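The arithmetic behind those overhead figures, as a quick check (the 256-scale second-level group size follows the QLoRA paper):

```python
# Checking the quoted per-weight overhead figures.
block = 64
fp32_scales = 32 / block                        # 0.5 bits/weight of overhead

# Double quant: 8-bit scales, plus one fp32 constant per 256 scales.
double_quant = 8 / block + 32 / (block * 256)   # ~0.127 bits/weight

saving_bits = fp32_scales - double_quant        # ~0.37 bits/weight
saved_mb_7b = 7e9 * saving_bits / 8 / 1e6       # ~330 MB on a 7B model
```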

Paged optimizer states in bitsandbytes reduce the memory cost of Adam optimizer states during fine-tuning. Standard Adam maintains FP32 first and second moment estimates for every parameter, consuming 8 bytes per parameter, double the size of the FP16 weights. The 8-bit paged Adam optimizer (bnb.optim.PagedAdamW8bit) quantizes these states to 8-bit and pages them to CPU RAM when GPU memory is under pressure, allowing fine-tuning of larger models on memory-constrained hardware without gradient-accumulation tricks that increase training time.
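Back-of-envelope numbers for a 7B model make the saving concrete:

```python
# Optimizer-state memory for a 7B model, full fine-tuning.
params = 7e9
fp32_adam_gb = params * 8 / 1e9   # two fp32 moments = 8 bytes/param -> 56 GB
adam8bit_gb = params * 2 / 1e9    # two 8-bit moments = 2 bytes/param -> 14 GB
```

In practice the paged variant is bnb.optim.PagedAdamW8bit (AdamW8bit without paging); with QLoRA only the LoRA parameters carry optimizer state at all, shrinking these numbers by orders of magnitude.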

Mixed-precision inference with bitsandbytes allows the quantization level to vary by layer. The embedding layer and the language-model head (the model's input and output projections, not transformer layers proper) are kept in full precision because they are particularly sensitive to quantization error. Intermediate attention and FFN layers use 8-bit or 4-bit quantization. This precision assignment is the default behavior of LLM.int8() and produces better output quality than uniformly quantizing all layers, at minimal additional memory cost, since the unquantized boundary layers represent a small fraction of total parameters.

Gradient checkpointing combined with bitsandbytes quantization enables fine-tuning of very large models on limited hardware by trading compute for memory. Without gradient checkpointing, activations from all transformer layers are stored in GPU memory during the forward pass to speed up the backward pass. With checkpointing, only a subset of activations are stored; the rest are recomputed during backpropagation. Combined with 4-bit model weights and 8-bit optimizer states, this combination makes fine-tuning 70B+ parameter models feasible on machines with 2–4 consumer GPUs that would otherwise require an 8-GPU server.
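A rough memory budget shows why this combination works (numbers are illustrative: the adapter size is an assumption, and activation memory, which depends on batch size, sequence length, and checkpointing, is excluded):

```python
# Rough static-memory budget for QLoRA fine-tuning a 70B model.
params = 70e9
lora_params = 0.2e9                          # assumed adapter size

weights_4bit_gb = params * 0.5 / 1e9         # NF4: ~0.5 bytes/param -> 35 GB
lora_fp16_gb = lora_params * 2 / 1e9         # adapters held in fp16
optim_8bit_gb = lora_params * 2 / 1e9        # 8-bit Adam on adapters only

total_gb = weights_4bit_gb + lora_fp16_gb + optim_8bit_gb   # ~36 GB
```

About 36 GB of static state splits across two 24 GB consumer GPUs with room for checkpointed activations, which is the regime the paragraph describes.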