GPTQ (Generative Pre-trained Transformer Quantisation) is a post-training quantisation method that compresses LLM weights to 4-bit or 3-bit with minimal quality loss, using second-order Hessian information for per-layer calibration.
GPTQ (Frantar et al. 2022) quantises a pre-trained model's weights to low-bit integers without any fine-tuning. The key insight: instead of rounding each weight independently, GPTQ uses second-order Hessian information to compensate for quantisation errors. When a weight is rounded, other weights in the same layer are adjusted to cancel out the introduced error.
The process works layer by layer. For each layer, GPTQ solves a variant of the optimal brain surgeon (OBS) problem: given a Hessian matrix computed from calibration data, quantise each weight column with minimal error, then update the remaining unquantised columns to compensate. Lazy batched updates and a Cholesky reformulation keep the per-layer cost low enough that even 70B models quantise in hours on a single GPU (offloading layers to CPU between steps).
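The inner loop is compact enough to sketch. Below is a toy NumPy rendering of the quantise-and-compensate step; the single global scale and the function name are illustrative, and the real implementation adds per-group scales, blocking, and a Cholesky-based Hessian inverse:

```python
import numpy as np

def gptq_quantize(W, H_inv, scale):
    """Toy GPTQ inner loop: quantise the columns of W left to right and
    spread each column's rounding error onto the not-yet-quantised
    columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    d = W.shape[1]
    for j in range(d):
        # round-to-nearest quantisation of column j
        Q[:, j] = np.round(W[:, j] / scale) * scale
        # error, scaled by the diagonal of the inverse Hessian
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]
        # compensate: shift the remaining columns to cancel the error
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return Q
```

With an identity Hessian the compensation term vanishes and the loop degenerates to plain round-to-nearest; the quality gain comes entirely from the off-diagonal Hessian entries learned from calibration data.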
The result is a model where each weight is stored as a 4-bit integer, with a separate fp16 scale factor per group (typically 128 weights). The dequantisation to fp16 happens during inference, typically using fused CUDA kernels (ExLlamaV2, AutoGPTQ).
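That storage format is easy to sketch. A minimal version, assuming symmetric quantisation with one scale per group (real GPTQ checkpoints also store zero-points and pack the codes into int32s):

```python
import numpy as np

def quantize_groups(w, group_size=128, bits=4):
    """Simplified symmetric per-group quantisation: every `group_size`
    weights share one fp16 scale; codes fit in `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    g = w.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / qmax
    codes = np.round(g / scales).astype(np.int8)  # integer codes in [-7, 7]
    return codes, scales.astype(np.float16)

def dequantize_groups(codes, scales):
    # what the fused kernel does on the fly at inference time
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(-1)
```

At inference the fused kernel performs the `dequantize_groups` step inside the matmul rather than materialising an fp16 weight tensor.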
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibration dataset: 128 samples is sufficient
calibration_data = [
    tokenizer(text, return_tensors="pt")
    for text in [
        "The quick brown fox jumps over the lazy dog.",
        "In machine learning, a neural network...",
        # ... 126 more diverse examples
    ]
]

quant_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantisation (3 and 8 also supported)
    group_size=128,     # weights per scale factor (128 is standard)
    damp_percent=0.01,  # numerical stability for the Hessian inversion
    desc_act=False,     # True gives better quality but slower inference
)

model = AutoGPTQForCausalLM.from_pretrained(model_name, quant_config)
model.quantize(calibration_data)

model.save_quantized("llama3-8b-gptq-4bit", use_safetensors=True)
tokenizer.save_pretrained("llama3-8b-gptq-4bit")
print("Quantisation complete!")
```
Most LLM serving frameworks support GPTQ natively. The quantised model is stored on disk in 4-bit, loaded to GPU, and dequantised on-the-fly during the forward pass using optimised CUDA kernels.
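What "dequantised on-the-fly" means concretely: checkpoints pack eight 4-bit codes into each int32, and the kernel unpacks and rescales them as it reads. A simplified NumPy rendering (the function name is mine; it assumes a fixed zero-point of 8 and no `g_idx` reordering):

```python
import numpy as np

def unpack_gptq_column(packed, scales, group_size=128):
    """Unpack eight unsigned 4-bit codes from each int32, recentre
    them, and apply per-group scales."""
    packed = packed.astype(np.uint32)
    # nibble i of each int32 holds code 8k + i (least significant first)
    codes = np.stack([(packed >> (4 * i)) & 0xF for i in range(8)], axis=-1)
    ints = codes.reshape(-1).astype(np.int32) - 8   # recentre to [-8, 7]
    return ints.reshape(-1, group_size) * scales[:, None]
```

The fast CUDA kernels fuse this unpacking directly into the 4-bit matmul, so fp16 weights are never materialised in GPU memory.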
```python
# Load a GPTQ model for inference
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, pipeline

model = AutoGPTQForCausalLM.from_quantized(
    "llama3-8b-gptq-4bit",
    use_safetensors=True,
    device="cuda:0",
    inject_fused_attention=True,  # fused attention kernels (~2x speedup)
    inject_fused_mlp=True,
)
tokenizer = AutoTokenizer.from_pretrained("llama3-8b-gptq-4bit")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("The capital of France is", max_new_tokens=50)
print(result[0]["generated_text"])

# Or serve with vLLM / TGI directly:
#   vllm serve ./llama3-8b-gptq-4bit --quantization gptq
#   docker run ... --quantize gptq   (TGI)
```
GPTQ, AWQ, and bitsandbytes all quantise weights to 4-bit, but they differ in algorithm and trade-offs:
- **GPTQ**: second-order Hessian optimisation per layer. Good quality. Takes 1–4 hours to quantise a 7B model. Requires calibration data. Widely supported across serving frameworks. Best for: deploying to production with prebuilt quantised models.
- **AWQ** (Activation-aware Weight Quantisation): identifies which weights are most important by looking at activation magnitudes, then protects them during quantisation. Typically better quality than GPTQ at 4-bit, especially on instruction-following tasks. Quantisation is faster. Best for: highest-quality 4-bit inference.
- **bitsandbytes**: on-the-fly quantisation during model loading, with no separate quantisation step. Convenient but slower at inference (no fused kernels). Best for: quick experiments and fine-tuning (QLoRA).
- **Bit-width**: 4-bit is the standard sweet spot: roughly 4× compression vs fp16 with under 1% perplexity degradation in many reports. 3-bit gives ~5.3× compression but noticeable quality loss (2–4% perplexity increase). 8-bit is almost lossless but only 2× compression, and is usually better served by bitsandbytes int8 for simplicity.
- **Group size**: controls the granularity of scale factors. Smaller group size means more scale factors: better quality but a larger file.
- **desc_act=True**: weight columns are quantised in decreasing order of activation magnitude, so the most influential weights are quantised first, while the most compensation headroom remains. Better quality, but the required reordering at inference is not supported by all kernels. Use desc_act=False for maximum compatibility and speed.
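The storage cost of the per-group scales is easy to work out. A quick back-of-envelope helper (assuming fp16 scales and ignoring zero-point and packing overhead):

```python
def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Average storage per weight: the quantised code plus the
    amortised per-group fp16 scale. A stored zero-point would add
    another group_size-amortised term."""
    return bits + scale_bits / group_size

# compression vs fp16 for common settings
for g in (32, 64, 128):
    bpw = effective_bits(4, g)
    print(f"group_size={g}: {bpw:.3f} bits/weight, {16 / bpw:.2f}x vs fp16")
```

At group size 128 this gives 4.125 bits per weight, i.e. roughly 3.9× compression vs fp16, which is where the "roughly 4×" figure comes from; dropping to group size 32 costs 4.5 bits per weight.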
GPTQ needs calibration data to compute the Hessian (second-order statistics of activations). This data should be representative of your inference distribution.
```python
from datasets import load_dataset

# Standard: use a slice of general pretraining-style data
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in wiki["text"] if len(t) > 100][:128]

calibration_data = [
    tokenizer(
        text,
        return_tensors="pt",
        max_length=2048,
        truncation=True,
    )
    for text in texts
]
```
128 samples of 2048 tokens each is sufficient for most models. Using domain-specific data (your actual prompts) can improve quality for specialised use cases, but wikitext works well as a general calibration set.
- **VRAM during quantisation**: the quantisation process loads the full fp16 model. A 70B model needs ~140GB of weights, so quantising the largest models requires multiple GPUs or CPU offloading (GPTQ works one layer at a time, so layers can be moved to the GPU on demand), even though the quantised output fits on a single GPU.
- **Pre-quantised models on HF Hub**: for popular models, pre-quantised GPTQ versions are available (search for the `-GPTQ` suffix, e.g. `TheBloke/Llama-2-7B-GPTQ`). No need to quantise yourself unless you have a custom model.
- **Kernel compatibility**: the ExLlamaV2 kernels that make GPTQ fast require recent GPU architectures (Ampere or newer). On older GPUs, fall back to AutoGPTQ's slower kernels or use a different quantisation scheme.
- **Not suitable for fine-tuning**: GPTQ-quantised weights can't be fine-tuned directly (gradients don't flow through integer weights). For fine-tuning, use QLoRA with bitsandbytes instead.
GPTQ applies a second-order weight quantization algorithm that minimizes the reconstruction error of each layer's output rather than quantizing each weight independently. By accounting for the interactions between weights during quantization, GPTQ achieves significantly better model quality at 4-bit precision than simpler round-to-nearest approaches.
| Method | Algorithm | Quality at 4-bit | Quantization Speed | Inference Backend |
|---|---|---|---|---|
| GPTQ | Second-order (Hessian) | Very good | Slow (~hours) | AutoGPTQ, ExLlama |
| AWQ | Activation-aware | Very good | Medium (~30 min) | AutoAWQ, vLLM |
| GGUF Q4_K_M | K-quants (grouped) | Good | Fast | llama.cpp |
| bitsandbytes NF4 | NF4 per-block | Good | Fast (on-the-fly) | Transformers (BnB) |
GPTQ requires a small calibration dataset of representative text during quantization. The algorithm uses this dataset to compute per-layer Hessian matrices that characterize how sensitive each weight is to quantization error, then applies a greedy block-wise quantization that compensates for errors introduced in earlier weights when quantizing later weights. The calibration data significantly affects quality for domain-specific models: quantizing a code model with general text calibration data produces noticeably worse results than using code-specific calibration examples.
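Assuming each layer's calibration activations are collected into a d_in × n_tokens matrix, the Hessian of the layer-wise reconstruction objective is simple to form (the damping convention mirrors the `damp_percent` knob mentioned earlier):

```python
import numpy as np

def layer_hessian(X, damp_percent=0.01):
    """Hessian of the per-layer objective ||WX - W_q X||^2 is H = 2 X X^T,
    where X stacks calibration activations (d_in x n_tokens). Damping by
    a fraction of the mean diagonal keeps the inverse, used for error
    compensation, well conditioned."""
    H = 2.0 * (X @ X.T)
    H += damp_percent * np.mean(np.diag(H)) * np.eye(H.shape[0])
    return H
```

This is why calibration data matters: H encodes which input directions carry large activations, and the compensation step preserves accuracy precisely along those directions.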
ExLlamaV2 is the most performant inference backend for GPTQ-quantized models, implementing custom CUDA kernels for 4-bit matrix multiplication that achieve throughput competitive with FP16 inference at a fraction of the VRAM requirement. For 70B models that would require 4–5 A100s in FP16, ExLlamaV2 with GPTQ quantization enables serving on 2 consumer-grade GPUs. This makes GPTQ + ExLlamaV2 one of the most cost-effective stacks for self-hosting large open-weight models.
GPTQ quantization quality is sensitive to the order in which weights are quantized. The algorithm processes weights in a sequence determined by the columns of each weight matrix, greedily finding the quantization of each weight that minimizes the reconstruction error of the entire layer output given the already-quantized earlier weights. This sequential dependency means GPTQ cannot trivially be parallelized across weight dimensions, contributing to its longer quantization time compared to per-block methods like bitsandbytes NF4.
Group size is an important GPTQ hyperparameter controlling the granularity of the scale factors. A group size of 128 means that every 128 weights share one floating-point scale constant, providing moderate compression. Smaller group sizes (32) improve quantization quality at the cost of more scale storage overhead; the default group size of 128 represents the best practical trade-off for most models. For extreme compression requirements (3-bit quantization), smaller group sizes become more important because the per-group error budget is tighter.
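The quality side of that trade-off can be isolated with plain round-to-nearest quantization (no error compensation) on synthetic Gaussian weights; a small illustrative experiment:

```python
import numpy as np

def rtn_error(w, group_size, bits=4):
    """Mean |error| of round-to-nearest with per-group scales: smaller
    groups track the local weight range more tightly."""
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / qmax
    return float(np.abs(g - np.round(g / scales) * scales).mean())

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
errs = {g: rtn_error(w, g) for g in (32, 128, 1024)}
```

Smaller groups give each scale a narrower range to cover, so the per-weight rounding error shrinks; GPTQ's error compensation narrows this gap but does not remove it.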
Activation reordering (the desc_act or "act-order" option) is a GPTQ enhancement that changes the order in which weight columns are quantized: columns are processed in decreasing order of the corresponding activations' magnitude, as estimated from the Hessian diagonal. The most influential weights are therefore quantized first, while the algorithm still has the largest number of remaining columns available to absorb compensation updates. Models quantized with activation reordering achieve noticeably better perplexity at 4-bit and lower precisions, with the reordering metadata stored in the quantized model file and applied transparently during inference.
Verifying GPTQ quantization quality before deployment is straightforward: compute perplexity on a held-out text sample using both the FP16 original and the quantized model, then calculate the perplexity increase ratio. A well-quantized 4-bit model typically shows a perplexity increase of 5–15% versus FP16 for general text. Increases above 20% indicate quantization problems worth investigating; common causes include misconfigured group size, incompatible model architecture features, or poor calibration data. Task-specific evaluation on downstream benchmarks provides complementary signal to the perplexity metric.
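The perplexity computation itself is a few lines; a sketch working from raw logits (the helper name is mine, and in practice you obtain `logits` from `model(ids).logits` for both checkpoints and compare the two results):

```python
import math
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, ids):
    """exp of the mean next-token negative log-likelihood:
    logits[t] scores token ids[t + 1]."""
    nll = F.cross_entropy(logits[:-1], ids[1:])
    return math.exp(nll.item())
```

Run the same held-out token sequence through the FP16 and the GPTQ model; the ratio of the two perplexities is the degradation figure discussed above.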