Reducing model memory by 2–8× with controlled accuracy loss: the formats, the methods, and the tradeoffs that determine when and how to quantize.
Floating-point weights are the default in deep learning. A 7B parameter model in FP32 (32-bit floats) weighs 28 GB — too large for most consumer GPUs and expensive to serve. Even FP16 (16-bit, the current standard) is 14 GB for 7B, 140 GB for 70B.
Quantization reduces precision: convert FP32/FP16 weights to INT8, INT4, or custom formats like NF4. This cuts memory 2–8× with surprisingly small quality loss. The tradeoff: reduced precision costs some accuracy, and inference requires dequantization (fast but not free).
| Format | Bits | Bytes/param | 7B size | 70B size | Quality loss |
|---|---|---|---|---|---|
| FP32 | 32 | 4 | 28 GB | 280 GB | Baseline |
| FP16 / BF16 | 16 | 2 | 14 GB | 140 GB | Negligible |
| FP8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Small (2–5%) |
| INT2 | 2 | 0.25 | 1.75 GB | 17.5 GB | Large (20%+) |
INT4 is the sweet spot for most practitioners: 4× memory reduction, ~5% quality loss at worst. INT8 gives ~2× reduction with negligible loss. INT2 is rarely practical due to quality degradation.
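The arithmetic behind the table is simple enough to keep as a helper. A minimal sketch (the function name `model_size_gb` is ours; real quantized checkpoint files add a small overhead for quantization scales and zero-points):

```python
def model_size_gb(n_params: float, bits: float) -> float:
    """Estimated weight memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits / 8 / 1e9

print(model_size_gb(7e9, 32))   # 28.0 -- FP32, matches the table
print(model_size_gb(7e9, 4))    # 3.5  -- INT4/NF4
print(model_size_gb(70e9, 8))   # 70.0 -- INT8
```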
Post-training quantization (PTQ): apply quantization after a model is fully trained. No retraining, no modification of the training loop. This is the default approach for open-source models: take a released weight file, convert it to INT4/INT8, and serve it.
Weight-only quantization: quantize weights to INT4/INT8, keep activations (intermediate values) in FP16 during inference. The model dequantizes weights on the fly — slightly slower than native inference but much smaller.
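The weight-only round trip is straightforward to sketch in NumPy (symmetric per-tensor quantization for illustration; production kernels use per-channel or per-group scales):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric INT8 quantization: w ~= q * scale, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)               # stored: 1 byte/param plus one scale

# Weight-only inference: dequantize on the fly, matmul stays in float
x = rng.standard_normal((1, 64)).astype(np.float32)
y = x @ (q.astype(np.float32) * scale)
print(np.abs(y - x @ w).max())            # small reconstruction error
```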
Weight + activation quantization (W8A8): quantize both weights and activations to INT8. Requires hardware support (e.g., NVIDIA H100 has INT8 tensor cores) but can be significantly faster than weight-only on modern GPUs.
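A sketch of what W8A8 buys: the inner product runs entirely in integers, with a single float rescale at the end (per-tensor scales here for brevity):

```python
import numpy as np

def q8(t):
    """Symmetric per-tensor INT8 quantization: returns codes and scale."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)    # activations
w = rng.standard_normal((64, 64)).astype(np.float32)   # weights
xq, sx = q8(x)
wq, sw = q8(w)

# Integer matmul (the operation INT8 tensor cores accelerate),
# accumulated in int32, then one float rescale recovers the result
y = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float32) * (sx * sw)
print(np.abs(y - x @ w).max())   # small; no float math inside the matmul
```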
GPTQ: quantize each layer using second-order (Hessian) information about the loss. This identifies which weights are most critical and compensates for each quantization step by adjusting the not-yet-quantized weights. The long-standing gold standard for 4-bit weight-only quantization.
```python
# Requires: pip install transformers autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
prompt = "Explain transformer attention in exactly two sentences."

# Load AWQ-quantized model (4-bit, ~4GB vs ~14GB for fp16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Benchmark greedy generation
start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding; temperature is ignored
    )
elapsed = time.perf_counter() - start

response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
tokens_generated = output.shape[1] - inputs.input_ids.shape[1]
print(f"Output: {response}")
print(f"Speed: {tokens_generated/elapsed:.1f} tok/s")
print(f"Memory: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# AWQ 4-bit: ~3.8GB, ~45 tok/s on A10G
# fp16:      ~14GB, ~18 tok/s on A10G
```
AWQ (Activation-aware Weight Quantization): protect salient weights (those multiplied by high-magnitude activations) by quantizing them less aggressively. Faster to apply than GPTQ while maintaining similar quality, and becoming more common.
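The core trick can be sketched in NumPy. This is our illustration, not the autoawq implementation: scaling salient weight rows up while folding the inverse into the activations is mathematically a no-op, but it shrinks quantization error exactly where large activations would amplify it:

```python
import numpy as np

def q4(w):
    """4-bit quantize + dequantize with one shared scale (for illustration)."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64)).astype(np.float32)
x[:, :4] *= 20.0                           # a few salient input channels
w = rng.standard_normal((64, 64)).astype(np.float32)

s = np.ones(64, dtype=np.float32)
s[:4] = 2.0                                # modest scale on the salient rows

# Exact reparameterization: (x / s) @ (s * w) == x @ w
assert np.allclose((x / s) @ (w * s[:, None]), x @ w, atol=1e-2)

err_plain = np.abs(x @ q4(w) - x @ w).mean()
err_awq = np.abs((x / s) @ q4(w * s[:, None]) - x @ w).mean()
print(err_plain, err_awq)   # the scaled variant typically has lower error
```

Real AWQ chooses `s` per channel from calibration activation statistics (searching an exponent on the mean activation magnitude) and quantizes with per-group weight scales; this toy per-tensor version only illustrates the direction of the effect.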
SmoothQuant: migrate quantization difficulty from activations to weights via per-channel scaling. Enables W8A8 (both weights and activations INT8) on hardware with INT8 support, achieving 2× speedup vs. weight-only.
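The scaling identity is easy to verify. A sketch following the paper's formulation with α = 0.5 (variable names are ours): per-channel factors move activation outliers into the weights, where they are easier to quantize:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64)).astype(np.float32)
x[:, 7] *= 100.0                  # one outlier activation channel (the W8A8 killer)
w = rng.standard_normal((64, 64)).astype(np.float32)

# SmoothQuant factor: s_j = max|x_j|^a / max|w_j|^(1-a), per input channel j
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_s, w_s = x / s, w * s[:, None]

assert np.allclose(x_s @ w_s, x @ w, atol=1e-2)   # exact reparameterization

# Activation dynamic range (max/mean) shrinks, so INT8 activations become viable
print(np.abs(x).max() / np.abs(x).mean(), np.abs(x_s).max() / np.abs(x_s).mean())
```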
GGUF: a file format designed for llama.cpp and local CPU/GPU inference. Supports mixed-precision quantization: different layers or even weight groups can use different bit widths (Q2_K, Q4_K_M, Q8_0, etc.). Pragmatic for end users.
PTQ pros: no retraining needed, near-instant deployment. Cons: slightly lower quality than QAT, and sensitivity to calibration-data selection. For most practitioners PTQ is sufficient; invest in QAT only if quality plateaus.
Quantization-aware training (QAT): during training, simulate quantization noise using fake-quantization ops. The model learns to be robust to the precision loss, so when you actually quantize post-training, quality is better than PTQ at the same bit width.
The cost: you must modify the training loop and run a substantial training pass again, which for a 70B model means weeks of compute. QAT is therefore mostly used by model labs with the budget to ship quantized weights directly (e.g., Google's QAT Gemma checkpoints, Meta's quantized Llama 3.2 releases).
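The fake-quantization op itself is small. A forward-pass sketch in NumPy (in a real training framework you would wrap it with a straight-through estimator so gradients pass through unchanged):

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantize then immediately dequantize: the forward pass sees the
    precision loss, while storage stays in float. In PyTorch the
    straight-through estimator is typically written as
        w + (fake_quant(w) - w).detach()
    so the backward pass treats the op as the identity."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
print(fake_quant(w, bits=4))   # values snapped to the 15-level 4-bit grid
```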
| Aspect | PTQ | QAT |
|---|---|---|
| Retraining needed | No | Yes (full run) |
| Time to apply | Hours | Days–weeks |
| Quality at INT4 | Good (≈7.6 PPL) | Better (≈7.2 PPL) |
| Who uses it | Most OSS deployments | Model labs (Google, Meta) |
| When to use | Default for serving | When PTQ quality insufficient |
PTQ methods need a calibration dataset to measure activation ranges and choose quantization thresholds. Pick a poor calibration set (random noise) and quality suffers. Pick a good one (representative of your use case) and quality is much better.
Typical calibration: 128–512 diverse samples from the same domain as your inference data. For general chat, use a mix of instructions and documents. For specialized domains, use domain-specific text.
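One detail worth sketching: ranges are usually taken at a high percentile of the calibration activations rather than the absolute max, so a single freak outlier doesn't blow up the quantization step (the function name `calibrate_range` is ours):

```python
import numpy as np

def calibrate_range(samples, percentile=99.9):
    """Pick a symmetric clipping range from calibration activations.
    Clipping at a high percentile trades a little clipping error on rare
    outliers for much finer resolution on the bulk of the values."""
    flat = np.abs(np.concatenate([s.ravel() for s in samples]))
    return float(np.percentile(flat, percentile))

rng = np.random.default_rng(0)
samples = [rng.standard_normal(1024) for _ in range(128)]   # 128 calibration batches
samples[0][0] = 500.0                                       # one extreme outlier

print(calibrate_range(samples))          # ~3.3: the outlier barely moves it
print(calibrate_range(samples, 100.0))   # 500.0: a naive max is hostage to it
```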
Perplexity (PPL): Standard metric on WikiText-2 or C4. Lower is better. A 5–10% PPL increase is acceptable; >20% indicates poor quantization.
Downstream tasks: Run MMLU, HellaSwag, ARC to measure end-to-end impact. Some quantized models lose 1–2% accuracy on reasoning tasks but remain usable.
Human evaluation: For critical applications, have humans rate responses from the quantized model vs. baseline. This catches issues that PPL misses.
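For the perplexity check, the regression math is worth pinning down (helper names are ours): PPL is the exponential of the mean per-token negative log-likelihood, and the thresholds above are relative increases over the FP16 baseline:

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_regression(baseline_ppl, quant_ppl):
    """Relative PPL increase; above 0.20 signals poor quantization."""
    return (quant_ppl - baseline_ppl) / baseline_ppl

# e.g. an fp16 baseline at 5.7 PPL vs. an INT4 model at 6.0 PPL on WikiText-2
print(f"{ppl_regression(5.7, 6.0):.1%}")   # 5.3% -- inside the acceptable band
```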
Not all layers are equally sensitive to quantization. Embedding layers and LM head (output projection) are very sensitive; MLPs are less so. A smart strategy: quantize robust layers to INT4, keep sensitive layers in INT8 or even FP16.
This is called mixed-precision quantization. GGUF's K-quant variants (Q4_K_M, Q5_K_M, etc.) do this automatically — important layers get higher precision, unimportant ones get lower.
| Layer type | Sensitivity | Recommended min bits |
|---|---|---|
| Embedding / LM head | Very high | INT8 or BF16 |
| Attention Q/K/V projections | High | INT8 (with care on INT4) |
| Attention output projection | High | INT8 |
| MLP gate / up projections | Low | INT4 safe |
| MLP down projection | Medium | INT4 acceptable |
| LayerNorm | Very high | BF16 (never quantize) |
Start with uniform INT4, measure PPL loss. If it's >10%, switch sensitive layers (embedding, attention outputs) to INT8. This typically recovers 50% of the quality loss. Most practical frameworks support this directly; you just specify which layers to quantize to which bit width.
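The per-layer policy from the table reduces to a lookup. A hypothetical helper (the layer-name substrings follow the Llama naming convention; real frameworks expose equivalent per-layer bit-width options):

```python
SENSITIVE = ("embed", "lm_head", "norm")                       # keep in 16-bit
CAREFUL = ("q_proj", "k_proj", "v_proj", "o_proj", "down_proj")  # INT8

def bits_for_layer(name: str) -> int:
    """Map a layer name to a bit width per the sensitivity table above."""
    if any(key in name for key in SENSITIVE):
        return 16
    if any(key in name for key in CAREFUL):
        return 8
    return 4   # MLP gate/up projections: INT4 is safe

for layer in ["model.embed_tokens", "layers.0.self_attn.q_proj",
              "layers.0.mlp.gate_proj", "lm_head"]:
    print(layer, bits_for_layer(layer))
```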
Not all quantization formats run fast on all hardware. Some GPUs have native INT8 tensor cores; others require software emulation. Choosing the right format for your hardware matters.
| Hardware | INT8 GEMM | INT4 GEMM | FP8 | Notes |
|---|---|---|---|---|
| NVIDIA A100 | ✓ | Software | ✗ | INT8 via cuBLAS, fast enough |
| NVIDIA H100 | ✓ | ✓ | ✓ | Native support for all formats |
| NVIDIA H200 | ✓ | ✓ | ✓ | Same as H100 |
| Apple M-series | ✓ | ✓ | ✗ | via Metal / llama.cpp |
| AMD MI300X | ✓ | ✓ | ✓ | via ROCm, comparable to H100 |
FP8 on H100: both training and inference support FP8 natively. This is becoming a frontier-model standard: Meta serves Llama 3.1 405B with FP8 quantization, and FP8 is increasingly used in large-scale training for efficiency.
INT4 inference: weight-only INT4 kernels (GPTQ/AWQ) keep activations in FP16, so weights are dequantized to FP16 on the fly inside the kernel and the multiply itself runs in floating point. This is still fast, because autoregressive decoding is memory-bound and moving 4-bit weights cuts bandwidth 4×, but it is not as fast as a hardware-native integer GEMM.
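What "software dequantization" looks like underneath, sketched in NumPy (our illustration; real kernels fuse unpack, dequantize, and matmul into one pass): two 4-bit codes are packed per byte, then expanded to float right before the multiply:

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    q = q.astype(np.uint8)
    return (q[..., ::2] << 4) | q[..., 1::2]

def unpack_int4(p):
    """Inverse of pack_int4."""
    out = np.empty(p.shape[:-1] + (p.shape[-1] * 2,), dtype=np.uint8)
    out[..., ::2], out[..., 1::2] = p >> 4, p & 0x0F
    return out

rng = np.random.default_rng(0)
q = rng.integers(0, 16, size=(64, 64))        # unsigned 4-bit weight codes
packed = pack_int4(q)
print(packed.nbytes, q.size)                  # 2048 bytes for 4096 params

# Inference path: unpack, dequantize with (scale, zero-point), matmul in float
scale, zero = 0.02, 8
w = (unpack_int4(packed).astype(np.float32) - zero) * scale
x = rng.standard_normal((1, 64)).astype(np.float32)
y = x @ w                                     # the matmul itself is FP math
```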
Quantization decisions depend on your constraints: VRAM, latency, quality requirements, and available hardware. Here's a framework to decide what to quantize and how.
Consumer deployment (1×consumer GPU, 24GB VRAM): INT4 GPTQ or Q4_K_M GGUF. Perplexity loss ~7%, acceptable for most tasks. Throughput: 5–10 tokens/sec.
Production API (1×A100, 80GB, 20 concurrent requests): INT4 AWQ weights with INT8 activations (SmoothQuant if you have H100). Calibrate on your actual data. Throughput: 50–100 tokens/sec across batch.
Cost-optimal (multiple smaller GPUs): Mixed quantization. Heavy layers INT8, lightweight INT4. Reduces memory vs. INT4 uniform, quality between INT4 and INT8.
Research / evaluation: Start with GPTQ (standard baseline). If quality is insufficient, try AWQ or move to QAT if budget allows. Measure on your eval set, not defaults.
Quantization sits at the intersection of deep learning math and GPU hardware. Here's how to build up to it:
1. Use `BitsAndBytesConfig(load_in_4bit=True)` with a Llama 3 8B model. Confirm it fits on your GPU and produces coherent output. Takes 15 minutes.
2. Run a simple benchmark (an MMLU sample or your own task) at FP16 vs. 4-bit. The gap is usually <2% on general tasks, larger on math-heavy ones.
3. Quantize a model yourself with AWQ, which calibrates on a small dataset to minimize accuracy loss. Use `autoawq`. This is the recommended path for serving.
4. Go local: llama.cpp with GGUF models is the fastest path to running models on a Mac (Metal) or CPU. Q4_K_M is the best quality/size tradeoff for most use cases.