Compression & Efficiency

Model Quantization

Reducing model memory by 2–8× with controlled accuracy loss — the formats, the methods, and the tradeoffs of when and how to quantize.

  • 2–8× memory reduction
  • INT4 vs FP16: the common tradeoff
  • Post-training or QAT: two main regimes
Contents
  1. Why quantize?
  2. Post-training quantization
  3. Quantization-aware training
  4. Calibration and quality
  5. Mixed precision
  6. Hardware acceleration
  7. Practical decision guide
01 — Motivation

Why Quantize?

Floating-point weights are the default in deep learning. A 7B parameter model in FP32 (32-bit floats) weighs 28 GB — too large for most consumer GPUs and expensive to serve. Even FP16 (16-bit, the current standard) is 14 GB for 7B, 140 GB for 70B.

Quantization reduces precision: convert FP32/FP16 weights to INT8, INT4, or custom formats like NF4. This cuts memory 2–8× with surprisingly small quality loss. The tradeoff: reduced precision costs some accuracy, and inference requires dequantization (fast but not free).

| Format | Bits | Bytes/param | 7B size | 70B size | Quality loss |
|---|---|---|---|---|---|
| FP32 | 32 | 4 | 28 GB | 280 GB | Baseline |
| BF16 | 16 | 2 | 14 GB | 140 GB | Negligible |
| FP8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Small (2–5%) |
| INT2 | 2 | 0.25 | 1.75 GB | 17.5 GB | Large (20%+) |

INT4 is the sweet spot for most practitioners: 4× memory reduction, ~5% quality loss at worst. INT8 gives ~2× reduction with negligible loss. INT2 is rarely practical due to quality degradation.

Quantization unlocks deployment: INT4 lets you run 70B models on a single 80GB GPU (vs. two for FP16). This matters for cost, latency, and accessibility.
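The sizes in the table above are simple arithmetic; a quick sketch to reproduce them (weights only — KV cache and activations add more at serving time):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Weight memory in GB for a given parameter count and bit width."""
    return n_params * bits / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: 7B = {model_memory_gb(7e9, bits):.1f} GB, "
          f"70B = {model_memory_gb(70e9, bits):.1f} GB")
```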
02 — No Retraining

Post-Training Quantization (PTQ)

Apply quantization after a model is fully trained. No retraining, no modification of the training loop. This is the default approach for open-source models: take a released weight file, convert it to INT4/INT8, and use it.

Weight-only quantization: quantize weights to INT4/INT8, keep activations (intermediate values) in FP16 during inference. The model dequantizes weights on the fly — slightly slower than native inference but much smaller.

Weight + activation quantization (W8A8): quantize both weights and activations to INT8. Requires hardware support (e.g., NVIDIA H100 has INT8 tensor cores) but can be significantly faster than weight-only on modern GPUs.
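A minimal NumPy sketch of the core operation — a symmetric per-tensor INT8 round-trip. Real libraries use per-channel or per-group scales, but the principle is the same:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs error {err:.2e} (half a quantization step is {scale/2:.2e})")
print(f"memory: {q.nbytes/1e6:.1f} MB INT8 vs {w.nbytes/1e6:.1f} MB FP32")
```

The worst-case error is half a quantization step, which is why the loss is small at 8 bits and grows sharply at 2 bits.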

1. GPTQ — layer-wise with Hessian

Quantize each layer using information from the Hessian (second derivative) of the loss. This captures which weights are most critical and protects them with error correction. Gold standard for 4-bit weight-only quantization.

  • Standard for LLaMA, Mistral, and other OSS models
  • Slow to quantize (hours for 70B) but very fast inference
  • Excellent quality at INT4
Python · Load an AWQ-quantized model and benchmark speed and memory
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
prompt = "Explain transformer attention in exactly two sentences."

# Load AWQ-quantized model (4-bit, ~4GB vs ~14GB for fp16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Benchmark generation
start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False  # greedy decoding; temperature is ignored when sampling is off
    )
elapsed = time.perf_counter() - start

response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)
tokens_generated = output.shape[1] - inputs.input_ids.shape[1]
print(f"Output: {response}")
print(f"Speed: {tokens_generated/elapsed:.1f} tok/s")
print(f"Memory: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
# AWQ 4-bit: ~3.8GB, ~45 tok/s on A10G
# fp16: ~14GB,  ~18 tok/s on A10G
2. AWQ — activation-aware

Protect salient weights (those with high activation magnitudes) by quantizing less aggressively. Faster to quantize than GPTQ while maintaining similar quality. Becoming more common.

  • ~3–4× faster quantization than GPTQ
  • Comparable or slightly better quality than GPTQ
  • Growing adoption in newer releases
3. SmoothQuant — activation migration

Migrate quantization difficulty from activations to weights via per-channel scaling. Enables W8A8 (both weights and activations INT8) on hardware with INT8 support, achieving 2× speedup vs. weight-only.

  • Enables W8A8 inference on H100, H200
  • Requires calibration data and hardware support
  • Used in production deployments (TensorRT-LLM)
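The migration trick can be sketched in a few lines of NumPy. The per-channel scales below follow SmoothQuant's α-balancing idea; all data is synthetic and the function name is illustrative:

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-channel scales that migrate range from activations to weights."""
    ax = np.abs(X).max(axis=0)    # per-input-channel activation range
    aw = np.abs(W).max(axis=1)    # per-input-channel weight range
    return (ax ** alpha) / (aw ** (1 - alpha))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (32, 16))
X[:, 2] *= 50                     # one outlier channel, hard to quantize
W = rng.normal(0, 0.02, (16, 8))
s = smooth_scales(X, W)
Y = (X / s) @ (W * s[:, None])    # mathematically identical to X @ W
assert np.allclose(Y, X @ W)
print(np.abs(X / s).max(axis=0)[2] < np.abs(X).max(axis=0)[2])  # True: range shrunk
```

The product is unchanged because the scales cancel; the outlier channel's activation range shrinks, and the difficulty moves into the weights, which tolerate quantization better.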
4. GGUF — CPU-friendly

Format designed for llama.cpp and local CPU/GPU inference. Supports mixed-precision quantization: different layers or even weight groups can use different bit widths (Q2_K, Q4_K_M, Q8_0, etc.). Pragmatic for end users.

  • Standard for llama.cpp ecosystem
  • Flexible per-layer quantization
  • Good CPU performance, runs on MacBooks

PTQ Trade-offs

Pros: No retraining needed, instant deployment. Cons: Slightly lower quality than QAT, requires careful calibration data selection. For most practitioners, PTQ is sufficient; only invest in QAT if quality plateaus.

03 — With Retraining

Quantization-Aware Training (QAT)

During training, simulate quantization noise using fake quantization ops. The model learns to be robust to the precision loss, so when you actually quantize post-training, quality is better than PTQ at the same bit width.
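A minimal PyTorch sketch of a fake-quantization op with a straight-through estimator (STE): the forward pass sees quantized values, while gradients flow through as if the op were the identity. This is a conceptual sketch, not a production QAT recipe:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate symmetric quantization in the forward pass.

    Backward uses the straight-through estimator: gradients pass
    through as if this op were the identity.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()   # forward: q, backward: identity

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w, bits=4).sum()
loss.backward()
print(torch.all(w.grad == 1.0).item())  # True: STE passes gradients through
```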

The cost: you must modify the training loop and run additional training. For a 70B model, this can take weeks. QAT is primarily used by model labs (e.g., Google, Meta) that have the compute budget and ship quantized weights directly, such as Google's QAT checkpoints for Gemma.

| Aspect | PTQ | QAT |
|---|---|---|
| Retraining needed | No | Yes (full run) |
| Time to apply | Hours | Days–weeks |
| Quality at INT4 | Good (≈7.6 PPL) | Better (≈7.2 PPL) |
| Who uses it | Most OSS deployments | Model labs (Google, Meta) |
| When to use | Default for serving | When PTQ quality insufficient |
⚠️ QAT is how Google ships its quantized Gemma checkpoints and how Meta ships quantized Llama variants. If you download an official pre-quantized release from Hugging Face, you're using QAT weights: the training is already done; you just download and use them.


04 — Data Dependency

Calibration and Quality Measurement

PTQ methods need a calibration dataset to measure activation ranges and choose quantization thresholds. Pick a poor calibration set (random noise) and quality suffers. Pick a good one (representative of your use case) and quality is much better.

Typical calibration: 128–512 diverse samples from the same domain as your inference data. For general chat, use a mix of instructions and documents. For specialized domains, use domain-specific text.
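A sketch of the range-collection step with NumPy — the data is synthetic, but it shows why percentile clipping beats taking the raw max when outliers are present:

```python
import numpy as np

def calibrate_thresholds(samples, percentile=99.0):
    """Per-channel clip thresholds from calibration activations."""
    acts = np.abs(np.concatenate(samples, axis=0))   # (total_tokens, channels)
    return np.percentile(acts, percentile, axis=0)

rng = np.random.default_rng(0)
# Hypothetical calibration activations: 4 batches of 64 tokens, 16 channels.
samples = [rng.normal(0, 1, (64, 16)) for _ in range(4)]
samples[0][0, 3] = 80.0   # a single outlier in channel 3
thr = calibrate_thresholds(samples)
print(thr[3] < 10)   # True: percentile clipping ignores the lone outlier
```

With a raw max, one outlier would stretch channel 3's scale by an order of magnitude and crush every normal value into a few quantization bins.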

Quality impact on Llama-3-8B (PPL on WikiText-2, lower is better):

| Method | PPL | Loss vs. baseline |
|---|---|---|
| Baseline (BF16) | 7.1 | Baseline |
| INT8 | 7.2 | +1.4% |
| INT4 GPTQ | 7.6 | +7.0% |
| INT4 AWQ | 7.5 | +5.6% |
| INT2 | 14.3 | +101% (unusable) |

Quality Metrics

Perplexity (PPL): Standard metric on WikiText-2 or C4. Lower is better. A 5–10% PPL increase is acceptable; >20% indicates poor quantization.
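Perplexity is just the exponential of the mean per-token negative log-likelihood. A tiny sketch, with made-up per-token log-probs chosen to land near the table's numbers:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token natural-log probabilities, for illustration only.
base  = [-1.9601] * 100   # PPL ≈ 7.1
quant = [-2.0281] * 100   # PPL ≈ 7.6
print(f"{perplexity(base):.2f} -> {perplexity(quant):.2f} "
      f"(+{100 * (perplexity(quant) / perplexity(base) - 1):.1f}%)")
```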

Downstream tasks: Run MMLU, HellaSwag, ARC to measure end-to-end impact. Some quantized models lose 1–2% accuracy on reasoning tasks but remain usable.

Human evaluation: For critical applications, have humans rate responses from the quantized model vs. baseline. This catches issues that PPL misses.

Rule of thumb: INT4 GPTQ/AWQ with good calibration data loses <7% PPL. If you're seeing >10% loss, your calibration data is bad or your model is particularly sensitive. Try different calibration sets.
05 — Selective Quantization

Mixed Precision and Layer Sensitivity

Not all layers are equally sensitive to quantization. Embedding layers and LM head (output projection) are very sensitive; MLPs are less so. A smart strategy: quantize robust layers to INT4, keep sensitive layers in INT8 or even FP16.

This is called mixed-precision quantization. GGUF's K-quant variants (Q4_K_M, Q5_K_M, etc.) do this automatically — important layers get higher precision, unimportant ones get lower.

| Layer type | Sensitivity | Recommended min bits |
|---|---|---|
| Embedding / LM head | Very high | INT8 or BF16 |
| Attention Q/K/V projections | High | INT8 (with care on INT4) |
| Attention output projection | High | INT8 |
| MLP gate / up projections | Low | INT4 safe |
| MLP down projection | Medium | INT4 acceptable |
| LayerNorm | Very high | BF16 (never quantize) |

Mixed-Precision Strategy

Start with uniform INT4, measure PPL loss. If it's >10%, switch sensitive layers (embedding, attention outputs) to INT8. This typically recovers 50% of the quality loss. Most practical frameworks support this directly; you just specify which layers to quantize to which bit width.
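As a sketch, per-layer bit assignment can be as simple as matching layer names against the sensitivity table above. The substrings follow common Hugging Face naming conventions; the mapping itself is an assumption, not any specific framework's API:

```python
# Illustrative bit-width assignment mirroring the sensitivity table.
def bits_for_layer(name: str) -> int:
    if any(k in name for k in ("embed", "lm_head", "norm")):
        return 16                                  # very sensitive: keep BF16
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 8                                   # attention projections: INT8
    return 4                                       # MLP projections: INT4

for n in ["model.embed_tokens", "layers.0.self_attn.q_proj",
          "layers.0.mlp.gate_proj", "layers.0.input_layernorm", "lm_head"]:
    print(f"{n}: {bits_for_layer(n)}-bit")
```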

⚠️ LayerNorm should never be quantized. These layers normalize activations and are extremely sensitive; even INT8 causes quality degradation. Always keep them in FP16 or FP32.
06 — GPU & CPU Support

Hardware Acceleration

Not all quantization formats run fast on all hardware. Some GPUs have native INT8 tensor cores; others require software emulation. Choosing the right format for your hardware matters.

| Hardware | INT8 GEMM | INT4 GEMM | FP8 | Notes |
|---|---|---|---|---|
| NVIDIA A100 | Native | Software | No | INT8 via cuBLAS, fast enough |
| NVIDIA H100 | Native | Software | Native | Native INT8 and FP8 tensor cores |
| NVIDIA H200 | Native | Software | Native | Same as H100 |
| Apple M-series | Software | Software | No | via Metal / llama.cpp |
| AMD MI300X | Native | Software | Native | via ROCm, comparable to H100 |

FP8 on H100: Both training and inference support FP8 natively. This is becoming a frontier-model standard — Llama 3.1 405B ships an official FP8 variant, and frontier labs increasingly train and serve in FP8.

INT4 inference: NVIDIA A100 doesn't have INT4 tensor cores, so INT4 inference uses software dequantization (float multiply). Still fast because dequantization is simple, but not as fast as hardware-native formats.
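Software dequantization starts from packed storage — two 4-bit values per byte. A NumPy sketch of the pack/unpack step, assuming a simple offset-binary encoding and an even number of weights (real kernels fuse the unpack with the matmul and add group-wise scales):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack int values in [-8, 7] two-per-byte (assumes even length)."""
    u = (q + 8).astype(np.uint8)           # offset to [0, 15]
    return (u[0::2] << 4) | u[1::2]        # high nibble, low nibble

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = hi, lo
    return out

q = np.array([-8, -1, 0, 7, 3, -5], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)
print(pack_int4(q).nbytes, "bytes for", q.size, "weights")  # 3 bytes for 6 weights
```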

FP8 training and inference on H100 is becoming the new default for frontier labs. Check if your serving framework (vLLM, TensorRT-LLM, SGLang) supports FP8 before defaulting to INT4. On H100, FP8 may be faster and have better quality than INT4 weight-only.
07 — When and How

Practical Decision Guide

Quantization decisions depend on your constraints: VRAM, latency, quality requirements, and available hardware. Here's a framework to decide what to quantize and how.

  • llama.cpp (local / CPU): GGUF format, mixed precision, runs on MacBooks and consumer GPUs
  • AutoGPTQ (PTQ): GPTQ quantization, standard for Hugging Face OSS models
  • AutoAWQ (PTQ): AWQ quantization, faster than GPTQ, growing adoption
  • bitsandbytes (fine-tuning): INT8/INT4 inference and fine-tuning, used in QLoRA
  • TensorRT-LLM (production): W8A8, FP8, and SmoothQuant for high-performance inference
  • Quanto (HF, fine-tuning): flexible quantization for training, integrates with Transformers

Decision Flowchart

Need to run 70B on 1×A100 (80 GB)?
  → FP16 is too big (140 GB) → INT4 GPTQ/AWQ (~38 GB for weights + activation cache) ✓ fits

Need 7B on a MacBook Pro (16 GB)?
  → BF16 (14 GB) fits, but tight → Q4_K_M GGUF (4.5 GB) gives headroom ✓ recommended

Need the best quality at INT4?
  → AWQ usually beats GPTQ on MMLU by 0.5–1% → try both on your eval set ✓ use the better one

Need the fastest INT8 on H100?
  → SmoothQuant W8A8 via TensorRT-LLM → 2× faster than weight-only ✓ use W8A8

Running 100 concurrent requests?
  → KV cache is your bottleneck, not weights → focus on cache optimization (prefix caching, paging); quantization helps less here ✓ manage cache first
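The flowchart condenses into a rough first-pass heuristic. The thresholds below (e.g., reserving ~30% of VRAM for KV cache and activations) are illustrative assumptions, not measured limits:

```python
def choose_quantization(model_gb_fp16: float, vram_gb: float, hw: str) -> str:
    """Rough first-pass format choice; mirrors the flowchart heuristics."""
    if hw == "h100" and model_gb_fp16 <= vram_gb:
        return "FP8 (native on Hopper, near-FP16 quality)"
    usable = vram_gb * 0.7               # leave headroom for KV cache
    if model_gb_fp16 <= usable:
        return "FP16/BF16 (no quantization needed)"
    if model_gb_fp16 / 2 <= usable:
        return "INT8 (W8A8 if hardware supports it)"
    if model_gb_fp16 / 4 <= usable:
        return "INT4 GPTQ/AWQ (or Q4_K_M GGUF locally)"
    return "INT4 plus offloading, or shard across GPUs"

print(choose_quantization(140, 80, "a100"))   # 70B on one A100 -> INT4
```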

Common Scenarios

Consumer deployment (1×consumer GPU, 24GB VRAM): INT4 GPTQ or Q4_K_M GGUF. Perplexity loss ~7%, acceptable for most tasks. Throughput: 5–10 tokens/sec.

Production API (1×A100, 80GB, 20 concurrent requests): INT4 AWQ weights with INT8 activations (SmoothQuant if you have H100). Calibrate on your actual data. Throughput: 50–100 tokens/sec across batch.

Cost-optimal (multiple smaller GPUs): Mixed quantization. Heavy layers INT8, lightweight INT4. Reduces memory vs. INT4 uniform, quality between INT4 and INT8.

Research / evaluation: Start with GPTQ (standard baseline). If quality is insufficient, try AWQ or move to QAT if budget allows. Measure on your eval set, not defaults.

⚠️ Quantized models are not drop-in replacements. Always benchmark quality on your specific use case before deploying to production. Perplexity is useful but may not capture your task's real quality loss.
08 — Further Reading


Learning Path

Quantization sits at the intersection of deep learning math and GPU hardware. Here's how to build up to it:

Float formats (FP32, FP16, BF16) → INT8 post-training (bitsandbytes) → NF4 / GGUF (4-bit loading) → AWQ / GPTQ (calibrated quant) → Serving (vLLM + quant)
1. Load a model in 4-bit first

Use BitsAndBytesConfig(load_in_4bit=True) with a Llama 3 8B model. Confirm it fits on your GPU and produces coherent output. Takes 15 minutes.

2. Measure the quality cost

Run a simple benchmark (MMLU sample or your task) at FP16 vs. 4-bit. The gap will usually be <2% on general tasks, larger on math-heavy ones.

3. Learn AWQ for production

AWQ calibrates quantization on a small dataset to minimize accuracy loss. Use autoawq. This is the recommended path for serving.

4. Use GGUF for local / edge

llama.cpp with GGUF models is the fastest path to running models on Mac (Metal) or CPU. Q4_K_M is the best quality/size tradeoff for most use cases.