Reducing model memory by 2–8× with controlled accuracy loss: the formats, the methods, and the tradeoffs that determine when and how to quantize.
Floating-point weights are the default in deep learning. A 7B parameter model in FP32 (32-bit floats) weighs 28 GB — too large for most consumer GPUs and expensive to serve. Even FP16 (16-bit, the current standard) is 14 GB for 7B, 140 GB for 70B.
Quantization reduces precision: convert FP32/FP16 weights to INT8, INT4, or custom formats like NF4. This cuts memory 2–8× with surprisingly small quality loss. The tradeoff: reduced precision costs some accuracy, and inference requires dequantization (fast but not free).
| Format | Bits | Bytes/param | 7B size | 70B size | Quality loss |
|---|---|---|---|---|---|
| FP32 | 32 | 4 | 28 GB | 280 GB | Baseline |
| FP16 / BF16 | 16 | 2 | 14 GB | 140 GB | Negligible |
| FP8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Minimal (<1%) |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Small (2–5%) |
| INT2 | 2 | 0.25 | 1.75 GB | 17.5 GB | Large (20%+) |
INT4 is the sweet spot for most practitioners: 4× memory reduction, ~5% quality loss at worst. INT8 gives ~2× reduction with negligible loss. INT2 is rarely practical due to quality degradation.
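The arithmetic behind the table is simple enough to keep as a helper. A minimal sketch (the function name `model_size_gb` is ours; real quantized checkpoint files add a small overhead for quantization scales and zero-points):

```python
def model_size_gb(n_params: float, bits: float) -> float:
    """Estimated weight memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits / 8 / 1e9

print(model_size_gb(7e9, 32))   # 28.0 -- FP32, matches the table
print(model_size_gb(7e9, 4))    # 3.5  -- INT4/NF4
print(model_size_gb(70e9, 8))   # 70.0 -- INT8
```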
Post-training quantization (PTQ): apply quantization after a model is fully trained. No retraining, no modification of the training loop. This is the default approach for open-source models: take a released weight file, convert it to INT4/INT8, and serve it.
Weight-only quantization: quantize weights to INT4/INT8, keep activations (intermediate values) in FP16 during inference. The model dequantizes weights on the fly — slightly slower than native inference but much smaller.
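The weight-only round trip is straightforward to sketch in NumPy (symmetric per-tensor quantization for illustration; production kernels use per-channel or per-group scales):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric INT8 quantization: w ~= q * scale, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)               # stored: 1 byte/param plus one scale

# Weight-only inference: dequantize on the fly, matmul stays in float
x = rng.standard_normal((1, 64)).astype(np.float32)
y = x @ (q.astype(np.float32) * scale)
print(np.abs(y - x @ w).max())            # small reconstruction error
```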
Weight + activation quantization (W8A8): quantize both weights and activations to INT8. Requires hardware support (e.g., NVIDIA H100 has INT8 tensor cores) but can be significantly faster than weight-only on modern GPUs.
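A sketch of what W8A8 buys: the inner product runs entirely in integers, with a single float rescale at the end (per-tensor scales here for brevity):

```python
import numpy as np

def q8(t):
    """Symmetric per-tensor INT8 quantization: returns codes and scale."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)    # activations
w = rng.standard_normal((64, 64)).astype(np.float32)   # weights
xq, sx = q8(x)
wq, sw = q8(w)

# Integer matmul (the operation INT8 tensor cores accelerate),
# accumulated in int32, then one float rescale recovers the result
y = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float32) * (sx * sw)
print(np.abs(y - x @ w).max())   # small; no float math inside the matmul
```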
GPTQ: quantize each layer using second-order (Hessian) information about the loss. This identifies which weights are most critical and compensates for each quantization step by adjusting the not-yet-quantized weights. The long-standing gold standard for 4-bit weight-only quantization.
```python
# Requires: pip install transformers autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
prompt = "Explain transformer attention in exactly two sentences."

# Load AWQ-quantized model (4-bit, ~4GB vs ~14GB for fp16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Benchmark greedy generation
start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding; temperature is ignored
    )
elapsed = time.perf_counter() - start

response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
tokens_generated = output.shape[1] - inputs.input_ids.shape[1]
print(f"Output: {response}")
print(f"Speed: {tokens_generated/elapsed:.1f} tok/s")
print(f"Memory: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# AWQ 4-bit: ~3.8GB, ~45 tok/s on A10G
# fp16:      ~14GB, ~18 tok/s on A10G
```
AWQ (Activation-aware Weight Quantization): protect salient weights (those multiplied by high-magnitude activations) by quantizing them less aggressively. Faster to apply than GPTQ while maintaining similar quality, and becoming more common.
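The core trick can be sketched in NumPy. This is our illustration, not the autoawq implementation: scaling salient weight rows up while folding the inverse into the activations is mathematically a no-op, but it shrinks quantization error exactly where large activations would amplify it:

```python
import numpy as np

def q4(w):
    """4-bit quantize + dequantize with one shared scale (for illustration)."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64)).astype(np.float32)
x[:, :4] *= 20.0                           # a few salient input channels
w = rng.standard_normal((64, 64)).astype(np.float32)

s = np.ones(64, dtype=np.float32)
s[:4] = 2.0                                # modest scale on the salient rows

# Exact reparameterization: (x / s) @ (s * w) == x @ w
assert np.allclose((x / s) @ (w * s[:, None]), x @ w, atol=1e-2)

err_plain = np.abs(x @ q4(w) - x @ w).mean()
err_awq = np.abs((x / s) @ q4(w * s[:, None]) - x @ w).mean()
print(err_plain, err_awq)   # the scaled variant typically has lower error
```

Real AWQ chooses `s` per channel from calibration activation statistics (searching an exponent on the mean activation magnitude) and quantizes with per-group weight scales; this toy per-tensor version only illustrates the direction of the effect.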
SmoothQuant: migrate quantization difficulty from activations to weights via per-channel scaling. Enables W8A8 (both weights and activations INT8) on hardware with INT8 support, achieving 2× speedup vs. weight-only.
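The scaling identity is easy to verify. A sketch following the paper's formulation with α = 0.5 (variable names are ours): per-channel factors move activation outliers into the weights, where they are easier to quantize:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64)).astype(np.float32)
x[:, 7] *= 100.0                  # one outlier activation channel (the W8A8 killer)
w = rng.standard_normal((64, 64)).astype(np.float32)

# SmoothQuant factor: s_j = max|x_j|^a / max|w_j|^(1-a), per input channel j
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_s, w_s = x / s, w * s[:, None]

assert np.allclose(x_s @ w_s, x @ w, atol=1e-2)   # exact reparameterization

# Activation dynamic range (max/mean) shrinks, so INT8 activations become viable
print(np.abs(x).max() / np.abs(x).mean(), np.abs(x_s).max() / np.abs(x_s).mean())
```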
GGUF: a file format designed for llama.cpp and local CPU/GPU inference. Supports mixed-precision quantization: different layers or even weight groups can use different bit widths (Q2_K, Q4_K_M, Q8_0, etc.). Pragmatic for end users.
PTQ pros: no retraining needed, near-instant deployment. Cons: slightly lower quality than QAT, and sensitivity to calibration-data selection. For most practitioners PTQ is sufficient; invest in QAT only if quality plateaus.
Quantization-aware training (QAT): during training, simulate quantization noise using fake-quantization ops. The model learns to be robust to the precision loss, so when you actually quantize post-training, quality is better than PTQ at the same bit width.
The cost: you must modify the training loop and run a substantial training pass again, which for a 70B model means weeks of compute. QAT is therefore mostly used by model labs with the budget to ship quantized weights directly (e.g., Google's QAT Gemma checkpoints, Meta's quantized Llama 3.2 releases).
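The fake-quantization op itself is small. A forward-pass sketch in NumPy (in a real training framework you would wrap it with a straight-through estimator so gradients pass through unchanged):

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantize then immediately dequantize: the forward pass sees the
    precision loss, while storage stays in float. In PyTorch the
    straight-through estimator is typically written as
        w + (fake_quant(w) - w).detach()
    so the backward pass treats the op as the identity."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
print(fake_quant(w, bits=4))   # values snapped to the 15-level 4-bit grid
```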
| Aspect | PTQ | QAT |
|---|---|---|
| Retraining needed | No | Yes (full run) |
| Time to apply | Hours | Days–weeks |
| Quality at INT4 | Good (≈7.6 PPL) | Better (≈7.2 PPL) |
| Who uses it | Most OSS deployments | Model labs (Google, Meta) |
| When to use | Default for serving | When PTQ quality insufficient |
PTQ methods need a calibration dataset to measure activation ranges and choose quantization thresholds. Pick a poor calibration set (random noise) and quality suffers. Pick a good one (representative of your use case) and quality is much better.
Typical calibration: 128–512 diverse samples from the same domain as your inference data. For general chat, use a mix of instructions and documents. For specialized domains, use domain-specific text.
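One detail worth sketching: ranges are usually taken at a high percentile of the calibration activations rather than the absolute max, so a single freak outlier doesn't blow up the quantization step (the function name `calibrate_range` is ours):

```python
import numpy as np

def calibrate_range(samples, percentile=99.9):
    """Pick a symmetric clipping range from calibration activations.
    Clipping at a high percentile trades a little clipping error on rare
    outliers for much finer resolution on the bulk of the values."""
    flat = np.abs(np.concatenate([s.ravel() for s in samples]))
    return float(np.percentile(flat, percentile))

rng = np.random.default_rng(0)
samples = [rng.standard_normal(1024) for _ in range(128)]   # 128 calibration batches
samples[0][0] = 500.0                                       # one extreme outlier

print(calibrate_range(samples))          # ~3.3: the outlier barely moves it
print(calibrate_range(samples, 100.0))   # 500.0: a naive max is hostage to it
```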
Perplexity (PPL): Standard metric on WikiText-2 or C4. Lower is better. A 5–10% PPL increase is acceptable; >20% indicates poor quantization.
Downstream tasks: Run MMLU, HellaSwag, ARC to measure end-to-end impact. Some quantized models lose 1–2% accuracy on reasoning tasks but remain usable.
Human evaluation: For critical applications, have humans rate responses from the quantized model vs. baseline. This catches issues that PPL misses.
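For the perplexity check, the regression math is worth pinning down (helper names are ours): PPL is the exponential of the mean per-token negative log-likelihood, and the thresholds above are relative increases over the FP16 baseline:

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_regression(baseline_ppl, quant_ppl):
    """Relative PPL increase; above 0.20 signals poor quantization."""
    return (quant_ppl - baseline_ppl) / baseline_ppl

# e.g. an fp16 baseline at 5.7 PPL vs. an INT4 model at 6.0 PPL on WikiText-2
print(f"{ppl_regression(5.7, 6.0):.1%}")   # 5.3% -- inside the acceptable band
```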
Not all layers are equally sensitive to quantization. Embedding layers and LM head (output projection) are very sensitive; MLPs are less so. A smart strategy: quantize robust layers to INT4, keep sensitive layers in INT8 or even FP16.
This is called mixed-precision quantization. GGUF's K-quant variants (Q4_K_M, Q5_K_M, etc.) do this automatically — important layers get higher precision, unimportant ones get lower.
| Layer type | Sensitivity | Recommended min bits |
|---|---|---|
| Embedding / LM head | Very high | INT8 or BF16 |
| Attention Q/K/V projections | High | INT8 (with care on INT4) |
| Attention output projection | High | INT8 |
| MLP gate / up projections | Low | INT4 safe |
| MLP down projection | Medium | INT4 acceptable |
| LayerNorm | Very high | BF16 (never quantize) |
Start with uniform INT4, measure PPL loss. If it's >10%, switch sensitive layers (embedding, attention outputs) to INT8. This typically recovers 50% of the quality loss. Most practical frameworks support this directly; you just specify which layers to quantize to which bit width.
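The per-layer policy from the table reduces to a lookup. A hypothetical helper (the layer-name substrings follow the Llama naming convention; real frameworks expose equivalent per-layer bit-width options):

```python
SENSITIVE = ("embed", "lm_head", "norm")                       # keep in 16-bit
CAREFUL = ("q_proj", "k_proj", "v_proj", "o_proj", "down_proj")  # INT8

def bits_for_layer(name: str) -> int:
    """Map a layer name to a bit width per the sensitivity table above."""
    if any(key in name for key in SENSITIVE):
        return 16
    if any(key in name for key in CAREFUL):
        return 8
    return 4   # MLP gate/up projections: INT4 is safe

for layer in ["model.embed_tokens", "layers.0.self_attn.q_proj",
              "layers.0.mlp.gate_proj", "lm_head"]:
    print(layer, bits_for_layer(layer))
```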
Not all quantization formats run fast on all hardware. Some GPUs have native INT8 tensor cores; others require software emulation. Choosing the right format for your hardware matters.
| Hardware | INT8 GEMM | INT4 GEMM | FP8 | Notes |
|---|---|---|---|---|
| NVIDIA A100 | ✓ | Software | ✗ | INT8 via cuBLAS, fast enough |
| NVIDIA H100 | ✓ | ✓ | ✓ | Native support for all formats |
| NVIDIA H200 | ✓ | ✓ | ✓ | Same as H100 |
| Apple M-series | ✓ | ✓ | ✗ | via Metal / llama.cpp |
| AMD MI300X | ✓ | ✓ | ✓ | via ROCm, comparable to H100 |
FP8 on H100: both training and inference support FP8 natively. This is becoming a frontier-model standard: Meta serves Llama 3.1 405B with FP8 quantization, and FP8 is increasingly used in large-scale training for efficiency.
INT4 inference: weight-only INT4 kernels (GPTQ/AWQ) keep activations in FP16, so weights are dequantized to FP16 on the fly inside the kernel and the multiply itself runs in floating point. This is still fast, because autoregressive decoding is memory-bound and moving 4-bit weights cuts bandwidth 4×, but it is not as fast as a hardware-native integer GEMM.
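What "software dequantization" looks like underneath, sketched in NumPy (our illustration; real kernels fuse unpack, dequantize, and matmul into one pass): two 4-bit codes are packed per byte, then expanded to float right before the multiply:

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    q = q.astype(np.uint8)
    return (q[..., ::2] << 4) | q[..., 1::2]

def unpack_int4(p):
    """Inverse of pack_int4."""
    out = np.empty(p.shape[:-1] + (p.shape[-1] * 2,), dtype=np.uint8)
    out[..., ::2], out[..., 1::2] = p >> 4, p & 0x0F
    return out

rng = np.random.default_rng(0)
q = rng.integers(0, 16, size=(64, 64))        # unsigned 4-bit weight codes
packed = pack_int4(q)
print(packed.nbytes, q.size)                  # 2048 bytes for 4096 params

# Inference path: unpack, dequantize with (scale, zero-point), matmul in float
scale, zero = 0.02, 8
w = (unpack_int4(packed).astype(np.float32) - zero) * scale
x = rng.standard_normal((1, 64)).astype(np.float32)
y = x @ w                                     # the matmul itself is FP math
```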
Quantization decisions depend on your constraints: VRAM, latency, quality requirements, and available hardware. Here's a framework to decide what to quantize and how.
Consumer deployment (1×consumer GPU, 24GB VRAM): INT4 GPTQ or Q4_K_M GGUF. Perplexity loss ~7%, acceptable for most tasks. Throughput: 5–10 tokens/sec.
Production API (1×A100, 80GB, 20 concurrent requests): INT4 AWQ weights with INT8 activations (SmoothQuant if you have H100). Calibrate on your actual data. Throughput: 50–100 tokens/sec across batch.
Cost-optimal (multiple smaller GPUs): Mixed quantization. Heavy layers INT8, lightweight INT4. Reduces memory vs. INT4 uniform, quality between INT4 and INT8.
Research / evaluation: Start with GPTQ (standard baseline). If quality is insufficient, try AWQ or move to QAT if budget allows. Measure on your eval set, not defaults.
Quantization sits at the intersection of deep learning math and GPU hardware. Here's how to build up to it:
1. Use `BitsAndBytesConfig(load_in_4bit=True)` with a Llama 3 8B model. Confirm it fits on your GPU and produces coherent output. Takes 15 minutes.
2. Run a simple benchmark (an MMLU sample or your own task) at FP16 vs. 4-bit. The gap is usually <2% on general tasks, larger on math-heavy ones.
3. Quantize a model yourself with AWQ, which calibrates on a small dataset to minimize accuracy loss. Use `autoawq`. This is the recommended path for serving.
4. Go local: llama.cpp with GGUF models is the fastest path to running models on a Mac (Metal) or CPU. Q4_K_M is the best quality/size tradeoff for most use cases.