Quantisation

AWQ

Activation-aware Weight Quantisation — state-of-the-art 4-bit LLM compression that preserves the 1% of salient weights identified via activation magnitudes. Better perplexity than GPTQ at the same bit-width.

4-bit target precision
Salient-weight protection
3× faster than fp16 on edge

SECTION 01

AWQ algorithm

AWQ (Lin et al. 2023) addresses a key observation: not all LLM weights are equally important. A small fraction (~1%) of weights — those corresponding to channels with large activation magnitudes — have disproportionate impact on model quality. Quantising these with more precision (or protecting them) preserves most of the model's capability.

Rather than keeping those weights in fp16 (which wastes memory), AWQ finds a per-channel scale factor that makes salient channels "easier to quantise" — by rescaling them before quantisation and rescaling back after, effectively giving them finer quantisation granularity without extra storage. The scale factors are determined by a grid search over activation-calibrated candidates.
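The scale-and-rescale trick can be sketched in a few lines of NumPy. Everything below (the toy shapes, the per-tensor RTN quantiser, the `alpha` grid) is illustrative rather than AutoAWQ's actual internals, but the objective is the paper's: pick per-channel scales s that minimise the output error of (x / s) @ Q(w · s).

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Round-to-nearest symmetric quantisation (per-tensor here, for brevity)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def awq_scale_search(w, x, n_grid=20):
    """Grid-search per-channel scales s = act_mag**alpha that minimise the
    output error of the rescaled quantisation  y ~= (x / s) @ Q(w * s)."""
    act_mag = np.abs(x).mean(axis=0)              # per input channel
    best_err, best_s = np.inf, np.ones(w.shape[0])
    for alpha in np.linspace(0.0, 1.0, n_grid):   # alpha=0 recovers plain RTN
        s = act_mag ** alpha
        err = np.mean((x @ w - (x / s) @ quantize_rtn(w * s[:, None])) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
x[:, 0] *= 10                                     # one salient input channel
w = rng.normal(size=(32, 16))
s, err = awq_scale_search(w, x)
base_err = np.mean((x @ w - x @ quantize_rtn(w)) ** 2)
print(f"plain RTN error {base_err:.4f} -> AWQ-scaled error {err:.4f}")
```

Because alpha = 0 is in the grid, the search can never do worse than plain RTN; on inputs with a salient channel it typically does noticeably better.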

This calibration step is fast (minutes vs hours for GPTQ) and doesn't require backprop. The result is a 4-bit model that typically outperforms GPTQ-4bit on standard benchmarks, especially on instruction following and reasoning tasks.

SECTION 02

Quantising with llm-awq

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_path = "llama3-8b-awq"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
)

quant_config = {
    "zero_point": True,   # asymmetric quantisation (better quality)
    "q_group_size": 128,  # group size for scale factors
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # GEMM or GEMV kernel (GEMM for batch>1)
}

# Calibration data
calib_data = ["AWQ quantisation uses calibration data to find optimal scales.",
              "The quick brown fox jumps over the lazy dog."]  # use 128+ diverse samples

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized(quant_path, safetensors=True)
tokenizer.save_pretrained(quant_path)
print("AWQ quantisation done!")

SECTION 03

Serving AWQ models

# Load and run inference with AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "llama3-8b-awq",
    fuse_layers=True,    # fuse QKV + FFN layers for 1.5× speedup
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("llama3-8b-awq")

inputs = tokenizer("Explain gradient descent in one sentence:", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Via vLLM (recommended for production)
vllm serve ./llama3-8b-awq --quantization awq --dtype half

# Via TGI
docker run --gpus all -p 8080:80 -v "$PWD/llama3-8b-awq:/model" \
  ghcr.io/huggingface/text-generation-inference \
  --model-id /model --quantize awq
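Once the vLLM server above is running, it exposes an OpenAI-compatible completions API (port 8000 by default). A minimal sketch of the request body; the `model` field must match the path passed to `vllm serve`:

```python
import json

# Body for POST http://localhost:8000/v1/completions on the vLLM server above
payload = {
    "model": "./llama3-8b-awq",   # must match the path given to `vllm serve`
    "prompt": "Explain gradient descent in one sentence:",
    "max_tokens": 100,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(body)
```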

SECTION 04

AWQ vs GPTQ quality comparison

On standard benchmarks (MMLU, HumanEval, MT-Bench), AWQ-4bit consistently outperforms GPTQ-4bit by 1–3 percentage points. The difference is larger for smaller models (7B) than larger ones (70B), likely because larger models have more redundancy and are more robust to quantisation noise.

Both methods significantly outperform naive round-to-nearest (RTN) quantisation, which degrades heavily below 6-bit. At 3-bit, AWQ still maintains reasonable quality while GPTQ-3bit degrades more.
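The RTN baseline that both methods are compared against is easy to reproduce at toy scale. The sketch below (group size and shapes are arbitrary choices, not anything from the paper) quantises a random weight matrix at several bit-widths so the error growth below 6-bit is visible:

```python
import numpy as np

def rtn_quantize(w, n_bits, group_size=128):
    """Naive round-to-nearest: symmetric, per-group scales along the last axis."""
    qmax = 2 ** (n_bits - 1) - 1
    out = np.empty_like(w)
    for start in range(0, w.shape[-1], group_size):
        g = w[..., start:start + group_size]
        scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
        out[..., start:start + group_size] = np.round(g / scale) * scale
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512))
errs = {bits: float(np.mean((w - rtn_quantize(w, bits)) ** 2)) for bits in (8, 6, 4, 3)}
for bits, err in errs.items():
    print(f"{bits}-bit RTN mse {err:.2e}")
```

The mean-squared error roughly quadruples with each bit removed, which is why plain RTN falls apart well before methods that protect salient weights do.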

For production deployment, if a pre-quantised AWQ model exists for your chosen base model, use it. The quality advantage over GPTQ is real and costs nothing.

SECTION 05

AutoAWQ library

AutoAWQ is the main Python library for AWQ quantisation. It covers the full workflow shown above: calibration and quantisation (model.quantize), saving and loading quantised checkpoints (save_quantized / from_quantized), fused-layer inference, and both GEMM and GEMV kernels.

pip install autoawq autoawq-kernels
# Or install from source for latest features:
pip install git+https://github.com/casper-hansen/AutoAWQ.git

SECTION 06

Weight-only vs activation quantisation

AWQ and GPTQ are weight-only quantisation: weights are stored in 4-bit, but activations (the layer outputs computed during inference) remain in fp16. This is the dominant approach for LLMs because: (1) activations have much larger dynamic range than weights and are harder to quantise, (2) computing activations in fp16 uses the GPU's fast fp16 tensor cores, (3) the memory bottleneck in autoregressive generation is weight loading, not activation storage.
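A toy NumPy version of the weight-only path makes the division of labour concrete: only the weights are stored as 4-bit-range codes plus per-group scales, and they are dequantised back to fp16 before an ordinary fp16 matmul with the activations. Real kernels fuse the dequantisation into the matmul; the packing layout here is a simplification.

```python
import numpy as np

def pack_int4(w_fp32, group_size=64):
    """Store weights as int4-range codes plus one fp16 scale per group."""
    qmax = 7
    grouped = w_fp32.reshape(-1, group_size)
    scales = (np.abs(grouped).max(axis=1, keepdims=True) / qmax).astype(np.float16)
    codes = np.round(grouped / scales).astype(np.int8)   # int4 range, int8 container
    return codes, scales

def weight_only_matmul(x_fp16, codes, scales, shape):
    """Dequantise weights to fp16, then an ordinary fp16 matmul: the
    activations are never quantised, which is the 'weight-only' part."""
    w_fp16 = (codes * scales).astype(np.float16).reshape(shape)
    return x_fp16 @ w_fp16.T

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 64)).astype(np.float32)
x = rng.normal(size=(4, 64)).astype(np.float16)
codes, scales = pack_int4(w)
y = weight_only_matmul(x, codes, scales, w.shape)
print(y.dtype, y.shape)
```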

In contrast, W8A8 quantisation (SmoothQuant, LLM.int8()) quantises both weights and activations to int8, using integer tensor cores that can be roughly 2× faster on supported hardware. This is more complex to get right but yields higher throughput on datacenter GPUs with strong int8 support (A100, H100).
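The SmoothQuant side can be illustrated with its central identity: dividing activations by a per-channel scale while multiplying the matching weight rows by the same scale leaves the product unchanged, but shifts outlier magnitude from activations into weights so both sides become int8-friendly. A sketch (alpha = 0.5 is the paper's default migration strength; the shapes are toys):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 8))
x[:, 3] *= 50                      # one outlier activation channel
w = rng.normal(size=(8, 4))

# Migration is mathematically a no-op: (x / s) @ (diag(s) @ w) == x @ w,
# but it flattens the activation outliers.
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_s, w_s = x / s, w * s[:, None]

print("identity holds:", np.allclose(x @ w, x_s @ w_s))
print("activation max:", np.abs(x).max(), "->", np.abs(x_s).max())
```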

SECTION 07

Gotchas

Version compatibility: AutoAWQ versions don't always stay compatible with transformers updates. Pin both versions together.

GEMM vs GEMV kernel selection: If you're running single-request inference (batch=1), use version="GEMV" for 20–30% faster generation. For batched serving, use version="GEMM".
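That rule of thumb is small enough to encode directly; a trivial helper (ours for illustration, not an AutoAWQ API):

```python
def awq_kernel_version(max_batch_size: int) -> str:
    """Pick the AWQ kernel variant to bake in at quantisation time:
    GEMV for single-request decoding, GEMM once requests are batched."""
    return "GEMV" if max_batch_size == 1 else "GEMM"

print(awq_kernel_version(1), awq_kernel_version(8))
```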

Fused layers break some features: fuse_layers=True is incompatible with output_hidden_states and some attention visualization tools. Disable for debugging.

Pre-quantised model quality: The quality of a pre-quantised model depends on the calibration data used. Models quantised with domain-specific data will perform better on that domain. Check the model card for calibration details.

AWQ vs. GPTQ Quantization Comparison

Activation-Aware Weight Quantization (AWQ) is a post-training quantization method that identifies the weights most important for model quality from the magnitude of their corresponding activation values. Rather than keeping those ~1% of salient weights at full precision, it protects them via per-channel scaling and quantizes all weights to 4 bits, achieving quality comparable to or better than GPTQ with faster calibration and efficient inference support.

| Property | AWQ | GPTQ | bitsandbytes NF4 |
|---|---|---|---|
| Algorithm | Activation-guided scaling | Second-order (Hessian) | NF4 per-block |
| Quantization time | ~30 min (7B) | ~2 hours (7B) | On-the-fly |
| Quality at 4-bit | Very good | Very good | Good |
| Inference backend | AutoAWQ, vLLM | AutoGPTQ, ExLlama | Transformers |
| Merging with LoRA | Dequantize first | Dequantize first | QLoRA native |

AWQ's core insight is that not all weights contribute equally to model output quality, and the most salient weights are those that correspond to activations with large magnitudes. Rather than quantizing all weights with equal precision — which wastes precision bits on unimportant weights — AWQ scales these important weight channels before quantization so they map to quantization levels more faithfully. The scaling is calibrated using a small dataset of activation statistics, requiring only minutes of compute compared to GPTQ's more expensive Hessian computation.

AWQ's strong support in the vLLM serving stack is a practical advantage for production deployments. vLLM can serve AWQ-quantized models directly with efficient custom CUDA kernels, achieving throughput close to FP16 inference at a fraction of the memory requirement. For organizations that use vLLM as their serving infrastructure, AWQ is often the preferred quantization format because the complete pipeline from quantization to deployment is well-documented and tested, reducing integration risk compared to less common quantization formats.

AWQ quantization quality is relatively insensitive to the calibration dataset size above approximately 128 examples. The activation statistics used to identify important weight channels stabilize quickly as more calibration examples are processed, meaning that adding more data beyond this threshold produces diminishing quality improvements. A representative 128-example sample from the target domain produces better AWQ models than a large generic calibration set, because domain-specific activation patterns may highlight different important weight channels than general-purpose text activations.

AWQ's efficiency advantages compound for serving workloads with many concurrent users. The smaller weight footprint from 4-bit quantization leaves more VRAM for in-flight requests, so larger batches can be processed simultaneously, increasing throughput beyond the direct speedup from quantized matrix multiplication. A 70B model that requires two 80GB A100 GPUs in FP16 fits on a single A100 after AWQ quantization, roughly halving infrastructure costs for services that don't need the throughput of a multi-GPU deployment.
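The memory arithmetic behind that claim is straightforward. In the sketch below, the overhead factor for quantization scales and zero-points is an assumption for illustration, not a measured figure:

```python
def weight_mem_gib(n_params: float, bits: float, overhead: float = 0.0) -> float:
    """Weight storage only (no KV cache or activations); `overhead` is an
    assumed allowance for quantisation scales/zero-points."""
    return n_params * bits / 8 * (1 + overhead) / 2**30

fp16 = weight_mem_gib(70e9, 16)
awq4 = weight_mem_gib(70e9, 4, overhead=0.06)   # ~6% assumed for group-128 scales
print(f"70B fp16 ~ {fp16:.0f} GiB, AWQ 4-bit ~ {awq4:.0f} GiB")
```

The fp16 weights alone exceed a single 80GB GPU, while the 4-bit weights fit with room left for the KV cache.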

Integrating AWQ with serving frameworks requires checking model format compatibility. vLLM natively supports AWQ through the AutoAWQ quantization format, loading quantized checkpoints directly without any conversion step. TGI and SGLang also support AWQ through different backend implementations. When evaluating AWQ for a specific deployment target, verifying that the intended serving framework has production-tested AWQ support avoids last-minute compatibility issues that can delay production deployments.

AWQ quantization artifacts — cases where specific weights are quantized poorly and produce noticeably degraded outputs on certain inputs — can be diagnosed using per-layer perplexity analysis. Computing perplexity on a validation set with each transformer layer's activations logged separately identifies which layers have the highest post-quantization perplexity increase. Layers showing large perplexity spikes are candidates for mixed-precision treatment — quantizing them to 8-bit rather than 4-bit while keeping other layers at 4-bit. This targeted mixed-precision approach recovers most of the quality loss from aggressive quantization with only a modest increase in model size.
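The selection step at the end can be sketched as a filter over per-layer perplexity measurements. The harness that produces the numbers is not shown here, and the threshold and figures below are hypothetical:

```python
def pick_8bit_layers(layer_ppl_4bit, baseline_ppl, rel_threshold=0.05):
    """Flag layers whose post-quantisation perplexity rises more than
    rel_threshold above the fp16 baseline as candidates for 8-bit treatment."""
    return [name for name, ppl in layer_ppl_4bit.items()
            if (ppl - baseline_ppl) / baseline_ppl > rel_threshold]

# Hypothetical per-layer validation perplexities after 4-bit quantisation
baseline = 6.20
per_layer = {"layers.0": 6.25, "layers.1": 6.90, "layers.2": 6.31, "layers.3": 7.40}
print(pick_8bit_layers(per_layer, baseline))
```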

AWQ checkpoints are portable across devices with the same quantization configuration, simplifying deployment workflows. A model quantized on a high-memory development machine with 80GB A100 GPUs can be deployed directly on inference hardware with smaller VRAM budgets without re-quantization, as long as the serving framework supports the AWQ format. This portability eliminates the need to maintain separate quantization pipelines for development and production environments.