GGUF is the standard file format for quantised LLMs; llama.cpp is the C++ inference engine that runs them. Together they enable 4-bit quantised inference on CPUs and consumer GPUs — the foundation for local AI.
llama.cpp is a C++ library by Georgi Gerganov that runs LLM inference with minimal dependencies — no Python runtime, no CUDA required (though it can use GPU backends such as CUDA and Metal when available). It was originally a proof-of-concept to run LLaMA on a MacBook. It became the de facto standard for local LLM inference because it's fast, memory-efficient, and runs everywhere: Mac, Windows, Linux, Raspberry Pi, even Android.
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It stores the model weights plus all metadata needed for inference (tokeniser, architecture params, quantisation scheme) in a single binary file. Before GGUF, the ecosystem used GGML — GGUF replaced it in August 2023 with better extensibility and metadata support.
Quantisation reduces model size by storing weights in lower precision: instead of 16 or 32 bits per weight, GGUF uses 4 or 8 bits. A 7B model that's 14GB in FP16 becomes 4GB in Q4. You lose some quality — typically 1–5% on benchmarks — but gain the ability to run on consumer hardware.
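The size arithmetic is straightforward: file size ≈ parameter count × bits per weight ÷ 8. A quick sketch (approximate — real GGUF files add a small amount of metadata and mixed-precision overhead on top of this):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: params x bits / 8, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at different precisions
for name, bpw in [("FP16", 16), ("Q8_0", 8), ("Q4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bpw):.1f} GB")
# FP16: 14.0 GB, Q8_0: 7.0 GB, Q4: 3.5 GB -- matching the 14GB -> ~4GB figure above
```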
GGUF quantisation levels follow the naming convention Q[bits]_[type]:
Q2_K: 2 bits per weight. Extreme compression. Noticeable quality loss. Only useful when memory is the critical constraint.
Q4_K_M: 4 bits per weight, K-quant method, medium size. The sweet spot for most use cases: roughly a quarter of the FP16 file size, with benchmarks on GGML-family quants showing ~2–3% accuracy loss on standard tasks. Recommended default.
Q4_K_S: 4-bit small. Slightly smaller file than Q4_K_M with marginally lower quality.
Q5_K_M: 5 bits. Better quality than Q4, fits in more memory than Q6/Q8. Good choice when you have 6–8GB VRAM.
Q6_K: 6 bits. Very close to FP16 quality. Recommended when memory allows.
Q8_0: 8 bits per weight. Negligible quality loss vs FP16. Twice the size of Q4. Choose this for maximum quality with local inference.
F16: Full 16-bit precision. Same quality as the source model. 2× memory of Q8. Use only with large VRAM setups.
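The levels above can be operationalised as a small helper that picks the highest-quality quantisation fitting a memory budget. This is a sketch using approximate effective bits-per-weight figures commonly reported for these formats (exact values vary slightly by model architecture):

```python
# Approximate effective bits per weight (K-quants mix precisions internally),
# ordered best quality first
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
          ("Q4_K_M", 4.8), ("Q2_K", 2.6)]

def pick_quant(n_params: float, mem_budget_gb: float, headroom_gb: float = 1.0):
    """Return the highest-quality quant whose weights fit within the budget."""
    for name, bpw in QUANTS:
        size_gb = n_params * bpw / 8 / 1e9
        if size_gb + headroom_gb <= mem_budget_gb:
            return name, size_gb
    return None  # nothing fits -- try a smaller model

print(pick_quant(8e9, 8.0))  # 8B model, 8GB memory budget
```

The 1GB headroom default accounts for the KV cache and activation buffers discussed later; tune it for your context size.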
# Install llama.cpp command line tool
# macOS: brew install llama.cpp
# Linux: build from source or use pre-built binaries from GitHub releases
# Download a GGUF model from HuggingFace
# huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
#   --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" --local-dir ./models
# Basic inference
# llama-cli -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
#   -p "Write a Python hello world" --temp 0.7 -n 256
# Start a server (OpenAI-compatible REST API)
# llama-server -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
#   --port 8080 --ctx-size 4096 --n-gpu-layers 35
# The server is then accessible as an OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain backpropagation briefly."}],
)
print(response.choices[0].message.content)
pip install llama-cpp-python # CPU only
# For GPU: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
from llama_cpp import Llama
# Load model
llm = Llama(
    model_path="./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,        # Context window size
    n_gpu_layers=35,   # Number of layers to offload to GPU (0 = CPU only)
    n_threads=8,       # CPU threads for non-GPU layers
    verbose=False,
)
# Chat completion
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful Python developer."},
        {"role": "user", "content": "Write a function to parse JSON safely."},
    ],
    temperature=0.1,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
# Streaming
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "List 5 Python best practices."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
# Embeddings (note: requires loading the model with Llama(..., embedding=True))
embedding = llm.create_embedding("The quick brown fox")["data"][0]["embedding"]
print(f"Embedding dimension: {len(embedding)}")
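The embedding comes back as a plain Python list, so downstream similarity search needs nothing beyond standard-library maths. A minimal cosine-similarity helper (shown on toy vectors here; in practice you would pass two create_embedding outputs):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vectors standing in for create_embedding outputs
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```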
# Convert any HuggingFace model to GGUF (requires llama.cpp source)
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && pip install -r requirements.txt
# Step 1: Download model from HuggingFace
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="./hf_model",
    token="hf_...",  # HF token required for gated models
)
# Step 2: Convert to F16 GGUF
import subprocess
subprocess.run([
    "python3", "llama.cpp/convert_hf_to_gguf.py",
    "./hf_model",
    "--outfile", "./models/llama3.1-8b-f16.gguf",
    "--outtype", "f16",
], check=True)  # check=True raises if conversion fails
# Step 3: Quantise to Q4_K_M
subprocess.run([
    "llama.cpp/llama-quantize",
    "./models/llama3.1-8b-f16.gguf",
    "./models/llama3.1-8b-q4km.gguf",
    "Q4_K_M",
], check=True)
# Check file sizes
import os
for f in ["llama3.1-8b-f16.gguf", "llama3.1-8b-q4km.gguf"]:
    size = os.path.getsize(f"./models/{f}") / 1e9
    print(f"{f}: {size:.1f} GB")
# llama3.1-8b-f16.gguf: 16.1 GB
# llama3.1-8b-q4km.gguf: 4.7 GB
The golden rule: the model must fit in memory (VRAM + RAM). Layers not in VRAM run on CPU — dramatically slower.
CPU-only (no GPU): Q4_K_M models generate 5–15 tokens/second on modern CPUs. Acceptable for batch processing; too slow for interactive chat. Apple Silicon Macs are a special case: M-series chips have unified memory shared between CPU and GPU, giving 30–80 tok/s even "CPU-only".
Consumer GPUs (8–12GB VRAM): RTX 3060/4060/3080 12GB. Fits 7B models at Q4_K_M with full GPU offload (100+ tok/s). 13B at Q4_K_M needs 8GB (tight). Use n_gpu_layers=-1 for full offload.
Professional GPUs (24GB VRAM): RTX 4090, A5000. Fits 13B at Q8, 34B at Q4. Best cost-efficiency for local deployment.
Server GPUs (40–80GB VRAM): A100, H100. Runs 70B at Q4_K_M or full F16 for 30B models. Multi-GPU setups enable larger models via tensor parallelism.
n_gpu_layers needs tuning per model and GPU. Setting n_gpu_layers=-1 (offload everything) sounds ideal but will OOM if the model doesn't fit. Use n_gpu_layers=0 to test CPU-only, then increase until you hit VRAM limits. Rule of thumb: each layer of a 7B model uses roughly 120MB of VRAM in Q4.
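That rule of thumb can be turned into a starting-point estimate. A sketch only — the ~120MB/layer figure applies to a 7B model at Q4 and varies with architecture and quantisation, so always verify against actual VRAM usage:

```python
def estimate_gpu_layers(vram_gb: float, n_layers: int = 32,
                        mb_per_layer: float = 120.0,
                        headroom_gb: float = 1.0) -> int:
    """Rough starting value for n_gpu_layers, leaving headroom for the KV cache."""
    usable_mb = (vram_gb - headroom_gb) * 1024
    return max(0, min(n_layers, int(usable_mb // mb_per_layer)))

print(estimate_gpu_layers(8.0))  # 8GB card: all 32 layers of a 7B Q4 model fit
print(estimate_gpu_layers(4.0))  # 4GB card: partial offload only
```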
Context size multiplies memory usage. n_ctx=32768 (32K context) uses significantly more memory than n_ctx=4096 — the KV cache scales with context length × batch size. For memory-constrained hardware, set n_ctx to only what you need.
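The KV cache cost can be estimated from the model architecture using the standard formula: two tensors (K and V) per layer, each context × KV-heads × head-dim elements. A sketch with example numbers for a Llama-3.1-8B-style architecture (32 layers, grouped-query attention with 8 KV heads of dimension 128, FP16 cache):

```python
def kv_cache_gb(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2, batch: int = 1) -> float:
    """KV cache size: 2 (K and V) x layers x context x kv_heads x head_dim x bytes."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem * batch / 1e9

# Llama-3.1-8B-style architecture
print(f"{kv_cache_gb(4096, 32, 8, 128):.2f} GB")   # 0.54 GB at 4K context
print(f"{kv_cache_gb(32768, 32, 8, 128):.2f} GB")  # 4.29 GB at 32K context
```

The 8× growth from 4K to 32K context is exactly the linear scaling described above, and it comes on top of the ~4GB of Q4 weights.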
Model quality varies by quantisation and task. Q4_K_M is excellent for coding and factual tasks but can degrade noticeably on nuanced reasoning compared to Q8. Always benchmark the quantised model on a representative sample of your actual use cases before committing to a quantisation level in production.
As a single-file format, GGUF packages model weights, tokenizer, and metadata into one portable file that llama.cpp and its ecosystem can load and run on CPU, GPU, or mixed CPU+GPU configurations with no additional dependencies beyond the llama.cpp runtime.
| Quantization | Bits/Weight | File Size (7B) | Quality vs FP16 | Best For |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~2.8GB | Noticeable degradation | Extreme memory limits |
| Q4_K_M | ~4.8 | ~4.1GB | Good | Recommended default |
| Q5_K_M | ~5.7 | ~4.8GB | Very good | Quality-focused |
| Q6_K | ~6.6 | ~5.5GB | Near lossless | Max quality local |
| Q8_0 | 8 | ~7.0GB | Near identical to FP16 | Reference quality |
The K-quants (Q4_K_M, Q5_K_M, Q6_K) use a more sophisticated quantization scheme that applies different bit widths to different weight groups based on their sensitivity to precision loss. Attention projection weights, which are most sensitive, receive higher precision allocation; feed-forward weights, which tolerate more compression, receive lower precision. This mixed-precision approach within a single quantization level produces better quality-per-bit than uniform quantization schemes like Q4_0.
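The effective bits-per-weight of a K-quant can be understood as a parameter-weighted average over tensor groups. An illustrative sketch — the bit allocations and group proportions below are made-up round numbers for illustration, not the actual Q4_K_M layout:

```python
def effective_bpw(groups: list) -> float:
    """Parameter-weighted average bits per weight over (fraction, bits) groups."""
    return sum(frac * bits for frac, bits in groups)

# Hypothetical split: 30% of weights (attention) at 6 bits, 70% (FFN) at 4 bits
print(effective_bpw([(0.3, 6.0), (0.7, 4.0)]))  # 4.6 -- between the two extremes
```

This is why the table above lists ~4.8 bits for "4-bit" Q4_K_M: the sensitive tensor groups pull the average above the nominal bit width.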
GPU offloading with llama.cpp is controlled by the n_gpu_layers parameter, which specifies how many transformer layers to load onto the GPU. Setting n_gpu_layers=-1 offloads all layers to GPU (fastest, requires full model VRAM); partial offloading (e.g., n_gpu_layers=20) runs some layers on GPU and the rest on CPU. The optimal split depends on available VRAM — keeping as many layers on GPU as possible while leaving ~1GB of VRAM headroom for the KV cache and activation buffers produces the best latency/memory trade-off.
Context length and KV cache memory scale linearly with the configured n_ctx parameter in llama.cpp. A 7B model with Q4_K_M quantization uses about 4GB for weights; adding an 8K context window for a batch of 4 concurrent requests adds another 1–2GB of KV cache memory. Planning context allocation is essential for avoiding OOM errors during peak load — setting n_ctx too generously for a serving scenario with multiple concurrent users can exhaust available VRAM and cause inference failures on requests that arrive while long contexts are in flight.
llama.cpp's server mode provides an OpenAI-compatible REST API that allows any application built against the OpenAI SDK to run against a local GGUF model with only a base URL change. The server supports parallel slot processing, where multiple requests share the GPU compute in a round-robin scheduling pattern. For CPU-only deployments on systems with many cores, the --threads parameter should be set to the number of physical cores (not hyperthreads), as hyperthreading provides minimal benefit for the matrix multiplication operations that dominate LLM inference workloads.
Batch inference with llama.cpp processes multiple prompts in a single forward pass when the --parallel flag is set. This is particularly useful for offline batch evaluation, dataset annotation, or generating multiple completions for the same prompt for self-consistency voting. The batch size is bounded by available KV cache memory — each concurrent sequence requires its own KV cache slots — so larger batches require either more memory or a smaller per-sequence context allocation.