Text Generation Inference — Hugging Face's production-grade serving engine for LLMs. Continuous batching, tensor parallelism, FlashAttention2, speculative decoding, and a clean OpenAI-compatible REST API out of the box.
Text Generation Inference (TGI) is Hugging Face's inference server for LLMs. It handles everything between your model weights and production traffic: continuous batching (processing multiple requests simultaneously without padding waste), token streaming via SSE, speculative decoding for faster generation, FlashAttention2 integration, and multi-GPU tensor parallelism. It serves an OpenAI-compatible HTTP API, so existing OpenAI clients work with minimal changes.
TGI powers the Hugging Face Inference Endpoints product and is widely used for self-hosting open models like Llama, Mistral, and Qwen. It runs models in bfloat16 or float16 and supports GPTQ, AWQ, and EETQ quantisation.
```shell
# Quick start — serve Llama 3 8B on a single GPU
docker run --gpus all -p 8080:80 -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 4096 --max-total-tokens 8192
```
The container auto-downloads the model, loads it to GPU, and starts serving at http://localhost:8080.
TGI is distributed as a Docker image. The latest tag supports most NVIDIA GPUs (Ampere+). For older GPUs or AMD ROCm, there are separate tags.
```shell
# Multi-GPU with tensor parallelism (2 GPUs)
docker run --gpus all -p 8080:80 -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 --max-input-length 4096 --max-total-tokens 8192 \
  --quantize bitsandbytes-nf4
```

```shell
# Test the endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"tgi","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
```
For GPTQ or AWQ models, use --quantize gptq or --quantize awq and ensure you're using a pre-quantised model from the Hub.
Because the server speaks the OpenAI chat completions protocol, the official openai Python client works with TGI directly:
```python
import openai

# Point the standard OpenAI client at the local TGI server.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # TGI does not check the key by default
)

response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Explain backpropagation briefly."}],
    max_tokens=300,
    temperature=0.7,
    stream=True,
)

# Stream tokens to the terminal as they arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
Traditional static batching waits to fill a batch, then runs all requests together to completion. This wastes GPU time: early-finishing sequences idle while long ones continue. TGI uses continuous batching (also called in-flight batching): when any sequence finishes, a new request immediately joins the active batch without waiting.
The KV-cache is managed dynamically using PagedAttention (borrowed from vLLM). Memory is allocated in fixed-size pages rather than per-sequence, allowing efficient sharing and reuse. This means TGI can sustain high throughput even with highly variable sequence lengths — a critical property for production APIs where users send everything from 10-token queries to 4000-token documents.
In practice, continuous batching gives 2–10× higher throughput than static batching for typical API workloads with mixed sequence lengths.
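The throughput gap can be illustrated with a toy scheduler simulation. This is a sketch, not TGI's actual scheduler: request lengths, slot count, and the one-token-per-step model are all invented for illustration.

```python
import random

random.seed(0)

# Each request needs a random number of decode steps (mixed lengths).
requests = [random.randint(10, 200) for _ in range(64)]
MAX_BATCH = 8  # batch slots available

def static_batching(reqs):
    """Fixed batches run to completion: every slot waits for the longest sequence."""
    steps = 0
    for i in range(0, len(reqs), MAX_BATCH):
        steps += max(reqs[i:i + MAX_BATCH])
    return steps

def continuous_batching(reqs):
    """A freed slot is refilled from the queue on the very next step."""
    pending = list(reqs)
    active = [pending.pop() for _ in range(MAX_BATCH)]
    steps = 0
    while active:
        steps += 1
        # One decode step for every active sequence; finished ones leave.
        active = [r - 1 for r in active if r > 1]
        while pending and len(active) < MAX_BATCH:
            active.append(pending.pop())
    return steps

s, c = static_batching(requests), continuous_batching(requests)
print(f"static: {s} steps, continuous: {c} steps, speedup: {s / c:.2f}x")
```

With mixed lengths the continuous scheduler finishes in noticeably fewer steps, because no slot ever idles while work is queued.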
For models that don't fit on a single GPU, TGI supports tensor parallelism via --num-shard N. This splits attention heads and MLP layers across N GPUs, with each GPU holding 1/N of each layer. All-reduce operations synchronise activations between forward passes.
Requirements: all GPUs must be on the same node (NVLink preferred for best bandwidth). The number of GPUs must divide the number of attention heads evenly (e.g. 32 heads → 1, 2, 4, 8, or 16 GPUs).
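The divisibility rule is easy to check up front. A minimal hypothetical helper (not part of TGI); note that for models using grouped-query attention the KV-head count must also divide evenly, which this sketch ignores:

```python
def valid_shard_counts(num_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """Shard counts (GPU counts) that divide the head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]

# Llama 3 8B has 32 attention heads.
print(valid_shard_counts(32))  # [1, 2, 4, 8]
```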
```shell
# Check GPU topology first
nvidia-smi topo -m
```

```shell
# 4-GPU deployment of a 70B model
docker run --gpus '"device=0,1,2,3"' -p 8080:80 -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 4
```
For multi-node deployment (model too large for one node), TGI supports pipeline parallelism but this requires more complex orchestration with a separate launcher.
TGI supports several quantisation schemes, each with different trade-offs:
- bitsandbytes (--quantize bitsandbytes or bitsandbytes-nf4): quantises any model on the fly at load time; no pre-quantised checkpoint needed, but slower inference than native formats.
- GPTQ (--quantize gptq): requires a pre-quantised checkpoint from the Hub (often published with a -GPTQ suffix). Better quality than bnb at the same bit-width.
- AWQ (--quantize awq): activation-aware quantisation; also requires a pre-quantised checkpoint.
- EETQ (--quantize eetq): int8 weight-only quantisation applied on the fly.

Recommendation: use AWQ for the best quality-size trade-off, bnb-nf4 for flexibility when no pre-quantised model exists.
OOM during startup: TGI pre-allocates the KV-cache based on --max-batch-total-tokens. If you set this too high for your GPU, it OOMs at startup rather than at runtime. Start with a conservative value and raise it incrementally; back off as soon as startup fails.
Model gates: Llama and some other models require accepting a Hugging Face license before downloading. Set the HUGGING_FACE_HUB_TOKEN environment variable to a token from an account that has accepted the license.
Version pinning: The latest tag changes frequently. Pin to a specific version (e.g. 2.4.0) in production.
Max tokens asymmetry: --max-input-length and --max-total-tokens must both be set. Total must be > input. A common mistake: setting input=4096 and total=4096, leaving zero room for output.
Tokenizer mismatch: If you use a chat template, TGI applies it automatically from the model's tokenizer config. If you pass already-formatted prompts AND enable the chat completions endpoint, you may get double-formatting.
The launcher exposes dozens of flags; these have the largest impact in practice:
| Parameter | Effect | Recommended Value | Trade-off |
|---|---|---|---|
| --max-batch-total-tokens | Max tokens across all batched requests | 16,000–32,000 | Higher = better GPU util, more memory |
| --max-input-length | Maximum prompt length | 4096 | Higher = more memory per slot |
| --quantize | Weight quantization scheme | bitsandbytes / awq | Lower memory, slight quality loss |
| --num-shard | Tensor parallel degree | Number of GPUs | Required for models larger than single GPU |
| --max-concurrent-requests | Queue depth cap | 128 | Higher = better throughput, more latency |
TGI's continuous batching engine processes new requests as capacity becomes available rather than waiting for a fixed batch to fill up. This reduces the average time-to-first-token for interactive use cases while still achieving high throughput by keeping the GPU utilization high. The key metric to monitor in production TGI deployments is "batch size over time" — consistently low batch sizes indicate underloaded infrastructure; consistently maxed batch sizes with high queue depths indicate the need to scale out.
For multi-GPU deployments, TGI uses tensor parallelism to split model weights across GPUs, with each GPU holding a shard of every layer. This is different from pipeline parallelism, which assigns whole layers to different GPUs. Tensor parallelism requires fast inter-GPU communication (NVLink or NVSwitch) but provides better latency characteristics since all GPUs work on every request simultaneously rather than passing activations through a pipeline of stages.
Speculative decoding in TGI uses a small "draft" model to generate multiple candidate tokens, which are then verified in a single forward pass of the larger target model. For tasks where the draft model's predictions are frequently correct — such as code completion or continuation of factual text — speculative decoding can increase effective throughput by 2–3× with no change in output quality. The draft model must have the same vocabulary and tokenizer as the target model, making same-family model pairs (e.g., Llama 3 8B as draft for Llama 3 70B) the natural choice.
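The accept-and-verify loop can be sketched abstractly. In this toy, a known sequence stands in for the target model's greedy continuation and a fake draft model is right 80% of the time; the acceptance rate, draft window, and token values are all invented:

```python
import random

random.seed(0)

TARGET = list(range(100))  # stand-in for the target model's greedy continuation

def draft(pos: int, k: int) -> list[int]:
    """Toy draft model: proposes the right token 80% of the time, else garbage."""
    out = []
    for i in range(k):
        correct = pos + i < len(TARGET) and random.random() < 0.8
        out.append(TARGET[pos + i] if correct else -1)
    return out

def speculative_decode(k: int = 4) -> tuple[int, int]:
    """Return (tokens generated, target forward passes used)."""
    pos, passes = 0, 0
    while pos < len(TARGET):
        candidates = draft(pos, k)
        passes += 1  # one target pass verifies all k candidates at once
        accepted = 0
        for i, tok in enumerate(candidates):
            if pos + i < len(TARGET) and tok == TARGET[pos + i]:
                accepted += 1
            else:
                break  # first mismatch invalidates the rest of the draft
        # The verifying pass always contributes at least one correct token.
        pos += accepted + 1
    return pos, passes

tokens, passes = speculative_decode()
print(f"{tokens} tokens in {passes} target passes ({tokens / passes:.2f}x per pass)")
```

With an 80% draft hit rate and a window of 4, each target pass yields roughly 3 tokens on average, which is where the 2–3× effective-throughput figure comes from.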
TGI exposes Prometheus metrics at the /metrics endpoint, including request queue depth, token generation rate, and per-request latency percentiles. Integrating these metrics with Grafana dashboards enables capacity planning based on observed traffic patterns. The most important SLO to track for interactive applications is the time-to-first-token (TTFT) at the 99th percentile — this determines the perceived responsiveness of the interface from the user's perspective, and is more sensitive to GPU memory pressure and batch scheduling decisions than median latency.
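Prometheus exposes these metrics in a simple text format, so they are easy to inspect ad hoc. A sketch that parses a sample scrape — the metric names below are illustrative, not guaranteed to match a given TGI release; inspect /metrics on your own deployment for the exact names:

```python
# A fabricated /metrics scrape in Prometheus text exposition format.
SAMPLE = """\
tgi_queue_size 3
tgi_request_duration_bucket{le="0.1"} 120
tgi_request_duration_bucket{le="0.5"} 450
tgi_request_duration_bucket{le="+Inf"} 500
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Map each metric line (name plus labels) to its numeric value."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
print(m["tgi_queue_size"])  # 3.0
# Fraction of requests completing under 0.5 s, from the histogram buckets:
under_half = m['tgi_request_duration_bucket{le="0.5"}'] / m['tgi_request_duration_bucket{le="+Inf"}']
print(under_half)  # 0.9
```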