Serving Engines

TGI (HF)

Text Generation Inference — Hugging Face's production-grade serving engine for LLMs. Continuous batching, tensor parallelism, FlashAttention2, speculative decoding, and a clean OpenAI-compatible REST API out of the box.



SECTION 01

What TGI does

Text Generation Inference (TGI) is Hugging Face's inference server for LLMs. It handles everything between your model weights and production traffic: continuous batching (processing multiple requests simultaneously without padding waste), token streaming via SSE, speculative decoding for faster generation, FlashAttention2 integration, and multi-GPU tensor parallelism. It serves an OpenAI-compatible HTTP API, so existing OpenAI clients work with minimal changes.

TGI powers the Hugging Face Inference Endpoints product and is widely used for self-hosting open models like Llama, Mistral, and Qwen. It runs models in bfloat16 or float16 and supports GPTQ, AWQ, EETQ, and bitsandbytes quantisation.

# Quick start — serve Llama 3 8B on a single GPU
docker run --gpus all -p 8080:80 \
  -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192

The container auto-downloads the model, loads it to GPU, and starts serving at http://localhost:8080.

SECTION 02

Deploying with Docker

TGI is distributed as a Docker image. The latest tag supports most NVIDIA GPUs (Ampere+). For older GPUs or AMD ROCm, there are separate tags.

# Multi-GPU with tensor parallelism (2 GPUs)
docker run --gpus all -p 8080:80 \
  -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 2 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --quantize bitsandbytes-nf4

# Test the endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"tgi","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'

For GPTQ or AWQ models, use --quantize gptq or --quantize awq and ensure you're using a pre-quantised model from the Hub.

SECTION 03

Key configuration parameters

TGI exposes dozens of flags; the most impactful ones are collected in the configuration reference at the end of this page. On the client side no special SDK is needed: because TGI speaks the OpenAI protocol, the official openai Python client works as-is, including streaming.

import openai

# TGI speaks the OpenAI chat-completions protocol, so the official
# client works unchanged; only base_url needs to point at TGI.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # TGI does not check the key by default
)

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; the name is not used for routing
    messages=[{"role": "user", "content": "Explain backpropagation briefly."}],
    max_tokens=300,
    temperature=0.7,
    stream=True,  # receive tokens as they are generated
)

# Print the streamed tokens as they arrive
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

SECTION 04

Continuous batching explained

Traditional static batching waits to fill a batch, then runs all requests together to completion. This wastes GPU time: early-finishing sequences idle while long ones continue. TGI uses continuous batching (also called in-flight batching): when any sequence finishes, a new request immediately joins the active batch without waiting.

The KV-cache is managed dynamically using PagedAttention (borrowed from vLLM). Memory is allocated in fixed-size pages rather than per-sequence, allowing efficient sharing and reuse. This means TGI can sustain high throughput even with highly variable sequence lengths — a critical property for production APIs where users send everything from 10-token queries to 4000-token documents.
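The page-based bookkeeping can be sketched as a toy allocator (this is an illustration of the idea, not TGI's actual implementation; the 16-token page size is made up):

```python
PAGE_SIZE = 16  # tokens per KV page (illustrative, not TGI's real value)

class PagedKVAllocator:
    """Toy model of paged KV-cache bookkeeping: memory is a pool of
    fixed-size pages, and each sequence owns a page table rather than
    one contiguous, worst-case-sized region."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # sequence id -> list of page ids

    def append_token(self, seq_id, position):
        # A new page is only needed every PAGE_SIZE tokens, so a short
        # sequence never reserves memory for a worst-case length.
        table = self.page_tables.setdefault(seq_id, [])
        if position % PAGE_SIZE == 0:
            if not self.free_pages:
                raise MemoryError("KV-cache pool exhausted")
            table.append(self.free_pages.pop())

    def release(self, seq_id):
        # Freed pages are immediately reusable by other sequences.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_pages=8)
for pos in range(40):  # a 40-token sequence needs ceil(40/16) = 3 pages
    alloc.append_token("req-1", pos)
print(len(alloc.page_tables["req-1"]))  # 3
```

When `release` runs, the pages go straight back to the pool, which is what lets a newly admitted request reuse memory the moment another sequence finishes.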

In practice, continuous batching gives 2–10× higher throughput than static batching for typical API workloads with mixed sequence lengths.
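The scheduling gap can be illustrated with a toy simulation, where one engine step decodes one token for every active sequence (a sketch: real engines also account for prefill cost and memory limits, which this ignores):

```python
def static_batching_steps(lengths, batch_size):
    # Static batching: the whole batch runs until its longest member
    # finishes, so short sequences idle inside long batches.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    # Continuous batching: a finished sequence's slot is refilled on
    # the very next engine step.
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while len(active) < batch_size and pending:
            active.append(pending.pop(0))
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Mixed workload: short chats interleaved with long generations
lengths = [10, 200, 15, 200, 12, 200, 18, 200]
print(static_batching_steps(lengths, batch_size=4))      # 400
print(continuous_batching_steps(lengths, batch_size=4))  # 240
```

Even in this tiny example the continuous scheduler needs ~40% fewer engine steps; the gap widens as the mix of sequence lengths gets more skewed.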

SECTION 05

Tensor parallelism

For models that don't fit on a single GPU, TGI supports tensor parallelism via --num-shard N. This splits attention heads and MLP layers across N GPUs, with each GPU holding 1/N of each layer. All-reduce operations synchronise activations after the attention and MLP blocks within each forward pass.

Requirements: all GPUs must be on the same node (NVLink preferred for best bandwidth), and the number of GPUs must divide the number of attention heads evenly (e.g. 32 heads → 1, 2, 4, 8, 16, or 32 GPUs).

# Check GPU topology first
nvidia-smi topo -m

# 4-GPU deployment of 70B model
docker run --gpus '"device=0,1,2,3"' -p 8080:80 \
  -v $HF_HOME:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --num-shard 4
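A back-of-envelope check of shard counts and per-GPU weight memory (parameter counts are nominal; real usage adds KV-cache and activation overhead on top):

```python
def valid_shard_counts(num_attention_heads):
    # --num-shard must divide the attention head count evenly
    return [n for n in range(1, num_attention_heads + 1)
            if num_attention_heads % n == 0]

def weight_gib_per_gpu(params_billions, bytes_per_param, num_shard):
    # Tensor parallelism shards every layer, so weights divide ~evenly;
    # KV-cache and activations come on top of this figure.
    return params_billions * 1e9 * bytes_per_param / num_shard / 2**30

print(valid_shard_counts(32))                    # [1, 2, 4, 8, 16, 32]
print(round(weight_gib_per_gpu(70, 2, 4), 1))    # float16 70B on 4 GPUs: ~32.6 GiB each
print(round(weight_gib_per_gpu(70, 0.5, 2), 1))  # NF4 70B on 2 GPUs: ~16.3 GiB each
```

This is why the 2-GPU 70B example above needs --quantize bitsandbytes-nf4: at float16, two shards would hold ~65 GiB of weights each before any KV-cache.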

For multi-node deployment (model too large for one node), TGI supports pipeline parallelism, but this requires more complex orchestration with a separate launcher.

SECTION 06

Quantisation support

TGI supports several quantisation schemes, each with different trade-offs. GPTQ and AWQ load pre-quantised checkpoints from the Hub; EETQ and bitsandbytes (including the NF4 variant) quantise a stock model's weights at load time.

Recommendation: use AWQ for the best quality-size trade-off, and bitsandbytes-nf4 for flexibility when no pre-quantised model exists.
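The memory side of the trade-off is simple arithmetic over bytes per parameter (nominal figures assuming 8-bit EETQ and 4-bit GPTQ/AWQ/NF4; quantised formats carry small per-group scale overhead that this sketch ignores):

```python
# Nominal bytes per weight for each scheme (assumed typical bit widths)
BYTES_PER_PARAM = {
    "float16/bfloat16": 2.0,
    "eetq-int8": 1.0,         # 8-bit weight-only, applied at load time
    "gptq/awq-4bit": 0.5,     # 4-bit, pre-quantised checkpoints
    "bitsandbytes-nf4": 0.5,  # 4-bit, applied at load time
}

def weight_footprint_gib(params_billions, scheme):
    return params_billions * 1e9 * BYTES_PER_PARAM[scheme] / 2**30

for scheme in BYTES_PER_PARAM:
    print(f"Llama 3 8B, {scheme}: {weight_footprint_gib(8, scheme):.1f} GiB")
```

At float16 an 8B model's weights alone are ~14.9 GiB, which is why 4-bit schemes are what make such models comfortable on 24 GB consumer GPUs with room left for KV-cache.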

SECTION 07

Gotchas

OOM during startup: TGI pre-allocates the KV-cache based on --max-batch-total-tokens. If you set this too high for your GPU, it OOMs at startup rather than at runtime. Start with a conservative value and raise it until startup fails, then back off.

Model gates: Llama and some other models require accepting a license on the Hugging Face Hub before downloading. Set the HUGGING_FACE_HUB_TOKEN env var to a token from an account that has accepted the license.

Version pinning: The latest tag changes frequently. Pin to a specific version (e.g. 2.4.0) in production.

Max tokens asymmetry: --max-input-length and --max-total-tokens must both be set. Total must be > input. A common mistake: setting input=4096 and total=4096, leaving zero room for output.
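The relationship between the two flags is worth making explicit (a hypothetical helper, not part of TGI itself):

```python
def output_token_budget(max_input_length, max_total_tokens):
    # --max-total-tokens caps prompt + generation combined, so the
    # difference is the most a single request can ever generate.
    if max_total_tokens <= max_input_length:
        raise ValueError(
            "max-total-tokens must exceed max-input-length, "
            "otherwise no output tokens can be generated")
    return max_total_tokens - max_input_length

print(output_token_budget(4096, 8192))  # 4096 tokens left for output
# output_token_budget(4096, 4096) raises ValueError: the common mistake above
```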

Tokenizer mismatch: If you use a chat template, TGI applies it automatically from the model's tokenizer config. If you pass already-formatted prompts AND enable the chat completions endpoint, you may get double-formatting.

TGI Deployment Configuration Reference

The table below summarises the deployment flags with the biggest impact on throughput, latency, and memory.

Parameter                 | Effect                                  | Recommended value  | Trade-off
--max-batch-total-tokens  | Max tokens across all batched requests  | 16,000–32,000      | Higher = better GPU utilisation, more memory
--max-input-length        | Maximum prompt length                   | 4096               | Higher = more memory per slot
--quantize                | Weight quantisation scheme              | bitsandbytes / awq | Lower memory, slight quality loss
--num-shard               | Tensor parallel degree                  | number of GPUs     | Required for models larger than a single GPU
--max-concurrent-requests | Queue depth cap                         | 128                | Higher = better throughput, more latency

TGI's continuous batching engine processes new requests as capacity becomes available rather than waiting for a fixed batch to fill up. This reduces the average time-to-first-token for interactive use cases while still achieving high throughput by keeping GPU utilisation high. The key metric to monitor in production TGI deployments is "batch size over time": consistently low batch sizes indicate underloaded infrastructure; consistently maxed batch sizes with high queue depths indicate the need to scale out.
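A minimal sketch of consuming such a scrape in Prometheus text format (the metric names and values here are illustrative samples, not guaranteed to match your TGI version's actual metric names):

```python
# Illustrative sample of a Prometheus text-format scrape
SAMPLE_SCRAPE = """\
# HELP tgi_batch_current_size Current decode batch size
# TYPE tgi_batch_current_size gauge
tgi_batch_current_size 12
# HELP tgi_queue_size Requests waiting for a slot
# TYPE tgi_queue_size gauge
tgi_queue_size 3
"""

def parse_gauges(scrape_text):
    # Prometheus text format: "name value" lines; '#' lines are metadata
    gauges = {}
    for line in scrape_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        gauges[name] = float(value)
    return gauges

gauges = parse_gauges(SAMPLE_SCRAPE)
# Maxed batches plus a deep queue is the scale-out signal described above
print(gauges["tgi_batch_current_size"], gauges["tgi_queue_size"])
```

In practice you would point a Prometheus server at the /metrics endpoint rather than parsing by hand; the sketch just shows what the scrape contains.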

For multi-GPU deployments, TGI uses tensor parallelism to split model weights across GPUs, with each GPU holding a shard of every layer. This is different from pipeline parallelism, which assigns whole layers to different GPUs. Tensor parallelism requires fast inter-GPU communication (NVLink or NVSwitch) but provides better latency characteristics since all GPUs work on every request simultaneously rather than passing activations through a pipeline of stages.

Speculative decoding in TGI uses a small "draft" model to generate multiple candidate tokens, which are then verified in a single forward pass of the larger target model. For tasks where the draft model's predictions are frequently correct — such as code completion or continuation of factual text — speculative decoding can increase effective throughput by 2–3× with no change in output quality. The draft model must have the same vocabulary and tokenizer as the target model, making same-family model pairs (e.g., Llama 3 8B as draft for Llama 3 70B) the natural choice.
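The expected yield per target forward pass follows directly from the acceptance rate (a sketch assuming independent per-token acceptance, as in the standard speculative decoding analysis, and ignoring the draft model's own cost):

```python
def expected_tokens_per_target_pass(k, p_accept):
    # The draft proposes k tokens; the target verifies all of them in
    # one forward pass. With independent per-token acceptance
    # probability p, the expected yield is 1 + p + p^2 + ... + p^k:
    # the accepted prefix plus one token the target itself supplies.
    return sum(p_accept ** i for i in range(k + 1))

# A well-matched draft (80% per-token acceptance) with 4 speculative tokens
print(round(expected_tokens_per_target_pass(4, 0.8), 2))  # 3.36
```

That ~3.4 tokens per target pass, versus exactly 1 without speculation, is where throughput gains in the 2-3× range come from once draft-model overhead is subtracted.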

TGI exposes Prometheus metrics at the /metrics endpoint, including request queue depth, token generation rate, and per-request latency percentiles. Integrating these metrics with Grafana dashboards enables capacity planning based on observed traffic patterns. The most important SLO to track for interactive applications is the time-to-first-token (TTFT) at the 99th percentile — this determines the perceived responsiveness of the interface from the user's perspective, and is more sensitive to GPU memory pressure and batch scheduling decisions than median latency.