Constrained Decoding

Techniques that guide token-level generation to produce structured outputs (JSON, SQL, regex patterns) with 100% format compliance — eliminating post-processing parse errors.

Format compliance
~100%
Latency overhead
<5%
Key libraries
Outlines, Guidance, llama.cpp

SECTION 01

The Parsing Problem

Prompting a model to output JSON works ~90% of the time. The remaining 10% produces invalid JSON, truncated objects, or prose with JSON embedded. At scale, 10% failure means constant retries, error handlers, and dropped requests. Constrained decoding eliminates format errors entirely by controlling which tokens are valid at each generation step.

SECTION 02

How Constrained Decoding Works

At each decoding step, the model produces logits over its entire vocabulary. Constrained decoding applies a mask that sets the logits of invalid tokens to negative infinity, so their probability after softmax is zero and only valid tokens can be sampled. The mask is derived from a finite state machine (FSM) or grammar that tracks the current position in the output structure. No resampling is needed: every output is valid by construction.
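The masking step can be shown with a minimal, library-free sketch over a toy four-token vocabulary (token names and scores are illustrative; real systems mask tens of thousands of logits per step):

```python
def masked_sample_greedy(logits: dict[str, float], valid: set[str]) -> str:
    """Pick the highest-probability token after masking invalid tokens to -inf."""
    masked = {tok: (lp if tok in valid else float("-inf")) for tok, lp in logits.items()}
    # Softmax is monotonic, so the greedy choice is the argmax over masked logits.
    return max(masked, key=masked.get)

# Toy vocabulary: the model "prefers" prose, but only "{" is valid
# at the start of a JSON object, so the mask forces it.
logits = {"Sure": 5.0, ",": 3.1, "{": 1.2, "the": 0.8}
valid_at_json_start = {"{"}
print(masked_sample_greedy(logits, valid_at_json_start))  # {
```

With an empty mask (all tokens valid) the same function returns the model's unconstrained favorite, which makes the effect of the constraint easy to see in tests.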

SECTION 03

Grammar-Based Constraints

Context-free grammars (CFGs) define the full syntax of the output language. llama.cpp's GBNF grammar format lets you specify any output structure. Valid tokens at each step are those consistent with some parse of the grammar from the current state.

# GBNF grammar for a simple JSON object
GRAMMAR = '''
root   ::= object
object ::= "{" ws members ws "}"
members ::= pair ("," ws pair)*
pair   ::= string ":" ws value
value  ::= string | number | "true" | "false" | "null"
string ::= "\"" [^\"]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
'''
# With llama.cpp Python bindings:
# llm(prompt, grammar=LlamaGrammar.from_string(GRAMMAR))

SECTION 04

JSON Schema Enforcement

The most common use case: enforce a Pydantic model or JSON schema. Libraries like Outlines and Guidance make this a one-liner.

from pydantic import BaseModel
from typing import Literal
import outlines
class ProductReview(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: int  # 1-5
    summary: str
    issues: list[str]
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, ProductReview)
review = generator(
    "Review: The delivery was fast but the product broke after a week. Rate it."
)
# review is guaranteed to be a valid ProductReview instance
print(type(review).__name__)  # ProductReview
print(review.sentiment)       # e.g. "negative"

SECTION 05

Regex & CFG Patterns

Regex constraints enforce patterns like phone numbers, dates, or structured codes. Useful when a full schema is overkill.

import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
# Force output to match an ISO date
date_generator = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
date = date_generator("When did the French Revolution begin? Answer with a date: ")
print(date)  # e.g. "1789-07-14"
# Force choice from a fixed set
choice_gen = outlines.generate.choice(model, ["Yes", "No", "Maybe"])
answer = choice_gen("Is Python a compiled language?")
print(answer)  # "No"

SECTION 06

Production Libraries

Outlines: best for Hugging Face models; supports JSON schema, regex, and CFG constraints; very active community. Guidance: Microsoft library that integrates with Azure OpenAI and local models. llama.cpp GBNF: native to llama.cpp, works with any GGUF model, and grammar files are portable. OpenAI response_format: json_object mode relies on training rather than hard constraints, while json_schema mode (strict structured outputs) enforces the schema server-side for near-100% compliance.
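As an illustration of the json_schema mode mentioned above, the sketch below only builds the request payload rather than calling the API. The field names (`response_format`, `json_schema`, `strict`) follow OpenAI's published structured-outputs API, but treat them as assumptions and verify against the current docs:

```python
# Hypothetical request payload for OpenAI's structured-output
# ("json_schema") response_format. No network call is made here;
# this only shows the shape of the request.
review_schema = {
    "name": "product_review",
    "strict": True,  # strict mode enables server-side schema enforcement
    "schema": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "score": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["sentiment", "score", "summary"],
        "additionalProperties": False,
    },
}

request_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Review: fast delivery, broke in a week."}],
    "response_format": {"type": "json_schema", "json_schema": review_schema},
}
print(request_body["response_format"]["type"])  # json_schema
```

Passing this dict to the chat completions endpoint (with an SDK or raw HTTP) should yield a response whose content parses against the schema.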

SECTION 07

Grammar-Based & Regex Constraints

Constrained decoding forces the model to generate output conforming to a schema: JSON, SQL, regex, or a context-free grammar. This is essential for downstream processing — you can't parse unpredictable formats. Implementation strategies vary: (1) Vocabulary masking: pre-filter the next token to only allow valid continuations (fast but limited), (2) FSM-guided decoding: represent the constraint as a finite state machine and only allow transitions to valid next states (more expressive), (3) Post-hoc validation: generate freely then validate and regenerate if invalid (flexible but slow). Most production systems combine approaches: use masking for speed, FSM for JSON/SQL, and post-hoc validation as a safety net.

Constraint Type          Expressiveness   Runtime Overhead   Use Case
Regex                    Limited          Low                Format enforcement
Context-free grammar     High             Medium             Structured data (JSON, SQL)
Vocabulary mask          Very limited     Very low           Class selection
Beam search + scoring    Medium           High               Quality over speed
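The FSM-guided strategy can be illustrated with a hand-rolled state machine for a tiny pattern. This toy tracks character-level states for the pattern `yes|no`; a real system would compile the FSM from a regex or grammar and operate on tokenizer tokens, not characters:

```python
# Toy FSM for the pattern "yes" | "no": each state maps an allowed
# character to the next state. Anything not in the map is masked out.
FSM = {
    "start": {"y": "y", "n": "n"},
    "y": {"e": "ye"},
    "ye": {"s": "done"},
    "n": {"o": "done"},
    "done": {},
}

def valid_next(state: str) -> set[str]:
    """Characters the decoder is allowed to emit from this state."""
    return set(FSM[state])

def step(state: str, char: str) -> str:
    """Advance the FSM; invalid characters would have been masked already."""
    if char not in FSM[state]:
        raise ValueError(f"{char!r} not allowed in state {state!r}")
    return FSM[state][char]

state = "start"
for ch in "no":
    assert ch in valid_next(state)  # the mask zeros out everything else
    state = step(state, ch)
print(state)  # done
```

Reaching the `done` state with an empty valid set is how the decoder knows the constrained span is complete.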
import json
from openai import OpenAI

client = OpenAI()

def constrained_generation(prompt: str, constraint_type: str = "json",
                           max_retries: int = 2) -> str:
    """Generate text, validating JSON post hoc (hosted APIs expose no logit mask)."""
    for _ in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500,
            temperature=0.7,
            # Server-side JSON guidance; client-side token masking is unavailable.
            response_format={"type": "json_object"},
        )
        text = response.choices[0].message.content
        if constraint_type != "json":
            return text
        try:
            json.loads(text)  # post-hoc validation as the safety net
            return text
        except json.JSONDecodeError:
            continue  # regenerate on invalid output
    raise ValueError("no valid JSON after retries")

SECTION 08

Performance & Trade-Offs

Constrained decoding trades quality for validity. A model forced to generate valid JSON may produce less fluent text (heavy escaping, awkward phrasing). The performance overhead depends on constraint complexity: simple vocabulary masks add under 1% latency, while complex grammars can slow generation by 2-5x. Empirically, sampling within constraints (temperature > 0) produces more diverse valid outputs than greedy decoding. For latency-sensitive applications, pre-compute valid outputs (e.g., valid SQL queries from a schema) and rank them rather than generating. For highest quality, allow unconstrained generation with a learned validator that scores outputs.
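The pre-compute-and-rank pattern can be sketched as follows. The candidate queries and the keyword-overlap scorer are stand-ins: a production system would generate candidates from the actual schema and rank with a model-based scorer:

```python
# Sketch of "pre-compute and rank": candidate SQL queries are built
# offline; at request time we only rank them, never generate tokens.
CANDIDATES = [
    "SELECT name FROM users WHERE active = 1",
    "SELECT COUNT(*) FROM orders",
    "SELECT name, email FROM users",
]

def score(query: str, request: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(request.lower().split()) & set(query.lower().split()))

def best_candidate(request: str) -> str:
    """Every candidate is valid by construction, so ranking is the whole job."""
    return max(CANDIDATES, key=lambda q: score(q, request))

print(best_candidate("how many orders do we have"))
# SELECT COUNT(*) FROM orders
```

Because every candidate is pre-validated, the latency cost is a single ranking pass with no generation or retry loop.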

Implementing constrained generation efficiently is a systems problem. Naive approaches (generate, validate, retry) are slow: if 30% of unconstrained outputs are invalid, you need 1.43x as many tokens and 43% more latency on average. Smart approaches integrate constraints into the generation process. Token masking (filtering invalid tokens before sampling) is fast but limited to local constraints (e.g., if we just emitted '{', the next token must start a string key or close the object with '}'). FSM-guided decoding (representing constraints as a finite state machine) is more expressive and still efficient for most practical cases. For constraints beyond an FSM's reach (context-sensitive or semantic rules), you need more sophisticated methods: constraint-aware fine-tuning (teach the model to follow constraints) or post-hoc rewriting (if the output violates constraints, edit it to fix the violations).
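The retry-cost arithmetic above follows from the geometric distribution: if each attempt is invalid with probability p, the expected number of attempts is 1 / (1 - p).

```python
# Expected cost of generate-validate-retry when a fraction p of
# unconstrained outputs is invalid: E[attempts] = 1 / (1 - p).
def expected_attempts(p_invalid: float) -> float:
    assert 0 <= p_invalid < 1
    return 1.0 / (1.0 - p_invalid)

overhead = expected_attempts(0.30)
print(f"{overhead:.2f}x tokens, {100 * (overhead - 1):.0f}% extra latency")
# 1.43x tokens, 43% extra latency
```

This assumes validity is independent across retries; in practice a model that failed once often fails again on the same prompt, so the real overhead can be worse.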

Quality degradation under constraints is real. Models trained on unconstrained data sometimes struggle when forced into narrow output spaces. The solution is training with constraints: fine-tune on data that always respects the constraint. For example, if you want JSON output, fine-tune on JSON examples; this teaches the model to use JSON patterns when constrained. Some frameworks (like Outlines, which vLLM integrates) support constraint tracking natively: they compute which tokens are valid at each step and only sample from that set. The downside: sometimes the model's best response isn't expressible under the constraint (e.g., "I can't answer this in JSON format"). Fallback strategies help: if the valid token set is empty, either relax the constraint slightly or emit a default valid token. Trade-offs are unavoidable.
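The empty-valid-set fallback can be sketched as a two-tier selection. The names here (`valid`, `relaxed`, `default`) are illustrative placeholders, not any library's API:

```python
# Fallback sketch: try the strict valid set, then a relaxed set,
# then a guaranteed-valid default token.
def pick_token(scores: dict[str, float], valid: set[str],
               relaxed: set[str], default: str) -> str:
    for allowed in (valid, relaxed):
        candidates = {t: s for t, s in scores.items() if t in allowed}
        if candidates:
            return max(candidates, key=candidates.get)
    return default  # last resort: a token known to keep the output valid

# Strict set is empty, so the relaxed set (closing brace) wins.
scores = {"maybe": 2.0, "}": 1.0}
print(pick_token(scores, valid=set(), relaxed={"}"}, default="null"))  # }
```

The default should itself be chosen from the constraint (e.g., `null` inside a JSON value position) so the fallback never breaks validity.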

Evaluating constrained generation requires checking both validity and quality. Validity is binary: does the output match the constraint? Quality is harder: does a valid but poor response count as success? Some pipelines use a two-stage approach: (1) generate with constraints for guaranteed validity, (2) score with a model or heuristic for quality. For structured data (JSON, SQL), you can validate syntax (does it parse?) and semantics (do the values make sense?). For less structured constraints (follow a regex), validity is all you have. In production, log the validity rate: if it drops below 95%, investigate whether the constraint is too tight or the model degraded. A single invalid output in a thousand might be acceptable; consistent invalidity is a red flag.
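Logging the validity rate is simple to wire up; this minimal monitor uses the 95% alert threshold from the rule of thumb above (the class and threshold are illustrative, not from any library):

```python
# Minimal validity-rate monitor for production logging.
class ValidityMonitor:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.valid = 0
        self.total = 0

    def record(self, is_valid: bool) -> None:
        self.total += 1
        self.valid += is_valid  # bool counts as 0 or 1

    @property
    def rate(self) -> float:
        return self.valid / self.total if self.total else 1.0

    def alert(self) -> bool:
        """True when the observed validity rate drops below the threshold."""
        return self.total > 0 and self.rate < self.threshold

m = ValidityMonitor()
for ok in [True] * 90 + [False] * 10:
    m.record(ok)
print(m.rate, m.alert())  # 0.9 True
```

In production you would record per-request validity and emit the rate to your metrics system; a sliding window rather than a lifetime counter catches sudden model regressions faster.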

Practical implementation of constrained decoding varies by framework. The Outlines integration in vLLM builds constraints directly into sampling: given a regex or grammar, it computes the set of valid next tokens and samples only from that set, and it is efficient because the FSM is precompiled. Hugging Face Transformers supports beam search with custom constraints and logits processors: generate beams, score each by constraint compliance and model likelihood, and rerank; this is slower but more flexible. For proprietary models (OpenAI), you are limited to post-hoc validation and regeneration, plus whatever the vendor enforces server-side. Some vendors offer constrained APIs (e.g., OpenAI's JSON mode forces valid JSON) where constraint enforcement happens on the server. When choosing tools, check constraint support: does it handle your use case, and with how much overhead?

Constraints vary in complexity. Simple constraints (vocabulary, token length, format) are fast to enforce. Medium constraints (regex patterns, simple grammars) require FSM compilation and state tracking. Complex constraints (arbitrary logic, context-sensitive rules) might require custom decoding logic. For very complex constraints, consider relaxing: instead of "must be exactly valid JSON", relax to "should look like JSON" and accept some invalid outputs. For safety-critical tasks, strict constraints are necessary (e.g., medical coding must produce valid ICD codes). For generative tasks, soft constraints (guide but don't enforce) might improve quality. The trade-off is always speed vs. quality: enforcing tight constraints slows generation and can produce awkward outputs. Test empirically: measure speed and quality with different constraint looseness levels and pick the best trade-off.

Constraint complexity varies by use case. For APIs, you might want valid JSON with specific fields; regex constraints suffice. For code generation, you want syntactically valid code; a grammar-based approach is necessary. For structured data (CSV, tables), column-level constraints (this column must be non-negative integers) are important. For NLP tasks (NER, relation extraction), span-level constraints (entity must be one of {person, location, organization}) are needed. Different constraints require different decoding strategies. Unified frameworks (like outlines) support multiple constraint types; task-specific systems are often faster but less flexible. For new applications, start with simple constraints (vocabulary masking) and only escalate to complex constraints (grammars) if necessary. Constraints enable reliable systems but at a cost—measure both quality and latency and ensure constraints are worth the overhead.