Structured Generation

LLM Output Control

JSON mode, constrained decoding, grammar-guided generation, and structured output libraries

  • prompt → schema → validated output: the guarantee chain
  • logit masking: how constraints work
  • Pydantic + LLM: the production standard
Contents
  1. The problem with free-text output
  2. Native structured output modes
  3. Instructor library
  4. Constrained decoding with Outlines
  5. Grammar-guided generation and LMQL
  6. Classification and choice constraints
  7. Production patterns
01 — Core Challenge

The Problem with Free-Text Output

LLMs produce free text by default. Production systems need structured data: JSON objects, typed fields, validated schemas.

Three approaches, in increasing order of reliability: (1) prompt engineering ("respond only in JSON"), (2) native structured output modes (OpenAI/Anthropic), (3) constrained decoding (grammar-guided token sampling).

Why prompting alone fails: the model can still produce a preamble ("Sure! Here's the JSON: ..."), truncate mid-object, or append trailing text. This is brittle at scale.

⚠️ Never use json.loads() on raw LLM output in production without a try/except and retry loop. Even "JSON mode" can produce subtly malformed output on edge cases.
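A minimal sketch of such a defensive loop. `call_model` is a hypothetical stand-in for your LLM client, and the fence-stripping and error-feedback details are illustrative, not a fixed recipe:

```python
import json

def parse_json_with_retry(call_model, prompt, max_attempts=3):
    """Call the model, try to parse JSON, and retry with the error on failure."""
    for attempt in range(max_attempts):
        raw = call_model(prompt)
        # Strip common wrappers like ```json fences before parsing
        cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the model can self-correct
            prompt = f"{prompt}\n\nYour last reply was not valid JSON ({err}). Reply with JSON only."
    raise ValueError(f"no valid JSON after {max_attempts} attempts")

# Simulated model: fails once with a preamble, then returns clean JSON
replies = iter(['Sure! Here is the JSON: {"a": 1}', '{"a": 1}'])
result = parse_json_with_retry(lambda p: next(replies), "Return the object as JSON")
# result is the parsed dict {"a": 1}
```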
02 — Provider-Native

Native Structured Output Modes

OpenAI response_format: set to {"type": "json_object"} for best-effort JSON, or pass a Pydantic model as response_format to client.beta.chat.completions.parse for strict schema enforcement

Anthropic: define a tool whose input schema matches your target structure and force it with tool_choice; the model responds with tool_use blocks that follow the schema

Strict mode (OpenAI): guarantees output exactly matches your JSON Schema. No extra fields, no missing required fields. Uses constrained decoding under the hood.

OpenAI Strict Structured Output

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class DocumentAnalysis(BaseModel):
    title: str
    sentiment: Literal["positive", "negative", "neutral"]
    key_points: list[str]
    confidence: float
    requires_human_review: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze the document and extract structured information."},
        {"role": "user", "content": document_text},
    ],
    response_format=DocumentAnalysis,
)

result: DocumentAnalysis = response.choices[0].message.parsed
# result is a typed Pydantic object — no parsing needed

Structured Output Methods Comparison

Method | Reliability | Schema support | Retry needed | Library
Prompt ("return JSON") | Low | Informal | Often | None
json_object mode | Medium | None (any JSON) | Sometimes | OpenAI
Strict parse (Pydantic) | High | Full Pydantic | Rarely | OpenAI beta
Tool/function calling | High | JSON Schema | Rarely | OpenAI/Anthropic
Outlines (constrained) | Very high | Regex/EBNF/JSON | Almost never | Outlines
03 — Cross-Provider

Instructor Library

instructor wraps any LLM client (OpenAI, Anthropic, Gemini, local) and adds Pydantic validation plus automatic retries. On validation failure, it feeds the error back to the model together with the original request so the model can self-correct.

Works across providers: single API, swap model without changing structured output code.

Instructor Example

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, field_validator

client = instructor.from_anthropic(Anthropic())

class ExtractedEntity(BaseModel):
    name: str
    entity_type: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_range(cls, v):
        assert 0 <= v <= 1, "confidence must be 0-1"
        return v

class ExtractionResult(BaseModel):
    entities: list[ExtractedEntity]
    summary: str

# instructor handles validation + retries automatically
result = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    response_model=ExtractionResult,
    messages=[{"role": "user", "content": f"Extract entities from: {text}"}],
)
# result.entities is a validated list[ExtractedEntity]
instructor's max_retries=3 parameter means the model gets up to 3 attempts to fix validation errors. For complex nested schemas, this dramatically reduces production failures.
04 — Token-Level

Constrained Decoding with Outlines

Constrained decoding: instead of sampling freely from the full vocabulary, mask out tokens that would violate the schema at each generation step
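The core trick can be illustrated in plain Python: set the logits of disallowed tokens to negative infinity, then renormalize, so they receive zero probability. Real libraries compile the schema into a finite-state machine over tokens; the toy vocabulary and logit values below are invented for illustration:

```python
import math

def mask_logits(logits, allowed_ids):
    """Set logits of disallowed tokens to -inf so they get zero probability."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary: 0='{', 1='}', 2='"', 3='hello'
logits = [0.1, 2.5, 1.0, 3.0]   # unconstrained, the model prefers token 3
allowed = {0}                   # the schema says a JSON object must start with '{'
probs = softmax(mask_logits(logits, allowed))
# probs[0] == 1.0; every disallowed token has probability 0.0
```

Sampling from `probs` can now only ever produce a schema-valid next token, which is why the guarantee holds at the token level rather than relying on the model's cooperation.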

Outlines: open-source library that intercepts token logits and zeroes out invalid tokens before sampling. Works with any local model (Transformers, vLLM).

Supported constraints: JSON Schema, Pydantic models, regex patterns, EBNF grammars, choice from a list

Outlines with Local Model

import outlines
from pydantic import BaseModel

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

class Movie(BaseModel):
    title: str
    director: str
    year: int
    genre: list[str]

# Constrained generation — guaranteed valid JSON matching the Movie schema
generator = outlines.generate.json(model, Movie)
movie = generator("Extract movie details from: The Godfather (1972) directed by Coppola")
# movie is a Movie object — no parsing, no retries needed

# Regex constraint
phone_gen = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
phone = phone_gen("What's the phone number? Area code 555, number 867-5309")
# guaranteed to produce NNN-NNN-NNNN format

Outlines vs Native Structured Output

Aspect | OpenAI strict mode | Outlines (local)
Guarantee | Schema-level | Token-level (absolute)
Latency | API round-trip | Local GPU
Model support | gpt-4o and later OpenAI models | Any HF/GGUF model
Grammar support | JSON Schema only | JSON, regex, EBNF
Cost | Per token | Infrastructure only
05 — Grammar Approach

Grammar-Guided Generation and LMQL

EBNF grammars: define the exact syntax of valid outputs. Any context-free grammar can be enforced.

llama.cpp grammar: --grammar or --grammar-file flag to enforce output format at inference time

LMQL (Language Model Query Language): SQL-like language for constrained generation with variables, conditionals, and loops

Guidance: Microsoft library for interleaving Python control flow with LLM generation. Template with {{gen ...}} blocks.

llama.cpp Grammar for Structured Address

# address.gbnf
root  ::= city "," ws state ws zip
city  ::= [a-zA-Z ]+
state ::= [A-Z][A-Z]
zip   ::= [0-9][0-9][0-9][0-9][0-9]
ws    ::= " "?

# Usage:
# ./llama-cli -m model.gguf --grammar-file address.gbnf \
#   -p "What city is the Eiffel Tower in? Answer:"
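As an application-side sanity check, the same shape can be validated with a regular expression equivalent to the grammar above. This is a sketch for defense in depth; at inference time the .gbnf file is what enforces the format:

```python
import re

# Mirrors address.gbnf: city "," ws state ws zip, with ws ::= " "?
ADDRESS_RE = re.compile(r"^[a-zA-Z ]+, ?[A-Z]{2} ?[0-9]{5}$")

assert ADDRESS_RE.fullmatch("Paris, TX 75460")
assert not ADDRESS_RE.fullmatch("Paris TX 75460")  # comma is required by the grammar
```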
06 — Specialized

Classification and Choice Constraints

Multiple-choice: constrain output to exactly one of N options — eliminates ambiguous paraphrasing

Outlines generate.choice: guarantees output is one of the provided strings, no variations

Choice Constraint for Classification

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Guarantee exactly one of these labels — no "Positive.", "POSITIVE", etc.
sentiment = outlines.generate.choice(
    model, ["positive", "negative", "neutral", "mixed"]
)
label = sentiment("Classify: 'The product works but shipping was slow'")
# label is guaranteed to be one of the four strings

# Token efficiency: when the choices diverge at the first token,
# a single forward pass is enough to resolve the classification
⚠️ For classification with >2 classes, constrained decoding is almost always more reliable than parsing free text. The latency cost is negligible compared to the reliability gain.
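When every label begins with a distinct token, the choice can be resolved by comparing first-token scores from one forward pass. A toy sketch of that idea; the token scores and the three-character `tokenize` stand-in are invented for illustration:

```python
def classify_by_first_token(first_token_scores, labels, tokenize):
    """Pick the label whose first token has the highest model score."""
    best = None
    for label in labels:
        first = tokenize(label)[0]
        score = first_token_scores.get(first, float("-inf"))
        if best is None or score > best[1]:
            best = (label, score)
    return best[0]

# Hypothetical logits for candidate first tokens from a single forward pass
scores = {"pos": -0.2, "neg": -1.3, "neu": -2.0, "mix": -3.1}
labels = ["positive", "negative", "neutral", "mixed"]
label = classify_by_first_token(scores, labels, tokenize=lambda s: [s[:3]])
# label == "positive"
```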
07 — Deployment

Production Patterns

Validate at the boundary

  • Always validate LLM output at the application layer with Pydantic, even when using strict mode
  • Defense in depth
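A stdlib-only sketch of re-validating at the boundary (in production you would typically call `Model.model_validate` on a Pydantic schema; the field names here mirror the DocumentAnalysis example earlier and are otherwise illustrative):

```python
def validate_analysis(data: dict) -> dict:
    """Re-check the LLM payload at the application boundary, even after strict mode."""
    allowed_sentiments = {"positive", "negative", "neutral"}
    if not isinstance(data.get("title"), str):
        raise ValueError("title must be a string")
    if data.get("sentiment") not in allowed_sentiments:
        raise ValueError(f"sentiment must be one of {allowed_sentiments}")
    confidence = data.get("confidence")
    if not isinstance(confidence, float) or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a float in [0, 1]")
    return data

ok = validate_analysis({"title": "Q3 report", "sentiment": "neutral", "confidence": 0.9})
```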

Retry with error context

  • On validation failure, send the error message back to the model: "Your previous output failed validation: {error}. Please fix it."
  • Up to 3 retries
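The retry loop can be sketched as follows. `call_model` and `validate` are hypothetical stand-ins for your LLM client and schema check; the error-feedback wording matches the pattern above:

```python
def generate_validated(call_model, prompt, validate, max_retries=3):
    """Retry generation, appending the validation error to the prompt each time."""
    messages = prompt
    for _ in range(max_retries):
        output = call_model(messages)
        try:
            return validate(output)
        except ValueError as err:
            messages = (f"{prompt}\n\nYour previous output failed validation: "
                        f"{err}. Please fix it.")
    raise RuntimeError(f"validation failed after {max_retries} retries")

# Simulated model: out-of-range confidence first, corrected on the retry
replies = iter([{"confidence": 1.7}, {"confidence": 0.7}])

def check(out):
    if not 0 <= out["confidence"] <= 1:
        raise ValueError("confidence out of range")
    return out

result = generate_validated(lambda m: next(replies), "Score this", check)
# result == {"confidence": 0.7}
```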

📋 Schema versioning

  • Treat your output schema as part of your API contract
  • Schema changes should be versioned
  • Test new schemas against your golden eval set before deploying

Graceful degradation

  • When structured output fails after retries, fall back to a simpler schema or return a "parsing failed" sentinel
  • Log failures for schema analysis
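One way to wire the fallback chain together, using hypothetical extractor callables and a pluggable logger:

```python
def extract_with_fallback(strict_extract, simple_extract, text, log):
    """Try the rich schema first; degrade to a simpler one; finally return a sentinel."""
    try:
        return strict_extract(text)
    except ValueError as err:
        log(f"strict schema failed: {err}")
    try:
        return simple_extract(text)
    except ValueError as err:
        log(f"fallback schema failed: {err}")
    return {"status": "parsing_failed", "raw_text": text}

def strict(text):
    raise ValueError("missing required field 'entities'")  # simulate a hard failure

failures = []
result = extract_with_fallback(
    strict,
    lambda t: {"summary": t[:20]},     # simpler schema: just a summary
    "Quarterly revenue grew 12 percent",
    failures.append,
)
# result degrades to the simple schema; failures records the strict-schema error
```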

Tools & Libraries

  • instructor (validation): cross-provider structured outputs with auto-retry
  • Outlines (local): constrained decoding for local models
  • Guidance (generation): interleave Python control flow with generation
  • LMQL (query): SQL-like constrained generation language
  • Pydantic (schema): data validation and serialization
  • LangChain (framework): output parsers and chains
  • vLLM (inference): guided decoding for batch inference
  • Marvin (patterns): Pydantic integration for structured outputs
08 — Further Reading

References

Documentation & Guides