01 — Core Challenge
The Problem with Free-Text Output
LLMs produce free text by default. Production systems need structured data: JSON objects, typed fields, validated schemas.
Three approaches, in increasing order of reliability: (1) prompt engineering ("respond only in JSON"), (2) native structured output modes (OpenAI/Anthropic), (3) constrained decoding (grammar-guided token sampling).
Why prompting alone fails: the model can still emit a preamble ("Sure! Here's the JSON: ..."), truncate mid-object, or append trailing text. Brittle at scale.
⚠️
Never use json.loads() on raw LLM output in production without a try/except and retry loop. Even "JSON mode" can produce subtly malformed output on edge cases.
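A minimal defensive parsing loop looks like this. It is a sketch: `call_llm` is a hypothetical stand-in for a real model call (here it returns a canned response with a preamble, exactly the failure mode described above):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical model call; returns raw text that should contain JSON.
    return 'Sure! Here is the JSON: {"status": "ok"}'

def extract_json(raw: str) -> dict:
    """Strip any preamble/trailing text by slicing first '{' to last '}'."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return json.loads(raw[start : end + 1])

def parse_with_retry(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        try:
            return extract_json(call_llm(prompt))
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc       # in production: re-prompt, optionally with the error
    raise RuntimeError(f"unparseable after {max_attempts} attempts: {last_error}")

print(parse_with_retry("Return status as JSON."))
```

The first-`{`-to-last-`}` slice is a crude heuristic; the later sections show stronger guarantees that make this scaffolding mostly unnecessary.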
02 — Provider-Native
Native Structured Output Modes
OpenAI response_format: set to {"type": "json_object"} for best-effort JSON, or pass response_format=MyPydanticModel to beta.chat.completions.parse for strict schema enforcement
Anthropic: pass tool definitions — the model returns tool_use blocks shaped by the schema; conformance is strong but the arguments should still be validated
Strict mode (OpenAI): guarantees output exactly matches your JSON Schema. No extra fields, no missing required fields. Uses constrained decoding under the hood.
OpenAI Strict Structured Output
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class DocumentAnalysis(BaseModel):
    title: str
    sentiment: Literal["positive", "negative", "neutral"]
    key_points: list[str]
    confidence: float
    requires_human_review: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze the document and extract structured information."},
        {"role": "user", "content": document_text},
    ],
    response_format=DocumentAnalysis,
)

result: DocumentAnalysis = response.choices[0].message.parsed
# result is a typed Pydantic object — no parsing needed
Structured Output Methods Comparison
| Method | Reliability | Schema support | Retry needed | Library |
| --- | --- | --- | --- | --- |
| Prompt ("return JSON") | Low | Informal | Often | None |
| json_object mode | Medium | None (any JSON) | Sometimes | OpenAI |
| Strict parse (Pydantic) | High | Full Pydantic | Rarely | OpenAI beta |
| Tool/function calling | High | JSON Schema | Rarely | OpenAI/Anthropic |
| Outlines (constrained) | Very high | Regex/EBNF/JSON | Almost never | Outlines |
03 — Cross-Provider
Instructor Library
instructor wraps any LLM (OpenAI, Anthropic, Gemini, local) and adds Pydantic validation + automatic retries. On validation failure: feeds the error back to the model with the original request — model self-corrects.
Works across providers: single API, swap model without changing structured output code.
Instructor Example
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, field_validator

client = instructor.from_anthropic(Anthropic())

class ExtractedEntity(BaseModel):
    name: str
    entity_type: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_range(cls, v):
        if not 0 <= v <= 1:
            raise ValueError("confidence must be between 0 and 1")
        return v

class ExtractionResult(BaseModel):
    entities: list[ExtractedEntity]
    summary: str

# instructor handles validation + retries automatically
result = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    max_retries=3,  # re-prompt with the validation error up to 3 times
    response_model=ExtractionResult,
    messages=[{"role": "user", "content": f"Extract entities from: {text}"}],
)
# result.entities is a validated list[ExtractedEntity]
✓
instructor's max_retries=3 parameter means the model gets up to 3 attempts to fix validation errors. For complex nested schemas, this dramatically reduces production failures.
04 — Token-Level
Constrained Decoding with Outlines
Constrained decoding: instead of sampling freely from the full vocabulary, mask out tokens that would violate the schema at each generation step
Outlines: open-source library that intercepts token logits and zeroes out invalid tokens before sampling. Works with any local model (Transformers, vLLM).
Supported constraints: JSON Schema, Pydantic models, regex patterns, EBNF grammars, choice from a list
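The core masking idea can be sketched in a few lines of toy code. This is a simplified illustration with a made-up seven-token vocabulary, not Outlines' real FSM-compiled implementation:

```python
import math

# Toy vocabulary with unnormalized logits from one decoding step
logits = {'{': 1.2, '}': 0.3, '"': 0.9, 'name': 2.5, ':': 0.1, '!': 3.0, 'hello': 0.7}

def mask_logits(logits, allowed):
    """Set every token the constraint disallows to -inf before sampling."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

# Suppose the JSON constraint says an object must open with '{'
masked = mask_logits(logits, allowed={'{'})

# Greedy sampling now cannot pick '!' even though it has the highest raw logit
chosen = max(masked, key=masked.get)
print(chosen)  # → '{'
```

The real library precompiles the schema into a finite-state machine so that the allowed-token set for each state is a cheap lookup rather than a per-step scan.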
Outlines with Local Model
import outlines
from pydantic import BaseModel

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

class Movie(BaseModel):
    title: str
    director: str
    year: int
    genre: list[str]

# Constrained generation — guaranteed valid JSON matching the Movie schema
generator = outlines.generate.json(model, Movie)
movie = generator("Extract movie details from: The Godfather (1972) directed by Coppola")
# movie is a Movie object — no parsing, no retries needed

# Regex constraint
phone_gen = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
phone = phone_gen("What's the phone number? Area code 555, number 867-5309")
# guaranteed to produce NNN-NNN-NNNN format
Outlines vs Native Structured Output
| Aspect | OpenAI strict mode | Outlines (local) |
| --- | --- | --- |
| Guarantee | Schema-level | Token-level (absolute) |
| Latency | API round-trip | Local GPU |
| Model | gpt-4o and later OpenAI models | Any HF/GGUF model |
| Grammar support | JSON Schema only | JSON, regex, EBNF |
| Cost | Per token | Infrastructure only |
05 — Grammar Approach
Grammar-Guided Generation and LMQL
EBNF grammars: define the exact syntax of valid outputs. Any context-free grammar can be enforced.
llama.cpp grammar: --grammar or --grammar-file flag to enforce output format at inference time
LMQL (Language Model Query Language): SQL-like language for constrained generation with variables, conditionals, and loops
Guidance: Microsoft library for interleaving Python control flow with LLM generation. Template with {{gen ...}} blocks.
llama.cpp Grammar for Structured Address
# address.gbnf
root ::= city "," ws state ws zip
city ::= [a-zA-Z ]+
state ::= [A-Z][A-Z]
zip ::= [0-9][0-9][0-9][0-9][0-9]
ws ::= " "?
# Usage:
# ./llama-cli -m model.gguf --grammar-file address.gbnf \
# -p "What city is the Eiffel Tower in? Answer:"
06 — Specialized
Classification and Choice Constraints
Multiple-choice: constrain output to exactly one of N options — eliminates ambiguous paraphrasing
Outlines generate.choice: guarantees output is one of the provided strings, no variations
Choice Constraint for Classification
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
# Guarantee exactly one of these labels — no "Positive.", "POSITIVE", etc.
sentiment = outlines.generate.choice(
model, ["positive", "negative", "neutral", "mixed"]
)
label = sentiment("Classify: 'The product works but shipping was slow'")
# label is guaranteed to be one of the four strings
# Token efficiency: first-token decoding for binary choices
# Only one forward pass needed when choices diverge at token 1
⚠️
For classification with >2 classes, constrained decoding is almost always more reliable than parsing free text. The latency cost is negligible compared to the reliability gain.
07 — Deployment
Production Patterns
✓ Validate at the boundary
- Always validate LLM output at the application layer with Pydantic, even when using strict mode
- Defense in depth
⟲ Retry with error context
- On validation failure, send the error message back to the model: "Your previous output failed validation: {error}. Please fix it."
- Up to 3 retries
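The retry-with-error-context pattern can be sketched with stdlib-only code. `call_model` is a hypothetical model call (here returning a canned valid response), and `validate` stands in for whatever schema check the application uses:

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical model call; canned output for illustration.
    return '{"title": "Q3 summary", "score": 0.87}'

def validate(data) -> "str | None":
    """Return an error message, or None if the payload is valid."""
    if not isinstance(data, dict):
        return "output must be a JSON object"
    if not isinstance(data.get("title"), str):
        return "field 'title' must be a string"
    if not isinstance(data.get("score"), float):
        return "field 'score' must be a number"
    return None

def structured_call(prompt: str, max_retries: int = 3) -> dict:
    current = prompt
    for _ in range(max_retries):
        raw = call_model(current)
        try:
            data = json.loads(raw)
            error = validate(data)
        except json.JSONDecodeError as exc:
            error = str(exc)
        if error is None:
            return data
        # Feed the failure back so the model can self-correct
        current = f"{prompt}\n\nYour previous output failed validation: {error}. Please fix it."
    raise RuntimeError("structured output failed after retries")

print(structured_call("Summarize Q3 as JSON with title and score."))
```

This is exactly what instructor automates; rolling it by hand is mainly useful for providers or local stacks without library support.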
📋 Schema versioning
- Treat your output schema as part of your API contract
- Schema changes should be versioned
- Test new schemas against your golden eval set before deploying
⚡ Graceful degradation
- When structured output fails after retries, fall back to a simpler schema or return a "parsing failed" sentinel
- Log failures for schema analysis
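A minimal sketch of the degradation ladder, using required-key checks in place of a full schema library (the key names are illustrative, not from the source):

```python
import json

REQUIRED_FULL = ("title", "key_points", "confidence")  # rich schema
REQUIRED_MINIMAL = ("title",)                          # simpler fallback
PARSING_FAILED = {"status": "parsing_failed"}          # sentinel for downstream code

def degrade(raw: str) -> dict:
    """Try the rich schema first, then the fallback, then return a sentinel."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return PARSING_FAILED  # in production: log the failure for schema analysis
    for required in (REQUIRED_FULL, REQUIRED_MINIMAL):
        if isinstance(data, dict) and all(k in data for k in required):
            return data
    return PARSING_FAILED

print(degrade('{"title": "ok", "key_points": [], "confidence": 0.9}'))  # full schema
print(degrade('{"title": "ok"}'))  # accepted under the fallback schema
print(degrade('not json'))         # sentinel
```

Downstream code checks for the sentinel explicitly, so a schema failure degrades a single request instead of raising through the pipeline.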