LLMs generate text token-by-token, unconstrained. For structured extraction — entity recognition, classification, JSON generation — we need guaranteed outputs matching a schema. Naive approach: ask the LLM to return JSON, parse it, pray it's valid. This fails regularly.
Common Failures

| Failure mode | Frequency | Cause |
| --- | --- | --- |
| Invalid JSON (trailing comma, unquoted key) | 5–15% | LLM doesn't enforce JSON spec |
| Missing required fields | 3–10% | LLM forgets constraint |
| Wrong type (string instead of int) | 2–8% | LLM guesses data type |
| Field name typo | 1–5% | Variation in field names |
| Extra unexpected fields | 2–7% | LLM adds context it thinks helpful |
Naive fix: add "return JSON" to prompt. Better: constrain generation. Best: guarantee schema compliance at token level.
💡Key insight: Structured output extraction is less about asking nicely, more about enforcing constraints at the generation level. Three approaches exist, with different tradeoffs.
02 — Solutions
Three Approaches to Structured Generation
Different techniques exist, ranging from simple to sophisticated. Choose based on robustness needs, latency budget, and what your model supports.
Approach 1: Native JSON Mode (Simple)
How: Some LLM APIs support a JSON mode that constrains output to valid JSON (OpenAI exposes it as `response_format={"type": "json_object"}`; Claude achieves the same via tool use). Pros: Simple, fast, one API parameter. Cons: Only validates JSON format, not schema adherence, so client-side validation is still needed. Best for: Quick prototypes, high-confidence tasks.
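A sketch of JSON mode in use (`response_format={"type": "json_object"}` is OpenAI's flag; `complete` stands in for the actual API call so the remaining client-side schema check is visible):

```python
import json

def extract_with_json_mode(complete, text: str) -> dict:
    # `complete` is any chat-completion callable; with the OpenAI SDK it would
    # wrap client.chat.completions.create(...) and return the message content.
    raw = complete(
        messages=[
            {"role": "system", "content": "Return a JSON object with keys name and age."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # format guarantee only
    )
    data = json.loads(raw)  # parsing now always succeeds...
    missing = {"name", "age"} - data.keys()  # ...but schema checks are still on you
    if missing:
        raise ValueError(f"model omitted required keys: {missing}")
    return data

# A stub standing in for the real API call:
fake = lambda messages, response_format: '{"name": "John Doe", "age": 28}'
print(extract_with_json_mode(fake, "John Doe, 28"))  # {'name': 'John Doe', 'age': 28}
```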
Approach 2: Post-Processing & Retry (Practical)
How: Generate a response, validate it against the schema, retry if invalid. The Instructor library automates this loop. Pros: Works with any LLM, simple to implement, handles ambiguous schemas. Cons: Latency overhead (retries add 100–500ms each), and success isn't guaranteed after N retries. Best for: Production systems where reliability matters and the latency is acceptable.
Approach 3: Constrained Decoding (Guaranteed)
How: Use Outlines or LMQL to enforce constraints at the token level. At each step, only allow tokens that keep output schema-valid. Pros: Guaranteed valid output, no retries needed. Cons: Requires access to model logits, only works with certain inference engines (vLLM, Ollama). Best for: Self-hosted models, strict schema requirements.
| Approach | Latency | Schema guarantee | Ease of use | Model support |
| --- | --- | --- | --- | --- |
| Native JSON mode | Fast | Format only | Easiest | Claude, GPT-4 |
| Post-processing + retry (Instructor) | Medium | With retries | Easy | All |
| Constrained decoding (Outlines) | Medium | Guaranteed | Medium | vLLM, local |
✓Recommendation: Start with Instructor + JSON mode. Fast to implement, covers 90% of cases. Move to Outlines if you need hard schema guarantees.
03 — Instructor Library
Instructor: Structured Extraction at Scale
Instructor is a Python library (works with many LLM APIs) that wraps LLMs and enforces Pydantic schema compliance. Define your data type as a Pydantic model, call the LLM, get back validated objects. Simple and powerful.
Core Pattern
Step 1: Define Pydantic model (your schema). Step 2: Create instructor client. Step 3: Call LLM with response_model=YourModel. Step 4: Get back validated instance. Step 5: Retries happen transparently if validation fails.
```python
import instructor
from pydantic import BaseModel, Field
from anthropic import Anthropic

# Define schema
class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int | None = Field(description="Age in years, or None if unknown")
    email: str | None = Field(description="Email address")
    role: str = Field(description="Job title or role")

# Create client (wraps Anthropic)
client = instructor.from_anthropic(Anthropic())

# Extract with guarantee of valid schema
person = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    response_model=Person,
    messages=[{
        "role": "user",
        "content": "Extract: John Doe, 28, john@company.com, Software Engineer"
    }],
)

# person is a validated Person instance
print(person.name)  # "John Doe"
print(person.age)   # 28
print(person.role)  # "Software Engineer"
```
Instructor Features
Automatic retry: If response fails validation, Instructor re-prompts with error message. Typical flow: 1 main call + 0–2 retries. Multiple models: Works with OpenAI, Anthropic, Cohere, local models. Async support: Full async/await support for concurrent extraction. Validation hooks: Use Pydantic field validators for custom logic.
⚠️Cost of retries: Each retry is a full LLM call (tokens). For 100k extraction tasks with 5% failure rate, expect ~5k retries = 5k extra API calls. Budget accordingly or tune prompt to reduce failures.
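The back-of-envelope math above, as a toy helper (the figures are this section's examples, not measurements):

```python
def retry_budget(tasks: int, failure_rate: float, cost_per_call: float) -> float:
    # Expected extra spend when each validation failure triggers one retry call.
    expected_retries = tasks * failure_rate
    return expected_retries * cost_per_call

# 100k tasks at a 5% first-pass failure rate and $0.001/call -> ~5k extra calls, ~$5
print(round(retry_budget(100_000, 0.05, 0.001), 2))  # 5.0
```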
04 — Constrained Decoding
Outlines: Token-Level Schema Enforcement
Outlines is a framework for constrained decoding. It modifies the LLM's generation loop to only produce tokens that maintain schema validity. Zero format failures, by design.
How Outlines Works
Compilation: Convert the schema (JSON Schema, Pydantic, regex) to a finite automaton. Generation: At each step, mask out tokens that would leave the automaton's valid states; the LLM can only choose schema-preserving tokens. Output: The generated text is guaranteed to match the schema.
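The masking step can be illustrated with a character-level toy (real engines like Outlines mask whole-token logits against the compiled automaton inside the inference engine; here `score` stands in for the model's logits and the "vocabulary" is single characters):

```python
import random

# Toy constrained decoding for an enum field: only characters that keep the
# prefix completable to an allowed value survive the mask.
ALLOWED = ["positive", "negative", "neutral"]

def valid_next_chars(prefix: str) -> set[str]:
    return {w[len(prefix)] for w in ALLOWED
            if w.startswith(prefix) and len(w) > len(prefix)}

def constrained_decode(score) -> str:
    # score(prefix, char) -> float stands in for the model's logit for `char`.
    out = ""
    while out not in ALLOWED:
        mask = valid_next_chars(out)  # masked-out characters can never be chosen
        out += max(mask, key=lambda c: score(out, c))
    return out

# Even a "model" emitting random scores cannot escape the schema:
random.seed(0)
print(constrained_decode(lambda prefix, c: random.random()) in ALLOWED)  # True
```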
Comparison with Instructor
| Aspect | Instructor | Outlines |
| --- | --- | --- |
| Validation method | Post-generation, with retries | Token-level constraints |
| Success rate | ~95% (after retries) | 100% (by design) |
| Latency per call | Single + retries | Slightly slower (masking overhead) |
| Model support | API models (OpenAI, Anthropic) | vLLM, Ollama (local models) |
| Ease of use | Very simple | Medium (requires inference engine) |
⚠️Outlines limitation: Requires access to model logits and inference control. Only works with vLLM, Ollama, or custom inference servers. Can't use OpenAI or Anthropic APIs directly (they don't expose token masking).
05 — Schema Definition
Pydantic Schemas for Extraction
Whether using Instructor, Outlines, or native JSON mode, define schemas with Pydantic. It's the standard for Python-based LLM extraction.
Writing Good Extraction Schemas
1. Use descriptive fields — clarity helps the LLM. Field descriptions guide the model; be explicit about what you want.
   Bad: `name: str`
   Good: `name: str = Field(description="Person's full name (first and last)")`
2. Use Optional for ambiguous fields — reduce failures. If a field might not exist, make it Optional; this cuts down retry loops.
   `phone: str | None = Field(default=None)` lets the LLM omit the field if it isn't found.
3. Use enums for categories — constrain choices. For classification, use an Enum instead of a bare str.
   `class Sentiment(str, Enum): positive = "positive" ...` — Outlines can then guarantee exact enum values.
4. Nest models for structure — compose schemas. Complex extractions use nested models: an `Address(BaseModel)` with street/city/country fields, referenced from `Person(BaseModel)` as `address: Address`.
```python
from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional

class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

class ReviewExtraction(BaseModel):
    product_name: str = Field(description="Name of product reviewed")
    rating: int = Field(ge=1, le=5, description="1-5 star rating")
    sentiment: Sentiment = Field(description="Overall sentiment")
    summary: str = Field(description="2-3 sentence summary of review")
    reviewer_name: Optional[str] = Field(default=None, description="Name of reviewer, if given")

# Instructor guarantees a ReviewExtraction instance:
# all fields present, rating 1-5, sentiment in enum, no extra fields
```
✓Schema design rule: When in doubt, make it Optional. An Optional field that's None is better than an exception. LLM can always omit uncertain data.
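A quick demonstration of how such a schema behaves at validation time (pure Pydantic, no LLM involved; a trimmed version of the schema above):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class ReviewExtraction(BaseModel):
    product_name: str
    rating: int = Field(ge=1, le=5)
    reviewer_name: Optional[str] = None  # omit rather than hallucinate

# An omitted Optional field comes back as None -- no exception:
ok = ReviewExtraction(product_name="Widget A", rating=4)
print(ok.reviewer_name)  # None

# A constraint violation raises ValidationError -- the same signal
# Instructor feeds back to the model on retry:
try:
    ReviewExtraction(product_name="Widget A", rating=9)
except ValidationError as err:
    print(err.error_count())  # 1
```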
06 — Robustness
Error Handling and Retry Strategies
Even with Instructor's automatic retries, some extractions fail. Plan for it.
Common Failure Patterns
LLM refusal: The LLM declines to extract (safety filter); handle gracefully. Ambiguous input: The text doesn't clearly contain the requested data; return None for Optional fields instead of hallucinating. Timeout: Extraction takes >30s (stuck in a retry loop); enforce a max_retries limit and a per-call timeout. Malformed JSON: Rare with Instructor, but possible with native JSON mode.
Best Practices
| Scenario | Solution |
| --- | --- |
| Extraction fails after N retries | Return partial result or None, log for review |
| Input is too short or ambiguous | Set optional fields to None; don't retry |
| LLM refuses to extract | Catch exception, fall back to manual/default |
| Want to debug a failure | Log full prompt, response, error; analyze |
| Batch extraction with high volume | Set max_retries=2, timeout per task, skip problematic items |
💡Retry strategy: 1–2 retries is usually enough. Beyond that, diminishing returns. If it fails twice, it's likely the input is genuinely ambiguous. Better to return None and flag for human review than waste tokens on retries.
```python
# Production-grade extraction with retry and fallback
# pip install instructor openai pydantic tenacity
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List
import instructor, openai
from tenacity import retry, stop_after_attempt, wait_exponential

client = instructor.from_openai(openai.OpenAI())

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=1)
    unit_price: float = Field(ge=0)
    total: float

    @field_validator("total")
    @classmethod
    def validate_total(cls, v, info):
        expected = info.data.get("quantity", 1) * info.data.get("unit_price", 0)
        if abs(v - expected) > 0.01:
            raise ValueError(f"total {v} != quantity × unit_price {expected:.2f}")
        return v

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    date: str
    line_items: List[LineItem]
    subtotal: float
    tax_rate: Optional[float] = 0.0
    total_due: float

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=4))
def extract_invoice(raw_text: str) -> Invoice:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Invoice,
        messages=[
            {"role": "system", "content": "Extract invoice data precisely. Validate all totals."},
            {"role": "user", "content": raw_text},
        ],
        max_retries=2,  # instructor-level retries for validation errors
    )

invoice = extract_invoice("Invoice #1042 from Acme Corp, 2024-03-15. "
                          "3x Widget A @ $12.50 = $37.50. Tax 8%. Total: $40.50")
print(invoice.model_dump_json(indent=2))
```
07 — Scale & Production
Production Extraction Patterns
Structured extraction at scale requires monitoring, batching, and fallbacks.
Patterns for Production
Batch extraction: Queue incoming tasks, extract in batches (10–100). Cheaper and faster than one-by-one. Caching: Same input → same output. Cache results for duplicate texts. Model selection: Use fast models (Haiku) for simple schemas, larger models (Sonnet) for complex ones. Monitoring: Track success rate, latency, cost per extraction. Alert on drops. Fallback: If extraction fails, either return None, use rule-based fallback, or queue for human review.
Example: Production Extraction Pipeline
1. Receive document. 2. Check cache. 3. If miss, queue extraction. 4. Call Instructor with timeout=10s, max_retries=2. 5. If success, cache result, return. 6. If fail after retries, log + route to human review queue. 7. Monitor: track success%, avg latency, cost/task.
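The pipeline above can be sketched as follows (the in-memory `cache` and `review_queue` are stand-ins for Redis/SQLite and a real review system; `extract` stands in for the Instructor call):

```python
import hashlib
from typing import Callable, Optional

cache: dict[str, dict] = {}      # swap for Redis/SQLite in a real deployment
review_queue: list[str] = []     # failed extractions routed to human review

def extract_document(text: str, extract: Callable[[str], dict],
                     max_retries: int = 2) -> Optional[dict]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in cache:                       # step 2: duplicate input -> cached result
        return cache[key]
    for _ in range(max_retries + 1):       # step 4: bounded retries
        try:
            result = extract(text)         # the Instructor call would go here
            cache[key] = result            # step 5: cache on success
            return result
        except Exception:
            continue
    review_queue.append(text)              # step 6: route to human review
    return None
```

A real pipeline would add a per-call timeout and emit the success-rate, latency, and cost metrics from step 7.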
⚠️Cost optimization: Structured extraction can be expensive at scale. 100k tasks × $0.001 per task = $100 base, plus retries. Optimize by: reducing schema complexity, using cheaper models where possible, caching aggressively.
08 — Further Reading
References and Related Concepts
Child Concepts
Instructor — Python library for structured extraction