Prompt Engineering

JSON Mode

Provider-level enforcement that guarantees LLM output is syntactically valid JSON, enabling reliable structured data extraction without retry loops.

100% valid JSON guaranteed · Zero parse errors · Works with any schema

SECTION 01

The JSON-from-LLM problem

You ask an LLM to "return your answer as JSON". Sometimes it wraps the JSON in markdown fences (```json ... ```). Sometimes it adds a preamble sentence. Sometimes the keys are inconsistently quoted. Your json.loads() call throws an exception in production at 2 a.m.

JSON mode is the provider's solution: it modifies the decoding process (or constrains the output format) so the model can only produce syntactically valid JSON — no fences, no prose, no trailing commas.

SECTION 02

How JSON mode works

Under the hood, JSON mode uses one of two techniques depending on the provider:

Logit masking (used by some local frameworks): at each token step, tokens that would produce invalid JSON are assigned probability 0. The model can only extend a valid JSON prefix.

Instruction following via fine-tuning and RLHF (used by hosted APIs such as OpenAI and Anthropic): the model is trained to comply with a "respond in JSON" instruction with very high reliability, sometimes combined with a post-processing validity check; stricter modes such as OpenAI's structured outputs additionally constrain decoding against a schema.

For practical purposes: both approaches give you effectively guaranteed valid JSON syntax (not schema correctness — just syntactic validity).
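The logit-masking idea can be sketched with a toy decoder. This is an illustrative reduction — a hand-picked string "vocabulary" and a cheap brace/quote prefix check standing in for a real tokenizer and a full JSON grammar — not any framework's actual implementation:

```python
# Toy illustration of logit masking: before each decoding step, filter the
# vocabulary down to tokens that keep the output a valid JSON *prefix*.
# Real constrained-decoding backends do this over the model's token
# vocabulary using a proper grammar; this check is deliberately minimal.

def is_valid_json_prefix(s: str) -> bool:
    """Cheap structural check: balanced brackets, well-formed strings."""
    stack, in_string, escaped = [], False, False
    for ch in s:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack or "{["["}]".index(ch)] != stack.pop():
                return False  # mismatched or extra closing bracket
    return True  # a prefix may be incomplete, so an open stack is fine

def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """The mask: only tokens that extend a valid JSON prefix survive."""
    return [t for t in vocab if is_valid_json_prefix(prefix + t)]

vocab = ['{', '}', '[', ']', '"name"', ':', '42']
# After the prefix below, '}' would mismatch the open '[' and is masked out.
print(allowed_tokens('{"a": [1, 2', vocab))
```

With the mask applied at every step, the decoder literally cannot emit a token that breaks JSON structure — which is why this family of approaches can make a syntactic guarantee.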

SECTION 03

Anthropic: the prefill trick

Anthropic doesn't offer a dedicated "JSON mode" flag, but there's an elegant workaround: prefill the assistant turn with an opening brace. The model then continues the JSON.

import anthropic, json

client = anthropic.Anthropic()

def extract_json(user_prompt: str, system_prompt: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {"role": "user",      "content": user_prompt},
            {"role": "assistant", "content": "{"}  # ← prefill forces JSON
        ]
    )
    # The response continues from "{", so prepend it back
    raw = "{" + response.content[0].text
    return json.loads(raw)

result = extract_json(
    user_prompt="Extract: John Smith, age 34, lives in Austin TX",
    system_prompt="Extract person information and return ONLY a JSON object with keys: name, age, city, state."
)
print(result)
# {"name": "John Smith", "age": 34, "city": "Austin", "state": "TX"}

The prefill trick is highly reliable — the model starts generating JSON tokens immediately and has no "room" to add prose before the JSON.

SECTION 04

OpenAI JSON mode vs. structured outputs

from openai import OpenAI
import json

client = OpenAI()

# JSON mode (syntax only — no schema enforcement)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract entity info as JSON."},
        {"role": "user",   "content": "Alice Wong is a 28-year-old engineer in Seattle."}
    ]
)
data = json.loads(response.choices[0].message.content)

# Structured outputs (schema enforcement — requires JSON Schema)
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    city: str
    role: str

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract entity info."},
        {"role": "user",   "content": "Alice Wong is a 28-year-old engineer in Seattle."}
    ],
    response_format=Person,
)
person = response.choices[0].message.parsed
print(person.age)  # 28, type: int — schema enforced

JSON mode: valid syntax, but the model decides keys/types. Good for exploratory extraction.

Structured outputs: enforced schema. The model cannot deviate from the Pydantic model definition. Use this for production pipelines.

SECTION 05

Schema enforcement beyond syntax

JSON mode guarantees syntax (json.loads won't throw), but not schema (you might get unexpected keys, wrong types, or missing required fields). Validate with Pydantic:

from pydantic import BaseModel, ValidationError
from typing import Literal
import json

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float      # 0.0–1.0
    key_phrases: list[str]

raw = extract_json(
    user_prompt="The battery life is incredible but the screen is dim.",
    system_prompt="Return JSON with: sentiment (positive/negative/neutral), confidence (0-1 float), key_phrases (list of strings)."
)

try:
    result = SentimentResult(**raw)
    print(result.sentiment, result.confidence)
except ValidationError as e:
    print("Schema mismatch:", e)
    # Handle: retry with stronger instructions, default values, etc.

SECTION 06

Extraction patterns

Nested extraction:

system = '''
Extract structured data from the invoice text.
Return ONLY valid JSON matching this exact schema:
{
  "vendor": string,
  "total_amount": number,
  "line_items": [{"description": string, "quantity": number, "unit_price": number}],
  "due_date": string (ISO 8601)
}
'''

Null handling: Always tell the model what to return for missing fields:

system = '''
Extract fields if present; use null for missing values.
Return: {"name": string|null, "email": string|null, "phone": string|null}
'''

Array extraction: For extracting multiple entities from a document:

system = '''
Extract all people mentioned. Return: {"people": [{"name": string, "role": string}]}
If no people mentioned, return: {"people": []}
'''
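The array pattern pairs naturally with the Pydantic validation from Section 05: a nested model validates every element of the list, not just the outer object. The JSON literal below is a stand-in for a model response:

```python
from pydantic import BaseModel
import json

class PersonEntry(BaseModel):
    name: str
    role: str

class PeopleResult(BaseModel):
    people: list[PersonEntry]  # validates each element of the array

# Stand-in for a model response to the array-extraction prompt above
raw = json.loads('{"people": [{"name": "Ada Lovelace", "role": "mathematician"}]}')
result = PeopleResult(**raw)
print(len(result.people), result.people[0].name)  # 1 Ada Lovelace

# The empty case the prompt explicitly allows for validates too:
assert PeopleResult(**json.loads('{"people": []}')).people == []
```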

SECTION 07

Gotchas

JSON mode ≠ schema mode. The model can return syntactically valid JSON with entirely wrong keys or types. Always validate with Pydantic or jsonschema after parsing.

Streaming + JSON is awkward. Streaming emits tokens one at a time — you can't parse JSON until the stream ends. Buffer the full response before parsing, or use structured streaming if your provider supports it.

Large JSON = truncation risk. If max_tokens is too low, the JSON will be cut mid-object and fail to parse. For large extractions, increase max_tokens generously or chunk the input.
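When truncation does happen, a best-effort repair — closing any unterminated string, dropping a dangling comma, then closing unbalanced brackets — can sometimes salvage the partial object. This is a heuristic sketch, not a substitute for raising max_tokens:

```python
import json

def repair_truncated_json(s: str) -> str:
    """Best-effort repair of JSON cut off mid-object (heuristic)."""
    stack, in_string, escaped = [], False, False
    for ch in s:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")  # remember the closer
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        s += '"'                       # close an unterminated string
    s = s.rstrip().rstrip(",")         # drop a dangling separator
    return s + "".join(reversed(stack))  # close brackets inside-out

cut = '{"vendor": "Acme", "line_items": [{"description": "Wid'
print(json.loads(repair_truncated_json(cut)))
# {'vendor': 'Acme', 'line_items': [{'description': 'Wid'}]}
```

Note the salvaged value ("Wid") is itself truncated — repair recovers parseability, not the missing content, so treat repaired output as partial data.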

The prefill trick and stop sequences. When using the prefill trick with Anthropic, don't add a stop sequence that could trigger mid-JSON. Also, the response object won't include the { you prefilled — always prepend it manually.

JSON Mode vs. Tool Calling vs. Structured Outputs

LLM APIs offer several mechanisms for obtaining structured, parseable outputs. Understanding the trade-offs between JSON mode, tool calling, and structured output modes helps select the right approach for each extraction and generation use case.

Mechanism            | Schema Enforcement     | Guaranteed Valid | Flexibility | Support
---------------------|------------------------|------------------|-------------|----------------------
JSON mode            | None (just valid JSON) | Valid JSON only  | High        | OpenAI, Groq
Tool calling         | JSON Schema per tool   | Near-guaranteed  | Medium      | OpenAI, Anthropic
Structured outputs   | Strict JSON Schema     | Guaranteed       | Medium      | OpenAI (newer models)
Constrained decoding | Full grammar           | Guaranteed       | Highest     | Open-source backends

JSON mode instructs the LLM to produce valid JSON without enforcing any particular schema. The generated JSON will be syntactically valid and parseable but may contain different keys, missing fields, or unexpected value types on different requests for the same prompt. JSON mode is appropriate when the structure of the output is explicitly described in the prompt and the application can handle minor schema variations, but is insufficient for applications requiring machine-readable outputs with strict schema compliance.

OpenAI's structured outputs feature uses a constrained decoding approach that guarantees the generated JSON exactly matches the provided JSON Schema, with no additional or missing fields and all value types matching their schema declarations. This guarantee eliminates the category of parsing failures where a model produces plausible but schema-invalid JSON, allowing application code to parse and use the output without defensive error handling. The trade-off is that only a subset of JSON Schema features are supported and the constrained decoding process is slightly slower than unconstrained generation.

Prompt design for reliable JSON extraction requires explicit formatting instructions that guide the model toward consistent key naming, value formatting, and null handling. Specifying that missing optional fields should be represented as null rather than omitted, that dates should use ISO 8601 format, and that numeric values should not be quoted as strings reduces the post-processing burden significantly. Few-shot examples of correct JSON output are the most reliable mechanism for establishing format conventions when the schema is complex or ambiguous from description alone.

Error recovery strategies for JSON parsing failures are important because even JSON mode and tool calling occasionally produce malformed outputs under adverse conditions — very long outputs, unusual prompts, or model errors during generation. A robust extraction pipeline implements: primary parsing attempt, secondary retry with explicit error feedback fed back to the model, and fallback to regex-based partial extraction for critical fields if full JSON parsing fails after retries. Tracking parsing failure rates by model, prompt version, and query type identifies systematic issues before they affect a significant fraction of production requests.
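The three-stage pipeline can be sketched generically. Here `call_model` stands in for any provider call, and the fallback regex targets one hypothetical critical field (`name`):

```python
import json
import re
from typing import Callable

def extract_with_recovery(prompt: str,
                          call_model: Callable[[str], str],
                          max_retries: int = 2) -> dict:
    """Stage 1: parse. Stage 2: retry with error feedback. Stage 3: regex fallback."""
    attempt_prompt = prompt
    for _ in range(1 + max_retries):
        raw = call_model(attempt_prompt)
        try:
            return json.loads(raw)                 # stage 1: primary parse
        except json.JSONDecodeError as e:
            attempt_prompt = (                     # stage 2: feed the error back
                f"{prompt}\n\nYour previous output was not valid JSON "
                f"({e.msg} at position {e.pos}): {raw}\n"
                "Return ONLY a corrected JSON object."
            )
    # stage 3: salvage one critical field with a regex
    m = re.search(r'"name"\s*:\s*"([^"]*)"', raw)
    return {"name": m.group(1)} if m else {}

# Simulated flaky model: malformed once, then complies after feedback.
outputs = iter(['Sure! {"name": "Ada"', '{"name": "Ada"}'])
print(extract_with_recovery("Extract the name.", lambda p: next(outputs)))
# {'name': 'Ada'}
```

In production you would also log which stage succeeded, which gives you exactly the per-model, per-prompt-version failure-rate tracking described above.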

Nested JSON schemas with deeply recursive structures or circular references can challenge LLMs because they require correctly tracking parent-child relationships across many key-value pairs. Flattening deeply nested schemas into shallower representations — using dot notation keys like "user.address.city" rather than nested objects — reduces the cognitive complexity for the model and improves extraction accuracy on complex schemas. For schemas that genuinely require deep nesting, extracting nested objects in multiple chained calls (extract the outer object first, then extract nested details) is more reliable than single-pass extraction of the full nested structure.
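A minimal helper for the dot-notation approach (an illustrative sketch, not a library API): ask the model for flat keys, then rebuild the nested object in code where nesting is cheap and reliable:

```python
def unflatten(flat: dict) -> dict:
    """Rebuild nested objects from dot-notation keys like 'user.address.city'."""
    nested: dict = {}
    for key, value in flat.items():
        node = nested
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating dicts as needed
        node[leaf] = value
    return nested

# What you'd ask the model to emit (flat, easy to get right) ...
flat = {"user.name": "Ada", "user.address.city": "London", "user.address.zip": "N1"}
# ... versus the nested shape your application actually wants:
print(unflatten(flat))
# {'user': {'name': 'Ada', 'address': {'city': 'London', 'zip': 'N1'}}}
```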

Streaming JSON responses require specialized parsing strategies because standard JSON parsers require the complete document before returning any results. Incremental JSON parsers like ijson process JSON token-by-token, enabling applications to begin processing array elements or object fields as they arrive in the stream rather than buffering the complete response. For LLM responses returning large arrays of structured objects, streaming with incremental parsing can reduce time-to-first-result by seconds compared to buffering the full response, significantly improving perceived performance for users waiting on results.
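ijson is the usual tool for this; the core idea can also be approximated with the stdlib by running json.JSONDecoder.raw_decode over a growing buffer and popping complete array elements as chunks arrive. This sketch assumes a top-level {"items": [...]} response and uses simulated stream chunks in place of a real provider stream:

```python
import json

def stream_array_items(chunks, array_prefix='{"items": ['):
    """Yield each complete element of a streamed JSON array as soon as it
    can be parsed, instead of buffering the whole response first."""
    decoder = json.JSONDecoder()
    buf, seen_prefix = "", False
    for chunk in chunks:
        buf += chunk
        if not seen_prefix:
            if array_prefix not in buf:
                continue                 # still waiting for the array to open
            buf = buf[buf.index(array_prefix) + len(array_prefix):]
            seen_prefix = True
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                return                   # end of array reached
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break                    # current element still incomplete
            yield item                   # complete element: hand it over now
            buf = buf[end:]

# Simulated stream: element boundaries don't line up with chunk boundaries.
chunks = ['{"items": [{"id": 1}', ', {"id":', ' 2}, {"id": 3}]}']
for item in stream_array_items(chunks):
    print(item)   # each object prints as soon as its closing brace arrives
```

The first element is yielded after the first chunk, before the rest of the response exists — the time-to-first-result win described above.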