Frontier Models

GPT-4o

OpenAI's native multimodal flagship model — processes text, images, and audio in a single model rather than stitched pipelines. 128K context, 2× faster than GPT-4 Turbo, with structured outputs and function calling.



SECTION 01

What makes GPT-4o different

GPT-4o ("o" for omni) is OpenAI's first natively multimodal model — meaning images, audio, and text share the same neural network, rather than being processed by separate models stitched together. The practical effect: GPT-4o can reason about the relationship between text and visual content more naturally than systems where a vision encoder feeds captions to a text model.

Compared to GPT-4 Turbo: GPT-4o is roughly 2× faster at inference, costs 50% less per token, has the same 128K context window, and matches or exceeds GPT-4 Turbo on most benchmarks. There's almost no reason to choose GPT-4 Turbo over GPT-4o for new projects.

Key capabilities: vision (image understanding, OCR, chart reading), structured outputs (JSON schemas enforced at the model level, not via prompting), function calling (parallel tool calls), and fine-tuning (available for domain specialisation).

SECTION 02

Text and chat API basics

pip install openai

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Basic chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain attention mechanisms in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")

# Streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about transformers."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Cheaper, faster option for simple tasks
response = client.chat.completions.create(
    model="gpt-4o-mini",  # ~17× cheaper than gpt-4o on both input and output tokens
    messages=[{"role": "user", "content": "Is this spam? Email: 'Win $1000 now!'"}],
)

SECTION 03

Vision capabilities

import base64
from pathlib import Path

def encode_image(path: str) -> str:
    return base64.b64encode(Path(path).read_bytes()).decode()

# Analyse a local image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what's in this chart and identify the trend."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image('chart.jpg')}",
                    "detail": "high"  # "low" for thumbnails, "high" for detailed analysis
                }
            }
        ]
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

# Multiple images in one request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two screenshots?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
        ]
    }]
)

SECTION 04

Structured outputs

from pydantic import BaseModel
from typing import Optional

class ExtractedPerson(BaseModel):
    name: str
    age: Optional[int]
    occupation: str
    key_facts: list[str]

# Structured outputs — guaranteed valid JSON matching the schema
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract person info: 'Alice Chen, 34, is a machine learning engineer at DeepMind who co-authored 12 papers on reinforcement learning.'"
    }],
    response_format=ExtractedPerson,
)
person = response.choices[0].message.parsed
print(f"{person.name}, {person.age}, {person.occupation}")
print(person.key_facts)

# For complex schemas, nest models — strict structured outputs reject
# untyped dicts, so give each nested object its own explicit schema
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class InvoiceData(BaseModel):
    vendor: str
    total_amount: float
    line_items: list[LineItem]
    due_date: Optional[str]

# GPT-4o enforces the schema at the decoding level — no JSON parsing errors
invoice = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract invoice data from: [invoice text]"}],
    response_format=InvoiceData,
).choices[0].message.parsed

SECTION 05

Function calling

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris and London?"}],
    tools=tools,
    tool_choice="auto",  # model decides whether to call tools
)

# Check if model wants to call tools
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        print(f"Calling {call.function.name}({args})")
        # Execute tool and feed result back
        result = get_weather(**args)  # your implementation
        # Continue conversation with tool results...

GPT-4o supports parallel tool calls: in the Paris/London example above, it returns both get_weather calls in a single response rather than two round-trips. Use parallel_tool_calls=False to disable if your tools have ordering dependencies.
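Closing the loop can be sketched as follows. `get_weather` here is a stand-in implementation; the part that matters is the message shape — append the assistant turn that requested the tools, then one `tool` message per call keyed by `tool_call_id`:

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stand-in; replace with a real weather API call
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

def run_tool_loop(client, messages: list, tools: list) -> str:
    """Call the model, execute any requested tools, feed results back,
    and repeat until the model answers in plain text."""
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    while msg.tool_calls:
        messages.append(msg)  # the assistant turn that requested the tools
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            # Each result is a "tool" message matched to its call by id
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
    return msg.content
```

With parallel tool calls, a single loop iteration handles both the Paris and London calls before the follow-up request.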

SECTION 06

Cost and rate limits

GPT-4o pricing (as of early 2025): $2.50 per 1M input tokens, $10 per 1M output tokens. GPT-4o mini: $0.15 / $0.60. For reference: a typical RAG query with 2000 token context costs ~$0.005 with GPT-4o, ~$0.0003 with GPT-4o-mini.

# Cost estimation before sending
def estimate_cost(messages: list, model: str = "gpt-4o") -> float:
    import tiktoken
    enc = tiktoken.encoding_for_model(model)

    input_tokens = sum(
        len(enc.encode(m["content"])) + 4  # 4 tokens overhead per message
        for m in messages if isinstance(m.get("content"), str)
    )

    prices = {
        "gpt-4o": (2.50, 10.0),        # per 1M tokens
        "gpt-4o-mini": (0.15, 0.60),
    }
    in_price, out_price = prices.get(model, (2.50, 10.0))

    # Estimate output as 25% of input for typical tasks
    est_output = input_tokens * 0.25
    return (input_tokens * in_price + est_output * out_price) / 1_000_000

print(f"Estimated cost: ${estimate_cost(messages):.5f}")

SECTION 07

Gotchas

Structured outputs and streaming don't mix well. The `parse()` helper returns the validated object only after the full response has been generated. If you need streaming, stream the raw response, accumulate the JSON text, and validate it against your schema once the stream completes — or use a two-step approach (stream a plain response first, then re-request with structured outputs if you need guaranteed structure).
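The validate-after-stream workaround can be sketched directly with Pydantic (the schema mirrors `ExtractedPerson` from Section 04; the hard-coded chunks stand in for streamed deltas):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class ExtractedPerson(BaseModel):
    name: str
    age: Optional[int] = None
    occupation: str
    key_facts: list[str]

def validate_streamed_json(chunks: list[str]) -> Optional[ExtractedPerson]:
    """Accumulate streamed text, then validate once the stream is complete."""
    raw = "".join(chunks)
    try:
        return ExtractedPerson.model_validate_json(raw)
    except ValidationError:
        return None  # fall back, e.g. re-request with structured outputs

# In practice `chunks` would come from iterating the stream's deltas
chunks = ['{"name": "Alice Chen", "age": 34, ',
          '"occupation": "ML engineer", "key_facts": []}']
person = validate_streamed_json(chunks)
```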

Vision token costs are high. A high-resolution image uses 765–1105 tokens (detail="high"). Processing a 100-page scanned PDF as images costs dramatically more than OCR + text. Use detail="low" (85 tokens flat) for classification/routing tasks, and reserve detail="high" for tasks requiring reading fine print.
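The tile arithmetic behind those numbers can be sketched as follows — 85 base tokens plus 170 per 512 px tile, after the image is scaled to fit 2048×2048 and its shorter side to 768 px. Treat the constants as OpenAI's published accounting at the time of writing, subject to change:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o vision tokens for one image."""
    if detail == "low":
        return 85  # flat cost regardless of size
    # Scale to fit within 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then scale so the shorter side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024))            # 4 tiles -> 765 tokens
print(image_tokens(640, 480, detail="low"))  # 85 tokens flat
```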

GPT-4o's training data has a knowledge cutoff. For current events, prices, or anything time-sensitive, use function calling to fetch live data rather than relying on the model's knowledge. The model may confidently answer outdated information without indicating uncertainty.

SECTION 08

GPT-4o vs Alternative Models

| Use case | GPT-4o | GPT-4o mini | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|---|
| General chat & reasoning | Excellent | Good (cheaper) | Excellent | Excellent |
| Vision & image understanding | Excellent | Good | Excellent | Excellent |
| Code generation | Excellent | Very good | Excellent | Good |
| Long context (>100K tokens) | Good (128K) | Good (128K) | Excellent (200K) | Excellent (1M+) |
| Cost per 1M input tokens | $2.50 | $0.15 | $3.00 | $1.25 |

GPT-4o's native multimodal architecture (audio, vision, and text processed in a single model pass) gives it lower latency on multimodal tasks compared to models that pipeline separate encoders. For applications combining voice, image, and text inputs, this is a significant practical advantage. For pure text workloads, the quality difference between top-tier models is small enough that cost and API reliability often drive the decision more than benchmark scores. Run your own evals on domain-representative data before committing to a primary provider.
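A minimal version of that advice, sketched here with an exact-substring scorer and your own labeled (prompt, expected) pairs — real evals usually need task-specific grading, but the loop shape is the same:

```python
def run_eval(client, model: str, cases: list[dict]) -> float:
    """Score a model on labeled cases; each case is
    {"prompt": ..., "expected": ...} from your own domain data."""
    correct = 0
    for case in cases:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # reduce run-to-run variance while grading
        )
        answer = response.choices[0].message.content.strip().lower()
        if case["expected"].lower() in answer:
            correct += 1
    return correct / len(cases)

# Compare candidates on the same cases before committing:
# for m in ["gpt-4o", "gpt-4o-mini"]:
#     print(m, run_eval(client, m, cases))
```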

GPT-4o's native multimodality represents a fundamental architectural departure from earlier GPT-4 variants that used separate vision encoders bolted onto a text-only backbone. By training end-to-end on interleaved sequences of text, image, and audio tokens, GPT-4o develops richer cross-modal representations — the model can reason about the relationship between what is spoken and what is shown in a scene simultaneously, rather than processing each modality in isolation and fusing at a late stage.

The real-time audio capabilities of GPT-4o enable low-latency voice applications that were impractical with the previous pipeline of speech-to-text → LLM → text-to-speech. This three-step pipeline introduced 1–3 seconds of cumulative latency and lost prosodic information at each conversion boundary. GPT-4o's integrated audio processing collapses this pipeline, preserving tone and emotional cues from the speaker's voice and enabling natural turn-taking in conversational applications.

When evaluating GPT-4o for production use, consider the token cost structure carefully. Vision tokens are priced per tile of the image at high-resolution, so sending large images without downscaling can consume significantly more tokens than expected. For applications processing hundreds of images per hour, implementing client-side image resizing to the minimum resolution needed for the task can reduce costs by 60–80% with negligible quality impact for most document and UI analysis workloads.
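A resizing step of that kind might look like the following, assuming Pillow is available. The 768 px long-side default is a pragmatic choice for document and UI screenshots, not an OpenAI requirement — tune it per task:

```python
import base64
import io

from PIL import Image  # pip install pillow

def resize_and_encode(path: str, max_side: int = 768) -> str:
    """Downscale so the longest side is <= max_side, then base64-encode as JPEG."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio, only shrinks
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return base64.b64encode(buf.getvalue()).decode()

# Drop-in replacement for the raw encode_image() helper from Section 03:
# {"url": f"data:image/jpeg;base64,{resize_and_encode('chart.jpg')}"}
```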