Frontier Models

o3 / o4-mini

OpenAI's reasoning model family — spends extra 'thinking' tokens before answering, achieving state-of-the-art results on math, coding, and complex reasoning. o4-mini offers 80% of o3's quality at 10× lower cost.

Extended thinking tokens · #1 in math & coding · o4-mini 10× cheaper

SECTION 01

What reasoning models are

Reasoning models (o1, o3, o4-mini) are trained to spend tokens "thinking" before producing their final answer. The model generates a private chain-of-thought — a scratchpad where it works through the problem step-by-step — then produces a polished final answer. Users see the final answer; the thinking tokens are consumed internally (and charged for).

This is fundamentally different from asking a standard model to "think step by step" (chain-of-thought prompting). With standard models, the visible reasoning IS the computation — there's no separate thinking phase. With o-series models, the thinking is a trained behaviour optimised specifically for reasoning quality through reinforcement learning on verifiable tasks.

The result: dramatically better performance on problems requiring planning, logical deduction, mathematics, and multi-step code generation. On competition math (AIME) and competitive programming, o3 scores far above GPT-4o. For simple factual questions or creative writing, the difference is minimal and the cost premium isn't worth it.

SECTION 02

When to use o3 vs o4-mini vs GPT-4o

Use GPT-4o for: conversational applications, text generation, vision tasks, structured extraction, anything latency-sensitive, and tasks that don't require deep reasoning. It's several times cheaper than o3 per token (roughly an order of magnitude per task once o3's thinking tokens are counted) and much faster.

Use o4-mini for: coding tasks (bug fixing, code generation), technical question answering, data analysis, math, and situations where GPT-4o gives wrong answers but the task isn't complex enough to justify o3's cost. o4-mini hits ~80% of o3's accuracy at ~10% of the cost. It's the sweet spot for most reasoning tasks.

Use o3 for: hard research problems, competition-level math, complex multi-step code generation, and agentic tasks where the model must plan long sequences of actions correctly. The higher cost is justified when wrong answers have high downstream costs (incorrect code gets deployed, wrong scientific analysis informs decisions).
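The guidance above can be condensed into a small router. A sketch with hypothetical names (ROUTES and route are illustrative, not part of any SDK) and task categories you would tune against your own evals:

```python
# Illustrative routing table mirroring the guidance above; the categories
# and model choices are assumptions to tune, not an official guide.
ROUTES = {
    "chat":       ("gpt-4o", None),       # conversational, latency-sensitive
    "extraction": ("gpt-4o", None),       # structured extraction, vision
    "coding":     ("o4-mini", "medium"),  # bug fixing, code generation
    "math":       ("o4-mini", "medium"),  # escalate to o3 if answers are wrong
    "research":   ("o3", "high"),         # hard multi-step problems
    "agentic":    ("o3", "medium"),       # long action sequences
}

def route(task_type):
    """Return (model, reasoning_effort); effort is None for non-reasoning models."""
    return ROUTES.get(task_type, ("gpt-4o", None))
```

The default falls through to GPT-4o so that unrecognised task types never pay the reasoning premium by accident.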

from openai import OpenAI
client = OpenAI()

# o4-mini for most reasoning tasks
response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Solve: find all integer solutions to x² - 5x + 6 = 0"}],
)
print(response.choices[0].message.content)

SECTION 03

Reasoning effort levels

from openai import OpenAI
client = OpenAI()

# reasoning_effort controls how many thinking tokens to spend
# "low" = fast/cheap, "medium" = balanced, "high" = best quality

# Low effort — fast, cheaper, good for straightforward reasoning
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    reasoning_effort="low",
)

# High effort — slower, more expensive, for hard problems
response = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": '''Debug this Python function that's supposed to find the longest
        palindromic substring but fails on edge cases:

        def longest_palindrome(s):
            if not s: return ""
            start, max_len = 0, 1
            for i in range(len(s)):
                for l, r in [(i,i), (i,i+1)]:
                    while l >= 0 and r < len(s) and s[l] == s[r]:
                        if r - l + 1 > max_len:
                            start, max_len = l, r - l + 1
                        l -= 1; r += 1
            return s[start:start+max_len]
        '''
    }],
    reasoning_effort="high",
)
print(response.choices[0].message.content)
print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
# completion_tokens counts reasoning + visible output; subtract to get the visible part
print(f"Visible output tokens: {response.usage.completion_tokens - response.usage.completion_tokens_details.reasoning_tokens}")

SECTION 04

Tool use with reasoning models

import json
from openai import OpenAI

client = OpenAI()

# o3/o4-mini support tool use — the model reasons about WHEN and HOW to use tools
tools = [{
    "type": "function",
    "function": {
        "name": "python_interpreter",
        "description": "Execute Python code. Use for calculations, data processing, testing hypotheses.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"},
            },
            "required": ["code"]
        }
    }
}]

messages = [{
    "role": "user",
    "content": "Find all prime numbers between 1000 and 1100 and calculate their sum."
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=messages,
    tools=tools,
    reasoning_effort="medium",
)

# o4-mini will decide to write and execute Python rather than computing by hand
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        print(f"Code to execute:\n{args['code']}")
        # Execute and feed result back...

# Key difference from GPT-4o: o-series models reason about tool strategy
# before calling — they're less likely to call tools unnecessarily
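The "execute and feed result back" step follows the standard Chat Completions tool protocol: echo the assistant turn with its tool_calls, then append one "role": "tool" message per call before re-invoking the model. A minimal sketch, with the reply shown as a plain dict and feed_tool_results / executor as hypothetical names (the real SDK returns objects with attribute access):

```python
import json

def feed_tool_results(messages, assistant_msg, executor):
    """Append the assistant turn and one tool-result message per call.

    assistant_msg: dict form of the model reply (content + tool_calls);
    executor(code) -> str is whatever sandboxed Python runner you use.
    """
    messages.append({
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg["tool_calls"],
    })
    for call in assistant_msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],   # must match the model's call id
            "content": executor(args["code"]),
        })
    return messages
```

After this, pass the extended messages list back into chat.completions.create so the model can reason over the tool output.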

SECTION 05

Streaming reasoning tokens

# Stream both reasoning and final answer tokens
stream = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    reasoning_effort="medium",
    stream=True,
    stream_options={"include_usage": True},
)

reasoning_text = []
answer_text = []

for chunk in stream:
    if not chunk.choices:
        if chunk.usage:
            print(f"\nTotal: {chunk.usage.total_tokens} tokens")
        continue

    delta = chunk.choices[0].delta

    # Note: reasoning content is NOT exposed in the API by default
    # You see only the final answer in delta.content
    if delta.content:
        answer_text.append(delta.content)
        print(delta.content, end="", flush=True)

# Note: OpenAI does not expose the raw reasoning trace for o3/o4-mini.
# Anthropic's Claude extended thinking does return explicit thinking blocks.

OpenAI does not expose the full internal reasoning trace for o3/o4-mini. If visible chain-of-thought is important for your use case (debugging, user trust, auditing), consider Claude's extended thinking which shows the thinking blocks explicitly.
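If you do adopt Claude's extended thinking, responses arrive as a list of content blocks in which thinking blocks precede the text blocks. A sketch of separating the two, with the blocks shown as plain dicts for illustration (split_thinking is a hypothetical helper, not an SDK function):

```python
def split_thinking(content_blocks):
    """Separate Claude-style thinking blocks from final-answer text blocks.

    Each block is a dict with a "type" of "thinking" (carrying a "thinking"
    string) or "text" (carrying a "text" string).
    """
    thinking = [b["thinking"] for b in content_blocks if b["type"] == "thinking"]
    answer = [b["text"] for b in content_blocks if b["type"] == "text"]
    return "\n\n".join(thinking), "".join(answer)
```

Log the thinking string during development for auditing, and show users only the answer string.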

SECTION 06

Cost analysis and break-even

# o3 vs o4-mini vs GPT-4o pricing (early 2025, approximate)
PRICES = {
    "gpt-4o":   {"input": 2.50,   "output": 10.0,  "reasoning": 0},
    "o4-mini":  {"input": 1.10,   "output": 4.40,  "reasoning": 1.10},
    "o3":       {"input": 10.0,   "output": 40.0,  "reasoning": 10.0},
}
# All prices per 1M tokens

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  reasoning_tokens: int = 0) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] +
            output_tokens * p["output"] +
            reasoning_tokens * p["reasoning"]) / 1_000_000

# Typical coding task: 500 input, 300 output, 2000 reasoning tokens
for model in ["gpt-4o", "o4-mini", "o3"]:
    r = 2000 if model != "gpt-4o" else 0
    cost = estimate_cost(model, 500, 300, r)
    print(f"{model}: ${cost:.5f} per task")
# gpt-4o:   $0.00425
# o4-mini:  $0.00407
# o3:       $0.03700

# Note: with these rates o4-mini is roughly price-competitive with GPT-4o
# even after thinking tokens; o3 is ~9x more per task, so reserve it for
# work where a wrong answer carries real downstream cost

SECTION 07

Gotchas

Don't ask reasoning models to "think step by step". They already do this internally. Adding chain-of-thought instructions wastes tokens and may actually reduce quality by constraining the model's native reasoning approach. Just ask the question directly and clearly.

Reasoning tokens are charged but not shown. A request that "only" generates 200 output tokens may consume 3000 reasoning tokens internally — and you're billed for all of them. Monitor usage.completion_tokens_details.reasoning_tokens to understand actual costs, especially with reasoning_effort="high".
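A quick sanity check on that gotcha: the real output-side cost as a multiple of what the visible answer alone suggests. Pure arithmetic, no API call (true_cost_multiplier is an illustrative name):

```python
def true_cost_multiplier(visible_tokens, reasoning_tokens):
    """Ratio of actual billed output tokens to the visible output tokens.

    Reasoning tokens are billed at the same rate as visible output tokens.
    """
    if visible_tokens <= 0:
        raise ValueError("need at least one visible output token")
    return (visible_tokens + reasoning_tokens) / visible_tokens

# The example above: 200 visible + 3000 reasoning tokens -> 16x the naive estimate
```

Feed the reasoning_tokens value from usage.completion_tokens_details into this kind of check when reviewing bills.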

Latency is much higher than GPT-4o. o3 with high reasoning effort can take 30–120 seconds for complex problems. For user-facing applications, show a progress indicator, use streaming, and set a reasonable timeout. o4-mini with low effort gets closer to GPT-4o latency.

SECTION 08

Reasoning Model Selection Guide

Task Type                       | Best Model             | Reasoning Effort | Notes
Simple Q&A, summarisation       | GPT-4o / Claude Sonnet | None needed      | Reasoning models are overkill here
Moderate coding / analysis      | o4-mini                | Low              | 3–5× cheaper than o3 at full effort
Hard math, competition problems | o3                     | Medium           | Good accuracy/cost balance
Research-level problems         | o3                     | High             | Budget-aware; set max thinking tokens
Agentic multi-step tasks        | o3 or o4-mini          | Medium           | Tool use + reasoning is the sweet spot

The break-even analysis for reasoning models hinges on task complexity. For tasks where GPT-4o already achieves 90%+ accuracy, reasoning models add cost without quality gain. For tasks where GPT-4o is at 50–70% accuracy (hard math, complex code debugging, multi-step planning), o3 typically lifts accuracy to 85–95% — a quality improvement that often justifies the 5–10× higher per-call cost, especially if the task outcome has high business value. Instrument your pipeline to measure per-task-type accuracy so you can route to reasoning models selectively rather than universally.
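One way to make that routing decision concrete is to compare cost per correct answer rather than cost per call. A sketch, using the paragraph's accuracy ranges as illustrative inputs (the dollar figures are the example costs from Section 06, not quoted prices):

```python
def cost_per_correct(cost_per_call, accuracy):
    """Expected spend to obtain one correct answer, assuming independent attempts."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_call / accuracy

# Hard-task illustration: GPT-4o at $0.004/call, 60% accuracy
# vs o3 at $0.037/call, 90% accuracy
hard_gpt4o = cost_per_correct(0.004, 0.60)
hard_o3 = cost_per_correct(0.037, 0.90)
```

Note that even at 90% accuracy o3 can still cost more per correct answer than a cheaper model; the premium pays off only when a wrong answer has downstream cost beyond the API bill, which is exactly the instrumentation argument above.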

When the thinking trace is visible, streaming it offers debugging insights as well as user-experience benefits. (OpenAI exposes at most a summary of o3's reasoning; Claude's extended thinking streams the full thinking blocks.) The trace shows which sub-problems the model decomposed the question into, which approaches it tried and abandoned, and how it reached its final answer. Use it in development to diagnose why the model gave a wrong answer; the error almost always appears as a flawed reasoning step in the trace, which points directly to what additional context or constraint needs to be added to the prompt.

When budgeting for reasoning models, account for thinking tokens, which are billed at the same rate as output tokens but never shown in the visible response. A query with 200 visible output tokens may have consumed 2,000 thinking tokens, making the real cost 11× higher than a naive count suggests. Use usage.completion_tokens_details.reasoning_tokens in the API response to track actual consumption and include it in cost monitoring from day one.

For agentic tasks with reasoning models, prefer many small tool-use steps over few large steps. Reasoning models benefit from tight observe-reason-act cycles where each tool result updates the model's working context before the next action. Batching multiple actions into a single step deprives the model of intermediate feedback that would have changed its reasoning. Design your tool interfaces to return focused, actionable results rather than large data dumps that the model must parse in isolation.
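That observe-reason-act cycle can be sketched as a loop with the model and tool stubbed out. run_agent, call_model, and run_tool are hypothetical names; a real implementation would put an o4-mini API call behind call_model and a sandbox behind run_tool:

```python
def run_agent(task, call_model, run_tool, max_steps=8):
    """One small tool action per cycle; each result is observed before the next.

    call_model(messages) returns either {"answer": ...} when done or
    {"tool": name, "args": ...} for the single next action.
    run_tool(name, args) returns a short, focused result string.
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages)
        if "answer" in action:
            return action["answer"]
        result = run_tool(action["tool"], action["args"])
        # Feed the result back before the model decides its next step
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted without a final answer")
```

Keeping the loop to one action per iteration is the point: the model re-reasons over every intermediate result instead of committing to a batched plan it cannot revise.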