In code mode, the LLM writes and executes Python code against a typed SDK instead of dispatching JSON tool calls. The code is the plan: composable, compact, and far cheaper in context tokens than per-call tool dispatch.
Standard function calling: the model outputs `{"name": "search", "args": {"query": "..."}}`. Your code parses this JSON, routes it to the right function, validates the arguments, runs the function, and returns the result. Each tool call is a round-trip.
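The dispatch side of that round-trip can be sketched as follows (the `search` stub and `TOOLS` registry are illustrative, not part of any particular framework):

```python
import json

# Illustrative tool registry: maps tool names to Python functions
def search(query: str) -> list[str]:
    return [f"Result for {query}"]

TOOLS = {"search": search}

def dispatch(tool_call_json: str) -> str:
    """One round-trip: parse the model's JSON, validate, route, run."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    result = func(**call["args"])
    return json.dumps({"result": result})

print(dispatch('{"name": "search", "args": {"query": "LLM benchmarks 2024"}}'))
```

Every entry in this loop costs a full model turn, which is exactly the overhead code mode removes.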
Code mode: the model outputs Python code that calls your API directly. Instead of multiple JSON dispatch round-trips, the model writes a complete program:
```python
results = search("LLM benchmarks 2024")
filtered = [r for r in results if r.date > "2024-01-01"]
summary = summarise(filtered[:3])
final_answer(summary)
```
This is more powerful: the model can use loops, conditionals, list comprehensions, and intermediate variables without a round-trip per operation. It's also more compact — one code block does what would take 5+ JSON tool calls. The SmolAgents and Code Interpreter patterns both use this approach.
In tool dispatch, the model's "plan" is implicit — it's revealed one tool call at a time. In code mode, the model's plan is the code itself, making it explicit, readable, and auditable before execution.
This has three key benefits: transparency (you can read the code and understand what the agent intends to do before running it), composability (Python operations can be combined in any way — the model isn't limited to single tool calls), and efficiency (complex logic that would require 10+ JSON tool calls becomes a 20-line function).
The tradeoff is security: running arbitrary LLM-generated code is more dangerous than running a pre-defined list of tools. Every code-mode deployment needs a sandboxing strategy.
```python
import re
import sys
from io import StringIO

import anthropic

client = anthropic.Anthropic()

# The SDK the agent can call (exposed to the agent)
class AgentSDK:
    '''Typed SDK available to the agent. Use as `sdk` in generated code.'''

    def search(self, query: str) -> list[dict]:
        '''Search the web. Returns list of {title, url, snippet}.'''
        return [{"title": f"Result for {query}",
                 "url": "https://example.com", "snippet": "..."}]

    def calculate(self, expression: str) -> float:
        '''Safely evaluate a math expression.'''
        import ast, operator
        ops = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv}

        def safe_eval(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return ops[type(node.op)](safe_eval(node.left), safe_eval(node.right))
            raise ValueError(f"Unsupported: {node}")

        return safe_eval(ast.parse(expression, mode="eval").body)

    def final_answer(self, answer: str) -> str:
        print(answer)  # echo so the agent loop can read the answer from stdout
        return answer

sdk = AgentSDK()

def execute_code(code: str) -> tuple[str, str | None]:
    '''Execute code in a restricted namespace. Returns (stdout, error).'''
    # Capture stdout for the duration of the exec
    old_stdout = sys.stdout
    sys.stdout = buffer = StringIO()
    error = None
    try:
        exec(code, {"sdk": sdk, "__builtins__": {
            "print": print, "len": len, "range": range, "str": str,
            "int": int, "float": float, "list": list, "dict": dict}})
    except Exception as e:
        error = str(e)
    finally:
        sys.stdout = old_stdout
    return buffer.getvalue(), error
```
````python
SYSTEM = '''You are a code-mode agent. To answer the user, write Python code using the `sdk` object.
Available methods: sdk.search(query), sdk.calculate(expression), sdk.final_answer(answer)
Output code in a ```python block. The code will be executed and its output returned to you.'''

def code_agent(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(5):
        resp = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024,
            system=SYSTEM, messages=messages,
        )
        text = resp.content[0].text
        # Extract the first ```python block from the response
        match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
        if not match:
            return text  # no code block: treat the reply as a direct answer
        code = match.group(1)
        stdout, error = execute_code(code)
        messages.append({"role": "assistant", "content": text})
        if error:
            messages.append({"role": "user", "content": f"Error: {error}"})
        else:
            messages.append({"role": "user", "content": f"Output: {stdout}"})
        if "final_answer" in code:
            return stdout
    return "Max steps reached"

print(code_agent("What is 1234 * 5678?"))
````
The quality of code-mode agents depends heavily on how well the SDK is documented. The model generates code based on the docstrings and type hints in the system prompt:
```python
SDK_DOCS = '''
Available SDK (import as `sdk`):

sdk.search(query: str) -> list[dict]
    Search the web. Returns [{"title": str, "url": str, "snippet": str}].
    Use for: current information, facts, news.

sdk.database_query(sql: str) -> list[dict]
    Execute a read-only SQL query on the product database.
    Tables: products(id, name, price, stock), orders(id, user_id, product_id, status)
    IMPORTANT: SELECT only — no INSERT/UPDATE/DELETE.

sdk.send_notification(user_id: str, message: str) -> bool
    Send a push notification. Returns True if successful.
    Rate limit: max 1 per user per hour.

sdk.final_answer(answer: str | dict) -> None
    Call when you have the answer. Pass a string or dict.
    ALWAYS call this to end the agent loop.
'''
```
Include usage examples, return types, rate limits, and explicit warnings about what NOT to do. The model follows the docs precisely.
Running LLM-generated code is dangerous without sandboxing. Levels of isolation from least to most secure:
Restricted builtins (minimal protection): pass a limited `__builtins__` dict to `exec()`. This blocks the obvious attacks, but determined code can escape. Never rely on it alone in production with untrusted models.
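To see why restricted builtins alone are insufficient: any object in the namespace exposes the full loaded class hierarchy through its attributes, so a well-known escape reaches every class (and from there, typically `os`) without importing anything:

```python
# Even with an empty __builtins__, a bare tuple literal is enough
# to walk the type hierarchy and enumerate every loaded class.
payload = "().__class__.__bases__[0].__subclasses__()"
classes = eval(payload, {"__builtins__": {}})
# From this list, attacker code can usually reach os.system, e.g. via a
# class whose function globals still contain the real builtins.
print(len(classes) > 0)  # the "sandbox" exposed the whole class tree
```

This is why restricted builtins are a speed bump, not a boundary.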
Docker container: run the code executor in a Docker container with no network access, read-only filesystem, and resource limits. The agent's calls to your SDK go through a secure channel:
```yaml
# docker-compose.yml
services:
  code-executor:
    image: python:3.11-slim
    read_only: true
    network_mode: none
    mem_limit: 256m
    cpus: 0.5
    volumes:
      - /tmp/agent-sandbox:/tmp:rw
```
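On the host side, one way to invoke such a container per execution is via `docker run` directly; this is a hedged sketch (the image name and limits are illustrative, matching the compose file above):

```python
import subprocess, tempfile, os

def docker_cmd(path: str) -> list[str]:
    """Build the docker invocation: throwaway container, no network,
    read-only filesystem, memory/CPU caps (limits are illustrative)."""
    return ["docker", "run", "--rm",
            "--network=none", "--read-only",
            "--memory=256m", "--cpus=0.5",
            "-v", f"{path}:/sandbox/main.py:ro",
            "python:3.11-slim", "python3", "/sandbox/main.py"]

def run_in_docker(code: str, timeout: int = 15) -> str:
    """Write the code to a temp file, run it in the container, return stdout."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(docker_cmd(path), capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout
    finally:
        os.unlink(path)
```

`--rm` discards the container after each run, which also gives you the state-rotation property discussed later.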
E2B (cloud sandbox): managed cloud sandboxes specifically designed for LLM code execution. Pay-per-use, no infrastructure management, with file I/O and network access controls. Best for production.
WebAssembly (Pyodide): run Python in a Wasm sandbox in the browser or a confined runtime. No network access by default, strong isolation, but limited library support.
Complex data processing: filtering, sorting, aggregating lists of results is natural in Python code but awkward with JSON tool calls (would require multiple sequential calls).
Conditional logic: "search for X; if no results, search for Y instead; if still nothing, use Z" is two lines of Python but requires multiple agent loop iterations with tool dispatch.
Batch operations: process 50 items with a list comprehension in one code block vs 50 sequential tool calls.
Calculations: arithmetic, string manipulation, date calculations — native in Python, awkward via JSON tools.
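A sketch of what agent-generated code for the fallback and batch patterns might look like (`_StubSDK` is a stand-in for the `AgentSDK` defined earlier, added only so the snippet runs on its own):

```python
class _StubSDK:
    """Minimal stand-in for the AgentSDK, just to make the sketch runnable."""
    def search(self, query: str) -> list[dict]:
        # Simulate the first query returning nothing
        return [] if "benchmarks 2024" in query else [{"title": f"Result for {query}"}]
    def final_answer(self, answer: str) -> str:
        return answer

sdk = _StubSDK()

# Fallback chain: one code block replaces several dispatch round-trips.
results = sdk.search("LLM benchmarks 2024")
if not results:
    results = sdk.search("language model evaluation 2024")

# Batch operation: one comprehension instead of one tool call per item.
titles = [r["title"] for r in results[:50]]
answer = sdk.final_answer(", ".join(titles))
print(answer)
```

With JSON tool dispatch, the same behaviour would need a model turn per search attempt and per item.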
Use tool dispatch when: security requirements preclude running arbitrary code, tools have important side effects (sending emails, writing to DB) that should be explicitly approved, or you're using models that aren't well-tuned for code generation.
Code execution latency adds up. Each code block requires parsing, sandboxing, execution, and result formatting. If the agent generates 10 code blocks per query, this overhead is significant. Cache SDK call results aggressively.
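One way to cache is to memoise the SDK wrapper itself; a sketch using `functools.lru_cache` (the `_raw_search` stand-in is illustrative):

```python
from functools import lru_cache

def _raw_search(query: str) -> list[str]:
    # Stand-in for the real (slow, billable) SDK call.
    return [f"Result for {query}"]

@lru_cache(maxsize=256)
def cached_search(query: str) -> tuple:
    """Memoised wrapper around the search call; returns a tuple
    so the cached value is immutable and hashable."""
    return tuple(_raw_search(query))

cached_search("LLM benchmarks 2024")    # misses the cache, runs the call
cached_search("LLM benchmarks 2024")    # served from the cache
print(cached_search.cache_info().hits)  # → 1
```

Returning immutable tuples matters here: a cached list could be mutated by one code block and silently corrupt later reads.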
The model can import libraries you didn't intend. Even with restricted builtins, __import__ might be available. Explicitly block imports in your SDK docs and in the execution environment.
LLMs can write infinite loops. Always set a timeout on code execution: signal.alarm(30) or a threading timeout. An infinite loop in exec() will block your server thread indefinitely without a timeout.
SDK documentation is your prompt. The model writes code against the docstrings you provide. Missing a return type or an example will cause the model to guess, often incorrectly. Treat SDK docs with the same care as system prompts.
| Sandboxing Approach | Isolation Level | Overhead | Best For |
|---|---|---|---|
| subprocess (no sandbox) | None | None | Trusted internal tools only |
| RestrictedPython | Python AST-level | Low | Simple scripts, no system calls |
| Docker container | Process + filesystem | Medium (500ms+ startup) | General-purpose untrusted code |
| gVisor / Firecracker | Kernel-level VM | Medium-high | High-security multi-tenant |
| WASM (e.g. Pyodide) | Full browser/runtime sandbox | High (cold start) | Client-side or ultra-secure server |
```python
import subprocess, tempfile, os, textwrap

def run_sandboxed(code: str, timeout: int = 10) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    # Wrap the code so its stdout is captured; errors go to the real stderr.
    body = textwrap.indent(code, "    ")
    wrapper = f"""\
import sys, io
_stdout = io.StringIO()
sys.stdout = _stdout
try:
{body}
except Exception as e:
    print(f"ERROR: {{e}}", file=sys.__stderr__)
finally:
    sys.stdout = sys.__stdout__
    print(_stdout.getvalue(), end="")
"""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(wrapper)
        tmp = f.name
    try:
        result = subprocess.run(
            ["python3", "-u", tmp],
            capture_output=True, text=True, timeout=timeout,
            env={**os.environ, "PYTHONPATH": ""},  # strip custom import paths
        )
        return {"stdout": result.stdout, "stderr": result.stderr,
                "returncode": result.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "Execution timed out", "returncode": -1}
    finally:
        os.unlink(tmp)
```
For production code-mode agents, always use Docker or a proper sandbox rather than a bare subprocess. Set resource limits (`--memory=256m --cpus=0.5 --network=none`) to prevent runaway memory use, CPU exhaustion, and data exfiltration. Rotate containers after each execution to prevent state persisting across sessions.