Tool Use

Code Mode

LLM writes and executes Python code against a typed SDK instead of dispatching JSON tool calls — code is the plan, composable and compact, using far fewer context tokens than per-call tool dispatch.



SECTION 01

Code mode vs JSON tool dispatch

Standard function calling: the model outputs {"name": "search", "args": {"query": "..."}}. Your code parses this JSON, routes to the right function, handles argument validation, runs the function, and returns the result. Each tool call is a round-trip.
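The dispatch side can be sketched in a few lines (the `TOOLS` registry and `dispatch` helper here are hypothetical, not a specific library's API):

```python
import json

# Hypothetical tool registry: tool name -> callable.
TOOLS = {
    "search": lambda query: [{"title": f"Result for {query}", "url": "https://example.com"}],
}

def dispatch(tool_call_json: str) -> str:
    """One round-trip: parse the model's JSON, validate the name, route, return the result."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(fn(**call["args"]))

print(dispatch('{"name": "search", "args": {"query": "LLM benchmarks 2024"}}'))
```

Every one of these round-trips costs a full model turn; that overhead is what code mode collapses.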

Code mode: the model outputs Python code that calls your API directly. Instead of multiple JSON dispatch round-trips, the model writes a complete program:

results = search("LLM benchmarks 2024")
filtered = [r for r in results if r.date > "2024-01-01"]
summary = summarise(filtered[:3])
final_answer(summary)

This is more powerful: the model can use loops, conditionals, list comprehensions, and intermediate variables without a round-trip per operation. It's also more compact — one code block does what would take 5+ JSON tool calls. The SmolAgents and Code Interpreter patterns both use this approach.

SECTION 02

The code-as-plan philosophy

In tool dispatch, the model's "plan" is implicit — it's revealed one tool call at a time. In code mode, the model's plan is the code itself, making it explicit, readable, and auditable before execution.

This has three key benefits: transparency (you can read the code and understand what the agent intends to do before running it), composability (Python operations can be combined in any way — the model isn't limited to single tool calls), and efficiency (complex logic that would require 10+ JSON tool calls becomes a 20-line function).

The tradeoff is security: running arbitrary LLM-generated code is more dangerous than running a pre-defined list of tools. Every code-mode deployment needs a sandboxing strategy.

SECTION 03

Implementing a code executor

import anthropic
import re
import sys
from io import StringIO

client = anthropic.Anthropic()

# The SDK the agent can call (exposed to the agent)
class AgentSDK:
    '''Typed SDK available to the agent. Import as `sdk` in generated code.'''

    def search(self, query: str) -> list[dict]:
        '''Search the web. Returns list of {title, url, snippet}.'''
        return [{"title": f"Result for {query}", "url": "https://example.com", "snippet": "..."}]

    def calculate(self, expression: str) -> float:
        '''Safely evaluate a math expression.'''
        import ast, operator
        ops = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv}
        def safe_eval(node):
            # ast.Constant replaces the deprecated ast.Num (Python 3.8+)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return ops[type(node.op)](safe_eval(node.left), safe_eval(node.right))
            raise ValueError(f"Unsupported: {ast.dump(node)}")
        return safe_eval(ast.parse(expression, mode='eval').body)

    def final_answer(self, answer: str) -> str:
        print(answer)  # echo so the executor's captured stdout carries the answer back
        return answer

sdk = AgentSDK()

def execute_code(code: str) -> tuple[str, str | None]:
    '''Execute code in a restricted namespace. Returns (stdout, error).'''
    # Capture stdout
    old_stdout = sys.stdout
    sys.stdout = buffer = StringIO()
    error = None
    try:
        safe_builtins = {"print": print, "len": len, "range": range,
                         "str": str, "int": int, "float": float,
                         "list": list, "dict": dict}
        exec(code, {"sdk": sdk, "__builtins__": safe_builtins})
    except Exception as e:
        error = str(e)
    finally:
        sys.stdout = old_stdout
    return buffer.getvalue(), error

SYSTEM = '''You are a code-mode agent. To answer the user, write Python code using the `sdk` object.
Available methods: sdk.search(query), sdk.calculate(expression), sdk.final_answer(answer)
Output code in a ```python block. The code will be executed and output returned to you.'''

def code_agent(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(5):
        resp = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024,
            system=SYSTEM, messages=messages
        )
        text = resp.content[0].text
        # Extract code block
        match = re.search(r'```python\s*(.*?)```', text, re.DOTALL)
        if not match:
            return text  # No code — direct answer
        code = match.group(1)
        stdout, error = execute_code(code)
        messages.append({"role": "assistant", "content": text})
        if error:
            messages.append({"role": "user", "content": f"Error: {error}"})
        else:
            messages.append({"role": "user", "content": f"Output: {stdout}"})
            if "final_answer" in code:
                return stdout
    return "Max steps reached"

print(code_agent("What is 1234 * 5678?"))

SECTION 04

Typed SDK for the agent

The quality of code-mode agents depends heavily on how well the SDK is documented. The model generates code based on the docstrings and type hints in the system prompt:

SDK_DOCS = '''
Available SDK (import as `sdk`):

sdk.search(query: str) -> list[dict]
    Search web. Returns [{"title": str, "url": str, "snippet": str}].
    Use for: current information, facts, news.

sdk.database_query(sql: str) -> list[dict]
    Execute a read-only SQL query on the product database.
    Tables: products(id, name, price, stock), orders(id, user_id, product_id, status)
    IMPORTANT: SELECT only — no INSERT/UPDATE/DELETE.

sdk.send_notification(user_id: str, message: str) -> bool
    Send a push notification. Returns True if successful.
    Rate limit: max 1 per user per hour.

sdk.final_answer(answer: str | dict) -> None
    Call when you have the answer. Pass a string or dict.
    ALWAYS call this to end the agent loop.
'''

Include usage examples, return types, rate limits, and explicit warnings about what NOT to do. The model follows the docs precisely.
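One way to keep the docs block and the code in sync (a sketch; `render_sdk_docs` is hypothetical, and the class below is a trimmed copy of the earlier AgentSDK) is to render it mechanically from signatures and docstrings:

```python
import inspect

# Trimmed copy of the earlier AgentSDK, for illustration.
class AgentSDK:
    def search(self, query: str) -> list[dict]:
        '''Search the web. Returns list of {title, url, snippet}.'''

    def final_answer(self, answer: str) -> str:
        '''Call when you have the answer.'''

def render_sdk_docs(sdk_cls) -> str:
    """Build the system-prompt docs block from each method's signature and docstring."""
    lines = ["Available SDK (import as `sdk`):", ""]
    for name, fn in inspect.getmembers(sdk_cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        params = list(inspect.signature(fn).parameters.values())[1:]  # drop self
        lines.append(f"sdk.{name}({', '.join(str(p) for p in params)})")
        if fn.__doc__:
            lines.append(f"    {fn.__doc__.strip()}")
        lines.append("")
    return "\n".join(lines)

print(render_sdk_docs(AgentSDK))
```

Generated docs can't drift out of date the way a hand-maintained prompt string can, though you still add rate limits and warnings by hand.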

SECTION 05

Sandboxing code execution

Running LLM-generated code is dangerous without sandboxing. Levels of isolation from least to most secure:

Restricted builtins (minimal protection): pass a limited __builtins__ dict to exec(). Prevents obvious attacks but determined code can escape. Never use in production for untrusted models.
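To see why restricted builtins are not enough, here is a classic escape sketch: with no imports and `__builtins__` reduced to `print`, attribute access alone walks from a tuple to `object.__subclasses__()` and recovers `os.system`:

```python
# __builtins__ is reduced to print only, yet the payload reaches os.system
# via pure object introspection. Restricted builtins are not a sandbox.
payload = """
for cls in ().__class__.__base__.__subclasses__():
    if cls.__name__ == "_wrap_close":          # a class defined inside the os module
        system = cls.__init__.__globals__["system"]
        print("escaped:", system.__name__)
        break
"""
exec(payload, {"__builtins__": {"print": print}})
# prints: escaped: system
```

Nothing in the namespace dict stops attribute access, so any object the code can touch is a potential doorway to the full interpreter.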

Docker container: run the code executor in a Docker container with no network access, read-only filesystem, and resource limits. The agent's calls to your SDK go through a secure channel:

# docker-compose.yml
services:
  code-executor:
    image: python:3.11-slim
    read_only: true
    network_mode: none
    mem_limit: 256m
    cpus: 0.5
    volumes:
      - /tmp/agent-sandbox:/tmp:rw

E2B (cloud sandbox): managed cloud sandboxes specifically designed for LLM code execution. Pay-per-use, no infrastructure management, with file I/O and network access controls. Best for production.

WebAssembly (Pyodide): run Python in a Wasm sandbox in the browser or a confined runtime. No network access by default, strong isolation, but limited library support.

SECTION 06

When code mode beats tool dispatch

Complex data processing: filtering, sorting, aggregating lists of results is natural in Python code but awkward with JSON tool calls (would require multiple sequential calls).

Conditional logic: "search for X; if no results, search for Y instead; if still nothing, use Z" is two lines of Python but requires multiple agent loop iterations with tool dispatch.
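That fallback chain fits in one code block. A sketch (`StubSDK` is a stand-in for the real `sdk` object, used here so the example runs without a backend):

```python
def search_with_fallback(sdk, queries: list[str]) -> list[dict]:
    """Try each query in order; return the first non-empty result set."""
    for q in queries:
        results = sdk.search(q)
        if results:
            return results
    return []

# Stub SDK for illustration: queries containing "X" return nothing.
class StubSDK:
    def search(self, query: str) -> list[dict]:
        return [] if "X" in query else [{"title": query}]

print(search_with_fallback(StubSDK(), ["X benchmarks", "Y benchmarks"]))
# prints: [{'title': 'Y benchmarks'}]
```

With JSON dispatch, each failed query would cost a full model turn before the agent could try the next one.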

Batch operations: process 50 items with a list comprehension in one code block vs 50 sequential tool calls.

Calculations: arithmetic, string manipulation, date calculations — native in Python, awkward via JSON tools.

Use tool dispatch when: security requirements preclude running arbitrary code, tools have important side effects (sending emails, writing to DB) that should be explicitly approved, or you're using models that aren't well-tuned for code generation.

SECTION 07

Gotchas

Code execution latency adds up. Each code block requires parsing, sandboxing, execution, and result formatting. If the agent generates 10 code blocks per query, this overhead is significant. Cache SDK call results aggressively.

The model can import libraries you didn't intend. Even with restricted builtins, __import__ might be available. Explicitly block imports in your SDK docs and in the execution environment.
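Blocking at the environment level is simple: the `import` statement compiles to a call to `__builtins__["__import__"]`, so omitting it makes every import fail fast:

```python
# With __import__ absent from __builtins__, the import statement itself raises.
try:
    exec("import os", {"__builtins__": {}})
except ImportError as e:
    print("blocked:", e)
# prints: blocked: __import__ not found
```

This complements, but does not replace, the introspection-based hardening discussed in the sandboxing section.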

LLMs can write infinite loops. Always set a timeout on code execution: signal.alarm(30) or a threading timeout. An infinite loop in exec() will block your server thread indefinitely without a timeout.

SDK documentation is your prompt. The model writes code against the docstrings you provide. Missing a return type or an example will cause the model to guess, often incorrectly. Treat SDK docs with the same care as system prompts.

SECTION 08

Code Mode Safety and Sandboxing

Sandboxing Approach | Isolation Level | Overhead | Best For
------------------- | --------------- | -------- | --------
subprocess (no sandbox) | None | None | Trusted internal tools only
RestrictedPython | Python AST-level | Low | Simple scripts, no system calls
Docker container | Process + filesystem | Medium (500ms+ startup) | General-purpose untrusted code
gVisor / Firecracker | Kernel-level VM | Medium-high | High-security multi-tenant
WASM (e.g. Pyodide) | Full browser/runtime sandbox | High (cold start) | Client-side or ultra-secure server

import subprocess, tempfile, os, textwrap

def run_sandboxed(code: str, timeout: int = 10) -> dict:
    """Execute Python code in a subprocess with resource limits."""
    # Wrap code to capture stdout and stderr
    wrapper = textwrap.dedent(f"""
import sys, io
_stdout = io.StringIO()
sys.stdout = _stdout
try:
{textwrap.indent(code, "    ")}
except Exception as e:
    print(f"ERROR: {{e}}", file=sys.__stderr__)
finally:
    sys.stdout = sys.__stdout__
    print(_stdout.getvalue(), end="")
""")
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(wrapper); tmp = f.name
    try:
        result = subprocess.run(
            ["python3", "-u", tmp],
            capture_output=True, text=True, timeout=timeout,
            env={**os.environ, "PYTHONPATH": ""}  # strip custom imports
        )
        return {"stdout": result.stdout, "stderr": result.stderr,
                "returncode": result.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "Execution timed out", "returncode": -1}
    finally:
        os.unlink(tmp)

For production code-mode agents, always use Docker or a proper sandbox rather than a bare subprocess. Set resource limits: --memory=256m, --cpus=0.5, and --network=none to prevent runaway memory use, CPU exhaustion, and data exfiltration respectively. Rotate containers after each execution so no state persists across sessions.
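A per-execution invocation consistent with those flags might look like this (the image, mount paths, and script name are illustrative):

```shell
# One throwaway container per execution: --rm discards all state when it exits.
docker run --rm \
  --memory=256m --cpus=0.5 \
  --network=none \
  --read-only --tmpfs /tmp \
  -v "$PWD/agent_code.py:/code/agent_code.py:ro" \
  python:3.11-slim \
  timeout 10 python3 /code/agent_code.py
```

The `timeout 10` inside the container adds a wall-clock cap on top of the resource limits, mirroring the execution timeout advice above.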