In code mode, the LLM writes and executes Python code against a typed SDK instead of dispatching JSON tool calls. The code is the plan: composable, compact, and far cheaper in context tokens than per-call tool dispatch.
Standard function calling: the model outputs `{"name": "search", "args": {"query": "..."}}`. Your code parses this JSON, routes it to the right function, validates the arguments, runs the function, and returns the result. Each tool call is a round-trip.
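The dispatch side of that round-trip can be sketched as follows (the `search` stub and `TOOLS` registry are illustrative, not part of any particular framework):

```python
import json

# Illustrative tool registry: maps tool names to Python functions
def search(query: str) -> list[str]:
    return [f"Result for {query}"]

TOOLS = {"search": search}

def dispatch(tool_call_json: str) -> str:
    """One round-trip: parse the model's JSON, validate, route, run."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    result = func(**call["args"])
    return json.dumps({"result": result})

print(dispatch('{"name": "search", "args": {"query": "LLM benchmarks 2024"}}'))
```

Every entry in this loop costs a full model turn, which is exactly the overhead code mode removes.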
Code mode: the model outputs Python code that calls your API directly. Instead of multiple JSON dispatch round-trips, the model writes a complete program:
```python
results = search("LLM benchmarks 2024")
filtered = [r for r in results if r.date > "2024-01-01"]
summary = summarise(filtered[:3])
final_answer(summary)
```
This is more powerful: the model can use loops, conditionals, list comprehensions, and intermediate variables without a round-trip per operation. It's also more compact — one code block does what would take 5+ JSON tool calls. The SmolAgents and Code Interpreter patterns both use this approach.
In tool dispatch, the model's "plan" is implicit — it's revealed one tool call at a time. In code mode, the model's plan is the code itself, making it explicit, readable, and auditable before execution.
This has three key benefits: transparency (you can read the code and understand what the agent intends to do before running it), composability (Python operations can be combined in any way — the model isn't limited to single tool calls), and efficiency (complex logic that would require 10+ JSON tool calls becomes a 20-line function).
The tradeoff is security: running arbitrary LLM-generated code is more dangerous than running a pre-defined list of tools. Every code-mode deployment needs a sandboxing strategy.
```python
import re
import sys
from io import StringIO

import anthropic

client = anthropic.Anthropic()

# The SDK the agent can call (exposed to the agent)
class AgentSDK:
    '''Typed SDK available to the agent. Use as `sdk` in generated code.'''

    def search(self, query: str) -> list[dict]:
        '''Search the web. Returns list of {title, url, snippet}.'''
        return [{"title": f"Result for {query}",
                 "url": "https://example.com", "snippet": "..."}]

    def calculate(self, expression: str) -> float:
        '''Safely evaluate a math expression.'''
        import ast, operator
        ops = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv}

        def safe_eval(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return ops[type(node.op)](safe_eval(node.left), safe_eval(node.right))
            raise ValueError(f"Unsupported: {node}")

        return safe_eval(ast.parse(expression, mode="eval").body)

    def final_answer(self, answer: str) -> str:
        print(answer)  # echo so the agent loop can read the answer from stdout
        return answer

sdk = AgentSDK()

def execute_code(code: str) -> tuple[str, str | None]:
    '''Execute code in a restricted namespace. Returns (stdout, error).'''
    # Capture stdout for the duration of the exec
    old_stdout = sys.stdout
    sys.stdout = buffer = StringIO()
    error = None
    try:
        exec(code, {"sdk": sdk, "__builtins__": {
            "print": print, "len": len, "range": range, "str": str,
            "int": int, "float": float, "list": list, "dict": dict}})
    except Exception as e:
        error = str(e)
    finally:
        sys.stdout = old_stdout
    return buffer.getvalue(), error
```
````python
SYSTEM = '''You are a code-mode agent. To answer the user, write Python code using the `sdk` object.
Available methods: sdk.search(query), sdk.calculate(expression), sdk.final_answer(answer)
Output code in a ```python block. The code will be executed and its output returned to you.'''

def code_agent(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(5):
        resp = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024,
            system=SYSTEM, messages=messages,
        )
        text = resp.content[0].text
        # Extract the first ```python block from the response
        match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
        if not match:
            return text  # no code block: treat the reply as a direct answer
        code = match.group(1)
        stdout, error = execute_code(code)
        messages.append({"role": "assistant", "content": text})
        if error:
            messages.append({"role": "user", "content": f"Error: {error}"})
        else:
            messages.append({"role": "user", "content": f"Output: {stdout}"})
        if "final_answer" in code:
            return stdout
    return "Max steps reached"

print(code_agent("What is 1234 * 5678?"))
````
The quality of code-mode agents depends heavily on how well the SDK is documented. The model generates code based on the docstrings and type hints in the system prompt:
```python
SDK_DOCS = '''
Available SDK (import as `sdk`):

sdk.search(query: str) -> list[dict]
    Search the web. Returns [{"title": str, "url": str, "snippet": str}].
    Use for: current information, facts, news.

sdk.database_query(sql: str) -> list[dict]
    Execute a read-only SQL query on the product database.
    Tables: products(id, name, price, stock), orders(id, user_id, product_id, status)
    IMPORTANT: SELECT only — no INSERT/UPDATE/DELETE.

sdk.send_notification(user_id: str, message: str) -> bool
    Send a push notification. Returns True if successful.
    Rate limit: max 1 per user per hour.

sdk.final_answer(answer: str | dict) -> None
    Call when you have the answer. Pass a string or dict.
    ALWAYS call this to end the agent loop.
'''
```
Include usage examples, return types, rate limits, and explicit warnings about what NOT to do. The model follows the docs precisely.
Running LLM-generated code is dangerous without sandboxing. Levels of isolation from least to most secure:
Restricted builtins (minimal protection): pass a limited `__builtins__` dict to `exec()`. This blocks the obvious attacks, but determined code can escape. Never rely on it alone in production with untrusted models.
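To see why restricted builtins alone are insufficient: any object in the namespace exposes the full loaded class hierarchy through its attributes, so a well-known escape reaches every class (and from there, typically `os`) without importing anything:

```python
# Even with an empty __builtins__, a bare tuple literal is enough
# to walk the type hierarchy and enumerate every loaded class.
payload = "().__class__.__bases__[0].__subclasses__()"
classes = eval(payload, {"__builtins__": {}})
# From this list, attacker code can usually reach os.system, e.g. via a
# class whose function globals still contain the real builtins.
print(len(classes) > 0)  # the "sandbox" exposed the whole class tree
```

This is why restricted builtins are a speed bump, not a boundary.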
Docker container: run the code executor in a Docker container with no network access, read-only filesystem, and resource limits. The agent's calls to your SDK go through a secure channel:
```yaml
# docker-compose.yml
services:
  code-executor:
    image: python:3.11-slim
    read_only: true
    network_mode: none
    mem_limit: 256m
    cpus: 0.5
    volumes:
      - /tmp/agent-sandbox:/tmp:rw
```
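On the host side, one way to invoke such a container per execution is via `docker run` directly; this is a hedged sketch (the image name and limits are illustrative, matching the compose file above):

```python
import subprocess, tempfile, os

def docker_cmd(path: str) -> list[str]:
    """Build the docker invocation: throwaway container, no network,
    read-only filesystem, memory/CPU caps (limits are illustrative)."""
    return ["docker", "run", "--rm",
            "--network=none", "--read-only",
            "--memory=256m", "--cpus=0.5",
            "-v", f"{path}:/sandbox/main.py:ro",
            "python:3.11-slim", "python3", "/sandbox/main.py"]

def run_in_docker(code: str, timeout: int = 15) -> str:
    """Write the code to a temp file, run it in the container, return stdout."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(docker_cmd(path), capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout
    finally:
        os.unlink(path)
```

`--rm` discards the container after each run, which also gives you the state-rotation property discussed later.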
E2B (cloud sandbox): managed cloud sandboxes specifically designed for LLM code execution. Pay-per-use, no infrastructure management, with file I/O and network access controls. Best for production.
WebAssembly (Pyodide): run Python in a Wasm sandbox in the browser or a confined runtime. No network access by default, strong isolation, but limited library support.
Complex data processing: filtering, sorting, aggregating lists of results is natural in Python code but awkward with JSON tool calls (would require multiple sequential calls).
Conditional logic: "search for X; if no results, search for Y instead; if still nothing, use Z" is two lines of Python but requires multiple agent loop iterations with tool dispatch.
Batch operations: process 50 items with a list comprehension in one code block vs 50 sequential tool calls.
Calculations: arithmetic, string manipulation, date calculations — native in Python, awkward via JSON tools.
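A sketch of what agent-generated code for the fallback and batch patterns might look like (`_StubSDK` is a stand-in for the `AgentSDK` defined earlier, added only so the snippet runs on its own):

```python
class _StubSDK:
    """Minimal stand-in for the AgentSDK, just to make the sketch runnable."""
    def search(self, query: str) -> list[dict]:
        # Simulate the first query returning nothing
        return [] if "benchmarks 2024" in query else [{"title": f"Result for {query}"}]
    def final_answer(self, answer: str) -> str:
        return answer

sdk = _StubSDK()

# Fallback chain: one code block replaces several dispatch round-trips.
results = sdk.search("LLM benchmarks 2024")
if not results:
    results = sdk.search("language model evaluation 2024")

# Batch operation: one comprehension instead of one tool call per item.
titles = [r["title"] for r in results[:50]]
answer = sdk.final_answer(", ".join(titles))
print(answer)
```

With JSON tool dispatch, the same behaviour would need a model turn per search attempt and per item.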
Use tool dispatch when: security requirements preclude running arbitrary code, tools have important side effects (sending emails, writing to DB) that should be explicitly approved, or you're using models that aren't well-tuned for code generation.
Code execution latency adds up. Each code block requires parsing, sandboxing, execution, and result formatting. If the agent generates 10 code blocks per query, this overhead is significant. Cache SDK call results aggressively.
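One way to cache is to memoise the SDK wrapper itself; a sketch using `functools.lru_cache` (the `_raw_search` stand-in is illustrative):

```python
from functools import lru_cache

def _raw_search(query: str) -> list[str]:
    # Stand-in for the real (slow, billable) SDK call.
    return [f"Result for {query}"]

@lru_cache(maxsize=256)
def cached_search(query: str) -> tuple:
    """Memoised wrapper around the search call; returns a tuple
    so the cached value is immutable and hashable."""
    return tuple(_raw_search(query))

cached_search("LLM benchmarks 2024")    # misses the cache, runs the call
cached_search("LLM benchmarks 2024")    # served from the cache
print(cached_search.cache_info().hits)  # → 1
```

Returning immutable tuples matters here: a cached list could be mutated by one code block and silently corrupt later reads.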
The model can import libraries you didn't intend. Even with restricted builtins, __import__ might be available. Explicitly block imports in your SDK docs and in the execution environment.
LLMs can write infinite loops. Always set a timeout on code execution: signal.alarm(30) or a threading timeout. An infinite loop in exec() will block your server thread indefinitely without a timeout.
SDK documentation is your prompt. The model writes code against the docstrings you provide. Missing a return type or an example will cause the model to guess, often incorrectly. Treat SDK docs with the same care as system prompts.
| Sandboxing Approach | Isolation Level | Overhead | Best For |
|---|---|---|---|
| subprocess (no sandbox) | None | None | Trusted internal tools only |
| RestrictedPython | Python AST-level | Low | Simple scripts, no system calls |
| Docker container | Process + filesystem | Medium (500ms+ startup) | General-purpose untrusted code |
| gVisor / Firecracker | Kernel-level VM | Medium-high | High-security multi-tenant |
| WASM (e.g. Pyodide) | Full browser/runtime sandbox | High (cold start) | Client-side or ultra-secure server |
```python
import subprocess, tempfile, os, textwrap

def run_sandboxed(code: str, timeout: int = 10) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    # Wrap the code so its stdout is captured; errors go to the real stderr.
    body = textwrap.indent(code, "    ")
    wrapper = f"""\
import sys, io
_stdout = io.StringIO()
sys.stdout = _stdout
try:
{body}
except Exception as e:
    print(f"ERROR: {{e}}", file=sys.__stderr__)
finally:
    sys.stdout = sys.__stdout__
    print(_stdout.getvalue(), end="")
"""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(wrapper)
        tmp = f.name
    try:
        result = subprocess.run(
            ["python3", "-u", tmp],
            capture_output=True, text=True, timeout=timeout,
            env={**os.environ, "PYTHONPATH": ""},  # strip custom import paths
        )
        return {"stdout": result.stdout, "stderr": result.stderr,
                "returncode": result.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "Execution timed out", "returncode": -1}
    finally:
        os.unlink(tmp)
```
For production code-mode agents, always use Docker or a proper sandbox rather than a bare subprocess. Set resource limits (`--memory=256m --cpus=0.5 --network=none`) to prevent runaway memory use, CPU exhaustion, and data exfiltration. Rotate containers after each execution to prevent state persisting across sessions.