
Coding Agents

AI agents that autonomously write, run, debug, and iterate on code, from autocomplete to fully autonomous software engineers

Write → Run → Fix
Core Loop
Sandboxed
Execution
Tool-Driven
Architecture


SECTION 01

What Are Coding Agents?

A coding agent is an LLM system that autonomously writes, executes, evaluates, and iterates on code to fulfill a high-level goal. Rather than generating a static code block, the agent enters a feedback loop: write code → run it in a sandbox → read the output (stdout, stderr, test results) → patch errors → retry.

This simple loop is surprisingly powerful. Given "implement a binary search tree with unit tests and get all tests passing," a capable coding agent can independently write the implementation, run pytest, interpret failures, fix edge cases, and produce a passing test suite, without human intervention.

The spectrum of coding agents ranges from inline autocomplete at one end to fully autonomous software engineers that take an issue and deliver a finished change at the other.

Core insight: The agent loop converts code generation from a one-shot bet into an iterative optimization. The agent doesn't need to be right the first time; it needs to be able to recognize and fix errors.
SECTION 02

The Agent Loop

The fundamental coding agent loop has four components: observe, plan, act, evaluate.

High-level goal (natural language)
        │
┌───────▼────────────────────┐
│ Coding Agent (LLM)         │
│                            │
│ 1. Plan: break into steps  │
│ 2. Write code for step N   │
│ 3. Execute in sandbox      │
│ 4. Read output / errors    │
│ 5. Patch & retry if needed │
│ 6. Advance to step N+1     │
└───────┬────────────────────┘
        │ Tools
┌───────▼────────────────────┐
│ bash / python / pytest     │
│ read_file / write_file     │
│ git commit / git push      │
└────────────────────────────┘

Planning phase: Before writing code, strong agents decompose the goal into concrete steps ("create the data model", "write the parser", "add unit tests"). This decomposition prevents the agent from writing large monolithic files that are hard to debug.
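This plan-then-execute pattern needs only a little bookkeeping in the agent harness. A minimal sketch, assuming nothing beyond the standard library (the PlanStep and Plan names are illustrative, not from any framework):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str
    done: bool = False

@dataclass
class Plan:
    steps: list[PlanStep] = field(default_factory=list)

    def current(self) -> PlanStep | None:
        """The first unfinished step, or None when the plan is complete."""
        return next((s for s in self.steps if not s.done), None)

    def advance(self) -> None:
        """Mark the current step done; the next call to current() moves on."""
        step = self.current()
        if step is not None:
            step.done = True

# The agent fills in steps during the planning phase, e.g.:
plan = Plan(steps=[
    PlanStep("create the data model"),
    PlanStep("write the parser"),
    PlanStep("add unit tests"),
])
```

Keeping the plan explicit also makes it easy to surface progress to a human reviewer mid-run.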

Execution feedback: The critical enabler is a code execution tool. The agent must be able to call run_python(code) or bash("pytest tests/") and receive actual output. Without execution feedback, the agent is writing in the dark.

Error handling strategy: When the agent encounters a failing test or exception, it should: (1) read the full traceback, (2) identify the root cause, (3) write a targeted fix, (4) re-run to verify. Agents that try to fix every error in one large edit often introduce new bugs.
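The run-read-patch cycle above can be sketched as a small harness loop. Here fix_fn stands in for the LLM call that turns a traceback into a targeted patch; all names are illustrative:

```python
import subprocess
import sys

def run_with_retries(code: str, fix_fn, max_attempts: int = 3) -> str:
    """Run code; on failure, hand the full traceback to fix_fn for a targeted fix."""
    for attempt in range(max_attempts):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode == 0:
            return result.stdout              # success: stop iterating
        # Step (1): read the full traceback from stderr
        traceback_text = result.stderr
        # Steps (2)-(3): fix_fn stands in for the model producing a targeted patch
        code = fix_fn(code, traceback_text)
        # Step (4): the loop re-runs to verify
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

In a real agent, fix_fn is a model turn that receives both the failing code and the traceback, which is what makes the fix targeted rather than a blind rewrite.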

Best practice: Give the agent a clear success criterion before it starts. "All tests in tests/ must pass" is a better stopping condition than "implement the feature." An objective success signal prevents the agent from stopping prematurely or looping forever.
SECTION 03

Sandboxing & Execution

Sandboxing is the most critical safety component of a coding agent. The agent must be able to run arbitrary code during development โ€” but that code must not be able to damage the host system, exfiltrate data, or make unauthorized network calls.

Sandboxing options, in rough order of increasing isolation: a subprocess with resource limits, a container, and a fully isolated managed sandbox or microVM (such as E2B, used below).

# Minimal sandboxed executor with subprocess
import subprocess

def run_python(code: str, timeout: int = 30) -> str:
    result = subprocess.run(
        ["python3", "-c", code],
        capture_output=True, text=True, timeout=timeout,
        # Resource limits on Linux (requires `import resource`):
        # preexec_fn=lambda: resource.setrlimit(resource.RLIMIT_AS, (256*1024*1024, -1))
    )
    output = result.stdout + result.stderr
    return output.strip() or "(no output)"

# E2B sandbox (managed, isolated)
# from e2b_code_interpreter import Sandbox
# with Sandbox() as sbx:
#     execution = sbx.run_code("print(2 ** 32)")
#     print(execution.text)  # "4294967296"
Security rule: Never run agent-generated code in the same process or environment as production data, API keys, or customer systems. Always treat agent code as untrusted input from an external source.
SECTION 04

Tool Design

The tools you give a coding agent directly determine what it can accomplish. Too few tools and the agent cannot complete real tasks. Too many and the agent spends tokens figuring out which tool to call. The right set is task-specific, but some tools are nearly universal.

Core tool set for most coding agents: a sandboxed code executor (run_python or bash), read_file, write_file, and a test runner.

Optional tools for advanced agents: git operations such as commit and push.

tools = [
    {
        "name": "run_python",
        "description": "Execute Python code in a sandbox. Returns stdout + stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source code to run"}
            },
            "required": ["code"]
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file in the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]
        }
    }
]
Design principle: Prefer narrowly scoped tools over a single omnibus bash tool. A dedicated run_tests() tool that only runs the test suite (never arbitrary shell commands) is safer and easier for the agent to use correctly than a raw bash executor.
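A sketch of that principle, assuming a pytest-based project: the run_tests tool exposes a single path argument and builds a fixed argv, so the agent never gets shell access. The names and the path-validation rule are illustrative:

```python
import subprocess
import sys

# Tool schema: no free-form shell string, only a test path.
RUN_TESTS_TOOL = {
    "name": "run_tests",
    "description": "Run the project's pytest suite on a path. Returns the summary.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Test file or directory, e.g. tests/"}
        },
        "required": ["path"],
    },
}

def run_tests(path: str) -> str:
    """Execute pytest via a fixed argv; the agent cannot inject shell syntax."""
    if ".." in path:
        return "error: path escapes the workspace"   # coarse illustrative check
    result = subprocess.run(
        [sys.executable, "-m", "pytest", path, "-q"],
        capture_output=True, text=True, timeout=120,
    )
    return (result.stdout + result.stderr).strip()
```

Because the argv is constructed as a list with a fixed prefix, a malicious or confused path value can at worst point at the wrong tests, not run an arbitrary command.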
SECTION 05

Context Management

Large codebases exceed what fits in a context window. A coding agent working on a 50,000-line codebase cannot read every file before acting. Effective context management is the difference between an agent that gets lost and one that navigates a real repository.

Strategies for fitting codebase context include generating a compact repository map for the system prompt and letting the agent read individual files on demand.

# Efficient context strategy: repo map + on-demand reading
import os

def get_repo_map(root_dir: str, max_depth: int = 3) -> str:
    """Generate a compact tree of the repository."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        depth = dirpath.replace(root_dir, '').count(os.sep)
        if depth > max_depth:
            continue
        indent = '  ' * depth
        lines.append(f"{indent}{os.path.basename(dirpath)}/")
        sub_indent = '  ' * (depth + 1)
        for fname in sorted(filenames):
            if not fname.startswith('.'):
                lines.append(f"{sub_indent}{fname}")
    return '\n'.join(lines)

# Pass repo_map in system prompt; let agent call read_file as needed
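The on-demand half of the strategy benefits from a bounded read tool, so no single file read floods the context window. A sketch (read_file_window is an illustrative name, not a standard tool):

```python
def read_file_window(path: str, start: int = 1, max_lines: int = 200) -> str:
    """Return at most max_lines lines beginning at line `start` (1-indexed),
    prefixed with a header so the agent knows how much of the file it saw."""
    with open(path) as f:
        lines = f.readlines()
    end = min(start - 1 + max_lines, len(lines))
    chunk = "".join(lines[start - 1:end])
    if chunk and not chunk.endswith("\n"):
        chunk += "\n"
    return f"[{path}: lines {start}-{end} of {len(lines)}]\n" + chunk
```

The header lets the model decide for itself whether to request the next window, which keeps paging logic out of the harness.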
SECTION 06

Production Patterns

Deploying coding agents in production requires solving reliability, auditability, and cost control problems that don't appear in demos.

Human-in-the-loop checkpoints: Even "autonomous" coding agents benefit from checkpoints. A common pattern is: agent works autonomously on a branch → creates a draft PR → human reviews and approves or requests changes. This combines agent speed with human quality control.

Task scoping: Coding agents succeed when tasks are clearly scoped with verifiable success criteria. Tasks like "fix the bug in issue #342 so that the existing test suite passes" are much more reliable than "improve the codebase." The agent needs a signal for when it is done.

Cost controls: Long agent sessions are expensive. Set explicit turn limits (e.g. 30 tool calls max), token budgets, and time limits. Add cost tracking to your agent harness. Consider using a fast, cheap model for context setup and a strong model only for the actual coding turns.
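A sketch of such a harness-side guard; the limit values are illustrative defaults, not recommendations:

```python
import time

class BudgetExceeded(Exception):
    """Raised when a session crosses any of its limits."""

class AgentBudget:
    """Track turn, token, and wall-clock budgets for one agent session."""
    def __init__(self, max_turns: int = 30, max_tokens: int = 200_000,
                 max_seconds: float = 600.0):
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.turns = 0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per model turn with the usage numbers the API reports."""
        self.turns += 1
        self.tokens += input_tokens + output_tokens
        if self.turns > self.max_turns:
            raise BudgetExceeded(f"turn limit {self.max_turns} reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded(f"time limit {self.max_seconds}s exceeded")
```

Catching BudgetExceeded at the top of the agent loop is a natural place to trigger the escalation path described below, rather than failing silently.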

Retry and escalation: When an agent exceeds its turn limit without succeeding, escalate to a human rather than silently failing. Log the full conversation so a developer can understand what went wrong.

import anthropic, subprocess

client = anthropic.Anthropic()

def coding_agent(task: str, max_turns: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    tools = [{
        "name": "run_python",
        "description": "Run Python code in sandbox.",
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"]
        }
    }]
    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason == "end_turn":
            for block in reversed(resp.content):
                if hasattr(block, "text"):
                    return block.text
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use" and block.name == "run_python":
                out = subprocess.run(
                    ["python3", "-c", block.input["code"]],
                    capture_output=True, text=True, timeout=30
                )
                result = (out.stdout + out.stderr).strip() or "(no output)"
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})
    return f"Agent exceeded {max_turns} turns. Manual review needed."
SECTION 07

Evaluation & Safety

Evaluating coding agents is harder than evaluating code generation benchmarks because agent performance depends on the full loop: initial code quality, error recovery, and final test pass rate.

Evaluation metrics: end-to-end task success rate (does the final code pass the tests?), number of turns or retries needed to succeed, and cost per solved task (tokens and wall-clock time).

Safety considerations: sandbox escapes, data exfiltration, unauthorized network access, and the baseline rule of treating all agent-generated code as untrusted input.

SWE-bench context: SWE-bench (Jimenez et al., 2024) is the standard coding agent benchmark: 2,294 real GitHub issues from popular Python repos. Top agents (2025) solve 40–55% of issues. Human software engineers solve ~86%. The gap is real but narrowing fast.