SECTION 01
What Are Coding Agents?
A coding agent is an LLM system that autonomously writes, executes, evaluates, and iterates on code to fulfill a high-level goal. Rather than generating a static code block, the agent enters a feedback loop: write code → run it in a sandbox → read the output (stdout, stderr, test results) → patch errors → retry.
This simple loop is surprisingly powerful. Given "implement a binary search tree with unit tests and get all tests passing," a capable coding agent can independently write the implementation, run pytest, interpret failures, fix edge cases, and produce a passing test suite, without human intervention.
Coding agents sit on a spectrum of increasing autonomy:
- Autocomplete (Copilot-style): Next-token / next-line prediction. Human always in the loop.
- Inline editing (Cursor, Copilot Chat): LLM rewrites a selected region of code on instruction. Single-shot, no execution feedback.
- Agent-assisted coding: LLM writes code, runs it, and reports results. Human approves each change.
- Autonomous coding agents (Devin, SWE-agent): LLM independently reads issues, writes code, runs tests, opens pull requests. Human reviews the PR.
Core insight: The agent loop converts code generation from a one-shot bet into an iterative optimization. The agent doesn't need to be right the first time; it needs to be able to recognize and fix errors.
SECTION 02
The Agent Loop
The fundamental coding agent loop has four components: observe, plan, act, evaluate.
High-level goal (natural language)
        │
┌───────▼──────────────────────┐
│      Coding Agent (LLM)      │
│                              │
│  1. Plan: break into steps   │
│  2. Write code for step N    │
│  3. Execute in sandbox       │
│  4. Read output / errors     │
│  5. Patch & retry if needed  │
│  6. Advance to step N+1      │
└───────┬──────────────────────┘
        │ Tools
┌───────▼──────────────────────┐
│  bash / python / pytest      │
│  read_file / write_file      │
│  git commit / git push       │
└──────────────────────────────┘
Planning phase: Before writing code, strong agents decompose the goal into concrete steps ("create the data model", "write the parser", "add unit tests"). This decomposition prevents the agent from writing large monolithic files that are hard to debug.
Execution feedback: The critical enabler is a code execution tool. The agent must be able to call run_python(code) or bash("pytest tests/") and receive actual output. Without execution feedback, the agent is writing in the dark.
Error handling strategy: When the agent encounters a failing test or exception, it should: (1) read the full traceback, (2) identify the root cause, (3) write a targeted fix, (4) re-run to verify. Agents that try to fix every error in one large edit often introduce new bugs.
Best practice: Give the agent a clear success criterion before it starts. "All tests in tests/ must pass" is a better stopping condition than "implement the feature." An objective success signal prevents the agent from stopping prematurely or looping forever.
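The loop above can be sketched as a small control function. This is a minimal sketch, not a real framework: `write_patch`, `apply_patch`, and `run_tests` are hypothetical stand-ins for the LLM call, the workspace edit, and the sandboxed test runner.

```python
def agent_loop(goal, write_patch, apply_patch, run_tests, max_iterations=10):
    """Iterate write -> apply -> test until the success criterion holds."""
    for _ in range(max_iterations):
        patch = write_patch(goal)        # plan + write code for this step
        apply_patch(patch)               # act: modify the workspace
        passed, report = run_tests()     # evaluate against an objective signal
        if passed:
            return True                  # clear stopping condition reached
        goal = f"{goal}\n\nLatest failure:\n{report}"  # feed errors back in
    return False                         # budget exhausted: escalate to a human

# Toy demo: a fake workspace whose tests pass after two fixes.
state = {"bugs": 2}
result = agent_loop(
    goal="make tests pass",
    write_patch=lambda g: "fix one bug",
    apply_patch=lambda p: state.update(bugs=state["bugs"] - 1),
    run_tests=lambda: (state["bugs"] == 0, f"{state['bugs']} tests failing"),
)
```

Note how the failure report is appended to the goal each iteration: the stopping condition stays objective while the agent's context accumulates exactly the error evidence it needs.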
SECTION 03
Sandboxing & Execution
Sandboxing is the most critical safety component of a coding agent. The agent must be able to run arbitrary code during development, but that code must not be able to damage the host system, exfiltrate data, or make unauthorized network calls.
Sandboxing options (by isolation level):
- subprocess with limits: subprocess.run with a timeout and resource limits via ulimit. Simple but offers minimal isolation. Suitable for trusted scripts.
- Docker containers: Run code in an ephemeral container with no host mounts, no network (or restricted network), and a read-only filesystem outside the workspace. Strong isolation. Used by SWE-agent and similar systems.
- E2B (e2b.dev): Managed cloud sandboxes with Python/Node runtimes, persistent across agent steps, and destroyed after the session. Easy to integrate via SDK.
- Modal: Serverless sandboxes with fine-grained resource control. Good for heavy compute tasks (GPU, long-running jobs).
- Pyodide (browser): Python in WebAssembly. No server needed. Limited to pure-Python packages. Useful for demos and lightweight agents.
# Minimal sandboxed executor with subprocess
import subprocess

def run_python(code: str, timeout: int = 30) -> str:
    result = subprocess.run(
        ["python3", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
        # Resource limits on Linux (requires `import resource`):
        # preexec_fn=lambda: resource.setrlimit(resource.RLIMIT_AS, (256*1024*1024, -1))
    )
    output = result.stdout + result.stderr
    return output.strip() or "(no output)"
# E2B sandbox (managed, isolated)
# from e2b_code_interpreter import Sandbox
# with Sandbox() as sbx:
#     execution = sbx.run_code("print(2 ** 32)")
#     print(execution.text)  # "4294967296"
Security rule: Never run agent-generated code in the same process or environment as production data, API keys, or customer systems. Always treat agent code as untrusted input from an external source.
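Concretely, the Docker option described above might look like the following sketch. The image name, mount layout, and resource limits are illustrative choices, while the isolation flags (`--network=none`, `--read-only`, `--tmpfs`) are standard `docker run` options.

```python
import subprocess

def build_docker_cmd(code: str, workspace: str) -> list[str]:
    """Assemble the docker run invocation: ephemeral, offline, read-only."""
    return [
        "docker", "run", "--rm",
        "--network=none",               # no exfiltration, no API calls
        "--read-only",                  # immutable root filesystem...
        "--tmpfs", "/tmp",              # ...with scratch space for temp files
        "-v", f"{workspace}:/work",     # only the workspace is writable
        "-w", "/work",
        "--memory", "512m", "--cpus", "1.0",
        "python:3.12-slim", "python3", "-c", code,
    ]

def docker_run_python(code: str, workspace: str, timeout: int = 60) -> str:
    out = subprocess.run(build_docker_cmd(code, workspace),
                         capture_output=True, text=True, timeout=timeout)
    return (out.stdout + out.stderr).strip() or "(no output)"
```

Keeping the command construction in its own function makes the isolation policy auditable and testable without spinning up a container.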
SECTION 04
Tool Design
The tools you give a coding agent directly determine what it can accomplish. Too few tools and the agent cannot complete real tasks. Too many and the agent spends tokens figuring out which tool to call. The right set is task-specific, but some tools are nearly universal.
Core tool set for most coding agents:
- run_python / run_bash: Execute code, return stdout + stderr. The most important tool.
- read_file(path): Read a file from the workspace. Essential for understanding existing code.
- write_file(path, content): Write or overwrite a file. The agent's primary output mechanism.
- list_dir(path): Explore the repository structure. Helps the agent orient in large codebases.
- run_tests(pattern): Run pytest or another test runner, return results. Used to verify progress.
Optional tools for advanced agents:
- git_diff / git_commit / git_push: Source control integration for PR-creating agents.
- web_search / fetch_url: Find documentation, Stack Overflow answers, or package APIs.
- install_package(name): Install dependencies during development. Use with caution.
- code_search(pattern): Semantic or grep-based search across the codebase. Critical for large repos.
tools = [
    {
        "name": "run_python",
        "description": "Execute Python code in a sandbox. Returns stdout + stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source code to run"}
            },
            "required": ["code"]
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file in the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]
        }
    }
]
Design principle: Prefer narrowly scoped tools over a single omnibus bash tool. A dedicated run_tests() tool that only runs the test suite (never arbitrary shell commands) is safer and easier for the agent to use correctly than a raw bash executor.
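As a sketch of this principle, a dedicated `run_tests` tool can validate its single argument against a whitelist pattern instead of handing the agent a shell. The validation regex and function shape here are illustrative choices, not a specific library's API.

```python
import re
import subprocess

def run_tests(pattern: str = "tests/", timeout: int = 120) -> str:
    """Run pytest on a single validated path pattern -- nothing else."""
    # Whitelist the argument shape instead of trusting the agent with a shell.
    if not re.fullmatch(r"[\w./-]+(::\w+)?", pattern):
        return "error: invalid test pattern"
    out = subprocess.run(
        ["python3", "-m", "pytest", pattern, "-q"],
        capture_output=True, text=True, timeout=timeout,
    )
    return (out.stdout + out.stderr).strip() or "(no output)"
```

Because the command is built as an argument list (never a shell string), injection attempts like `tests; rm -rf /` are rejected before anything runs.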
SECTION 05
Context Management
Large codebases exceed what fits in a context window. A coding agent working on a 50,000-line codebase cannot read every file before acting. Effective context management is the difference between an agent that gets lost and one that navigates a real repository.
Strategies for fitting codebase context:
- File tree summarization: Give the agent a compressed view of the repository structure (directory tree + short description of each module) rather than raw file contents.
- Targeted file reading: Equip the agent with read_file and code_search tools and let it pull exactly what it needs. Agents that are told "read only files relevant to your task" use context far more efficiently.
- Chunked editing: For large files, show only the relevant section. Provide the agent with a view_range(file, start_line, end_line) tool.
- Conversation compression: After every N turns, summarize earlier tool outputs (long test runs, verbose logs) into a compact summary. This prevents context overflow in long agent sessions.
# Efficient context strategy: repo map + on-demand reading
import os

def get_repo_map(root_dir: str, max_depth: int = 3) -> str:
    """Generate a compact tree of the repository."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        depth = os.path.relpath(dirpath, root_dir).count(os.sep)
        if depth > max_depth:
            dirnames[:] = []  # prune traversal below max_depth
            continue
        indent = '  ' * depth
        lines.append(f"{indent}{os.path.basename(dirpath)}/")
        sub_indent = '  ' * (depth + 1)
        for fname in sorted(filenames):
            if not fname.startswith('.'):
                lines.append(f"{sub_indent}{fname}")
    return '\n'.join(lines)

# Pass repo_map in system prompt; let agent call read_file as needed
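The conversation-compression strategy can likewise be sketched as a pass over the message history. The message shape follows the tool-use convention (a list of content blocks, some of type `tool_result`); the keep-recent/truncate policy is an assumed heuristic, not a prescribed one.

```python
def compress_history(messages, keep_recent: int = 6, max_chars: int = 200):
    """Truncate tool outputs in all but the most recent turns."""
    cutoff = len(messages) - keep_recent
    compressed = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if i >= cutoff or not isinstance(content, list):
            compressed.append(msg)  # recent turns stay verbatim
            continue
        blocks = []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_result":
                text = str(block.get("content", ""))
                if len(text) > max_chars:
                    # Replace the verbose output with a head + truncation note.
                    block = {**block, "content":
                             text[:max_chars] + f"... [{len(text) - max_chars} chars truncated]"}
            blocks.append(block)
        compressed.append({**msg, "content": blocks})
    return compressed
```

The function builds new dicts rather than mutating in place, so the full history can still be logged for audit while the compressed copy is what gets sent to the model.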
SECTION 06
Production Patterns
Deploying coding agents in production requires solving reliability, auditability, and cost control problems that don't appear in demos.
Human-in-the-loop checkpoints: Even "autonomous" coding agents benefit from checkpoints. A common pattern is: agent works autonomously on a branch → creates a draft PR → human reviews and approves or requests changes. This combines agent speed with human quality control.
Task scoping: Coding agents succeed when tasks are clearly scoped with verifiable success criteria. Tasks like "fix the bug in issue #342 so that the existing test suite passes" are much more reliable than "improve the codebase." The agent needs a signal for when it is done.
Cost controls: Long agent sessions are expensive. Set explicit turn limits (e.g. 30 tool calls max), token budgets, and time limits. Add cost tracking to your agent harness. Consider using a fast, cheap model for context setup and a strong model only for the actual coding turns.
Retry and escalation: When an agent exceeds its turn limit without succeeding, escalate to a human rather than silently failing. Log the full conversation so a developer can understand what went wrong.
import anthropic, subprocess

client = anthropic.Anthropic()

def coding_agent(task: str, max_turns: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    tools = [
        {"name": "run_python", "description": "Run Python code in sandbox.",
         "input_schema": {"type": "object", "properties": {"code": {"type": "string"}}, "required": ["code"]}}
    ]
    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-opus-4-5", max_tokens=4096,
            tools=tools, messages=messages
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason == "end_turn":
            for block in reversed(resp.content):
                if hasattr(block, "text"):
                    return block.text
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use" and block.name == "run_python":
                out = subprocess.run(
                    ["python3", "-c", block.input["code"]],
                    capture_output=True, text=True, timeout=30
                )
                result = (out.stdout + out.stderr).strip() or "(no output)"
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})
    return f"Agent exceeded {max_turns} turns. Manual review needed."
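The cost controls described above can be bolted onto a harness like this with a small tracker that accumulates token usage per turn. The per-million-token prices below are placeholders; substitute your model's actual rates.

```python
class CostTracker:
    """Accumulate token usage across turns and flag budget overruns."""
    def __init__(self, input_per_mtok=3.0, output_per_mtok=15.0, budget_usd=2.0):
        self.input_tokens = 0
        self.output_tokens = 0
        self.input_per_mtok = input_per_mtok    # placeholder $/1M input tokens
        self.output_per_mtok = output_per_mtok  # placeholder $/1M output tokens
        self.budget_usd = budget_usd

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def spent_usd(self) -> float:
        return (self.input_tokens * self.input_per_mtok
                + self.output_tokens * self.output_per_mtok) / 1_000_000

    def over_budget(self) -> bool:
        return self.spent_usd >= self.budget_usd

# In the loop: call tracker.record(resp.usage.input_tokens, resp.usage.output_tokens)
# after each API call, then escalate to a human when tracker.over_budget() is True.
```

A dollar budget complements the turn limit: a run can burn through its budget in a few huge-context turns, or stay cheap across many small ones.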
SECTION 07
Evaluation & Safety
Evaluating coding agents is harder than evaluating code generation benchmarks because agent performance depends on the full loop: initial code quality, error recovery, and final test pass rate.
Evaluation metrics:
- Task completion rate: Percentage of tasks where all acceptance tests pass on the final agent output. The SWE-bench benchmark measures this for real GitHub issues.
- Turns to completion: How many agent turns are needed? Fewer turns = lower cost and latency.
- Error recovery rate: Given a task the agent initially fails, what fraction does it recover from with additional turns?
- Code quality: Does the agent's code pass linting, type checking, and code review criteria beyond just the test suite?
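These metrics can be computed from a log of agent runs. The record format here (`passed`, `first_try`, `turns`) is invented for illustration; map it onto whatever your harness actually logs.

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate completion, cost, and recovery metrics from run records."""
    completed = [r for r in runs if r["passed"]]
    initially_failed = [r for r in runs if not r["first_try"]]
    recovered = [r for r in initially_failed if r["passed"]]
    return {
        "completion_rate": len(completed) / len(runs),
        "avg_turns_to_completion": (
            sum(r["turns"] for r in completed) / len(completed)
            if completed else None
        ),
        "error_recovery_rate": (
            len(recovered) / len(initially_failed)
            if initially_failed else None
        ),
    }

# Example: one first-try success, one recovery, one failure.
runs = [
    {"passed": True,  "first_try": True,  "turns": 1},
    {"passed": True,  "first_try": False, "turns": 5},
    {"passed": False, "first_try": False, "turns": 30},
]
metrics = summarize_runs(runs)
```

Tracking recovery rate separately from completion rate matters: two agents with the same completion rate can differ sharply in how often they dig themselves out of an initial failure.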
Safety considerations:
- Always run agent code in an isolated sandbox with no access to production data or credentials.
- Review all git commits before merging to main branches.
- Set network restrictions in the sandbox to prevent data exfiltration.
- Audit agent tool calls for unexpected resource usage (excessive API calls, large file writes).
SWE-bench context: SWE-bench (Jimenez et al., 2024) is the standard coding agent benchmark: 2,294 real GitHub issues from popular Python repos. Top agents (2025) solve 40-55% of issues. Human software engineers solve ~86%. The gap is real but narrowing fast.