Techniques for reliably extracting, parsing, and executing LLM-generated code while handling syntax errors, unsafe patterns, and test failures.
LLMs generate plausible-looking code. "Plausible-looking" is not the same as "correct" or "safe". Common failure modes:

- **Hallucinated APIs** — e.g. `from sklearn.utils import magical_function`, a function that doesn't exist.
- **Unsafe operations** — `eval()`, `exec()`, `subprocess.call(user_input)`, file deletion.

A validation pipeline catches these issues before the code reaches production or an end user.
LLMs often wrap code in markdown fences, add explanatory prose, or include multiple snippets. Extract reliably:
```python
import re

def extract_code(response: str, language: str = "python") -> str:
    '''Extract the first code block from a markdown-formatted LLM response.'''
    # Try a language-tagged fenced code block first
    pattern = rf"```{language}\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL | re.IGNORECASE)
    if matches:
        return matches[0].strip()
    # Fallback: any fenced block
    pattern = r"```\w*\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL)
    if matches:
        return matches[0].strip()
    # Last resort: entire response (assume it's just code)
    return response.strip()
```
````python
# Usage
raw = """
Here's the implementation:

```python
def fibonacci(n: int) -> list[int]:
    a, b = 0, 1
    result = []
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result
```
"""
code = extract_code(raw)
````
Before executing anything, parse the code into an Abstract Syntax Tree. This catches syntax errors without running any code:
```python
import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    '''Returns (is_valid, error_message).'''
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as e:
        return False, f"Syntax error on line {e.lineno}: {e.msg}"

# Example
code = "def foo(x)\n    return x * 2"  # missing colon
valid, error = validate_syntax(code)
print(valid, error)
# False Syntax error on line 1: expected ':'
```
You can also walk the AST to detect unsafe patterns without executing:
```python
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}
BANNED_ATTRS = {"system", "popen"}  # os.system, os.popen

def contains_unsafe_patterns(code: str) -> list[str]:
    tree = ast.parse(code)
    issues = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
                issues.append(f"Banned call: {node.func.id}()")
            if isinstance(node.func, ast.Attribute) and node.func.attr in BANNED_ATTRS:
                issues.append(f"Banned attribute: .{node.func.attr}()")
    return issues
```
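Hallucinated or dangerous imports (the first failure mode above) can be flagged with the same AST walk. A minimal sketch — the deny-list here is illustrative, not exhaustive:

```python
import ast

BANNED_MODULES = {"subprocess", "socket", "ctypes"}  # illustrative deny-list

def flag_banned_imports(code: str) -> list[str]:
    '''Report `import X` and `from X import ...` for deny-listed modules.'''
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            issues.extend(
                f"Banned import: {alias.name}"
                for alias in node.names
                if alias.name.split(".")[0] in BANNED_MODULES
            )
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BANNED_MODULES:
                issues.append(f"Banned import: {node.module}")
    return issues

print(flag_banned_imports("import subprocess\nfrom socket import socket"))
# ['Banned import: subprocess', 'Banned import: socket']
```

Checking against a package index (does this module exist at all?) catches hallucinated imports that a deny-list cannot.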
For controlled execution, use RestrictedPython or a subprocess with resource limits:
```python
import subprocess, sys, tempfile, os

def execute_sandboxed(code: str, timeout: int = 5) -> tuple[str, str, int]:
    '''
    Execute code in a subprocess with timeout.
    Returns (stdout, stderr, returncode).
    '''
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", "Execution timed out", 1
    finally:
        os.unlink(tmp_path)

stdout, stderr, rc = execute_sandboxed("print(sum(range(10)))")
print(stdout)  # 45
print(rc)      # 0
```
For higher security (production LLM code execution), use Docker containers or cloud sandboxes (AWS Lambda, Firecracker) that enforce filesystem and network isolation at the OS level.
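As a sketch of what that looks like with the Docker CLI — assuming Docker and the `python:3.12-slim` image are available; the flags and limit values are illustrative, not a hardened configuration:

```python
def docker_command(script_path: str, timeout: int = 5) -> list[str]:
    '''Build a `docker run` invocation that executes a script with no network,
    a read-only root filesystem, and capped memory/CPU. The container is
    removed on exit; only the script itself is mounted in (read-only).'''
    return [
        "docker", "run", "--rm",
        "--network", "none",      # no network access
        "--read-only",            # read-only root filesystem
        "--memory", "256m",       # cap memory
        "--cpus", "1.0",          # cap CPU
        "-v", f"{script_path}:/sandbox/script.py:ro",
        "python:3.12-slim",
        # `timeout` here is the coreutils command inside the Debian-based image
        "timeout", str(timeout), "python", "/sandbox/script.py",
    ]

# Run with: subprocess.run(docker_command("/tmp/generated.py"), capture_output=True)
```

The command-building function is separated from execution so it can be unit-tested without a Docker daemon.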
Syntax is necessary but not sufficient. Validate the behaviour with test cases:
```python
def validate_with_tests(code: str, test_cases: list[dict]) -> tuple[bool, list[str]]:
    '''
    Run generated code against test cases.
    test_cases: [{"input": ..., "expected": ..., "description": "..."}]
    Returns (all_passed, [failure_messages])
    '''
    # Compile the code into an isolated namespace.
    # NOTE: exec runs the code in-process — combine with the sandboxing
    # above before doing this with untrusted code.
    namespace = {}
    exec(compile(code, "<generated>", "exec"), namespace)
    # Pick the first function the code defined. More robust: prompt the
    # model for a fixed name (e.g. 'solution') and look it up directly.
    func_name = [k for k in namespace if callable(namespace[k])][0]
    func = namespace[func_name]
    failures = []
    for tc in test_cases:
        try:
            result = func(*tc["input"]) if isinstance(tc["input"], tuple) else func(tc["input"])
            if result != tc["expected"]:
                failures.append(
                    f"{tc['description']}: expected {tc['expected']!r}, got {result!r}"
                )
        except Exception as e:
            failures.append(f"{tc['description']}: raised {type(e).__name__}: {e}")
    return len(failures) == 0, failures
```
The full loop ties the three stages together, feeding each failure back to the model as a retry prompt:

```python
import anthropic

client = anthropic.Anthropic()

def generate_and_validate(task: str, test_cases: list[dict], max_retries: int = 3) -> str:
    messages = [{
        "role": "user",
        "content": f"Write a Python function that: {task}\nReturn ONLY the function code, no explanation.",
    }]
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=messages,
        )
        code = extract_code(response.content[0].text)

        # Stage 1: syntax
        valid, err = validate_syntax(code)
        if not valid:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code has a syntax error: {err}\nPlease fix it and return ONLY the corrected code."})
            continue

        # Stage 2: safety
        issues = contains_unsafe_patterns(code)
        if issues:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code contains unsafe patterns: {issues}\nPlease rewrite without these."})
            continue

        # Stage 3: tests
        passed, failures = validate_with_tests(code, test_cases)
        if not passed:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": "The code fails these test cases:\n" + "\n".join(failures) + "\nPlease fix the implementation."})
            continue

        return code  # All checks passed

    raise ValueError(f"Failed to generate valid code after {max_retries} attempts")
```
Before deploying any LLM code execution pipeline, verify:

- Execution has a hard timeout.
- Memory and CPU are capped via `resource.setrlimit` or container limits.
- Filesystem and network access are restricted.

The AST-based checks above are a defence-in-depth layer, not a replacement for OS-level sandboxing.
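On POSIX systems, per-process caps can be applied to the child via `subprocess`'s `preexec_fn` hook — a sketch; the limit values are illustrative:

```python
import resource
import subprocess
import sys

def limit_resources(cpu_seconds: int = 5, memory_bytes: int = 1024**3):
    '''Return a preexec_fn that caps CPU time and address space.
    It runs in the child just before exec, so the parent is unaffected.'''
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return set_limits

result = subprocess.run(
    [sys.executable, "-c", "print(sum(range(10)))"],
    capture_output=True, text=True, timeout=10,
    preexec_fn=limit_resources(),  # POSIX only; not available on Windows
)
print(result.stdout)  # → 45
```

Hitting `RLIMIT_CPU` kills the child with SIGKILL, and allocations beyond `RLIMIT_AS` raise `MemoryError` inside it — both surface as a nonzero return code rather than hanging the parent.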
Code output validation ensures that LLM-generated code meets functional and safety requirements before execution in production environments. The appropriate validation strategy depends on the risk profile of the use case — code run in a sandbox for exploration has different requirements than code automatically deployed to a production pipeline.
| Strategy | Catches | Misses | Latency | Best For |
|---|---|---|---|---|
| Syntax check (AST parse) | Syntax errors | Logic bugs, security issues | ~1ms | All use cases |
| Static analysis (linting) | Style, common bugs | Runtime errors, logic errors | ~10ms | Code quality gates |
| Sandbox execution | Runtime errors, panics | Logic correctness, security | ~500ms | Auto-run pipelines |
| LLM review | Logic bugs, security | Performance, subtle edge cases | ~1–5s | High-stakes code |
| Unit tests (generated) | Functional correctness | Edge cases without coverage | ~1–10s | Library generation |
Layered validation pipelines combine multiple strategies in sequence. A fast syntax check runs first as a gate; code that passes syntax validation is then linted; code passing lint is executed in a sandbox to check for runtime errors. Only code passing all automated checks proceeds to optional LLM review for logic correctness. This layered approach catches the majority of errors with cheap early-stage checks and reserves expensive validation for code that has already passed simpler gates.
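The ordering can be expressed as a list of (name, check) stages, each returning an error string or `None`, so a failure in a cheap stage short-circuits the expensive ones. A minimal sketch with only a syntax stage wired in — lint, sandbox, and review stages would slot in after it:

```python
import ast
from typing import Callable, Optional

def check_syntax(code: str) -> Optional[str]:
    '''Cheapest gate: AST parse. Returns an error string or None.'''
    try:
        ast.parse(code)
        return None
    except SyntaxError as e:
        return f"line {e.lineno}: {e.msg}"

Stage = tuple[str, Callable[[str], Optional[str]]]

def validate_layered(code: str, stages: list[Stage]) -> tuple[bool, str]:
    '''Run stages cheapest-first; stop at the first failure.'''
    for name, check in stages:
        error = check(code)
        if error:
            return False, f"{name}: {error}"
    return True, "all stages passed"

ok, msg = validate_layered("def f(:", [("syntax", check_syntax)])
print(ok, msg)
```

Because each stage shares the `str -> Optional[str]` signature, a sandbox-execution stage can wrap `execute_sandboxed` from earlier and return its stderr on failure.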
For security-critical code generation tasks — generating SQL queries, shell scripts, or network request handlers — static analysis should include security-focused rules such as SQL injection detection, unsafe deserialization patterns, and hardcoded credentials. Tools like Bandit (Python), Semgrep, and CodeQL can be integrated into validation pipelines to scan for known vulnerability patterns before code is executed or reviewed by humans.
Test generation as a validation strategy asks the LLM to generate unit tests alongside the code it produces, then executes those tests against the code in a sandbox. If the generated code passes its own generated tests, this provides moderate confidence in functional correctness — although the tests themselves may have gaps. A stronger variant asks a separate LLM invocation to critique the tests for coverage and then augment them with edge cases before running the full suite.
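The execution step of that strategy can be sketched as follows — the generated code and its generated tests are concatenated and run in one subprocess, so a failing assert surfaces as a nonzero exit code (the prompting that produces the tests is omitted here):

```python
import os
import subprocess
import sys
import tempfile

def run_self_tests(code: str, test_code: str, timeout: int = 10) -> bool:
    '''Run generated code plus its generated tests in a subprocess.
    Returns True only if every assert passes (exit code 0).'''
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.returncode == 0
    finally:
        os.unlink(path)

code = "def double(x):\n    return 2 * x"
tests = "assert double(2) == 4\nassert double(-1) == -2"
print(run_self_tests(code, tests))  # → True
```

The same runner serves the stronger variant: feed the critiqued-and-augmented test suite in as `test_code`.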
Streaming validation — checking code incrementally as it is generated rather than waiting for the full output — is possible for structural constraints like syntax and style, but not for behavioral correctness that requires executing the complete function. Streaming syntax checks can abort generation early when a syntax error is detected, saving tokens and reducing latency compared to completing the full generation and then rejecting it. This is particularly valuable for long code generation tasks where errors in the function signature or early structure would invalidate the entire output.
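One way to approximate this: re-parse the buffer each time a line completes, and treat a syntax error as fatal only when it points strictly before the still-growing tail, since errors at the very end may just mean the code is not finished yet. This is a rough heuristic of my own construction, not a complete streaming parser:

```python
import ast

def stream_syntax_check(chunks) -> tuple[bool, str]:
    '''Incrementally check streamed source, aborting on syntax errors that
    further output cannot fix.'''
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if "\n" not in chunk:
            continue  # only re-check once at least one new line has completed
        try:
            ast.parse(buffer)
        except SyntaxError as e:
            completed_lines = buffer.count("\n")
            # An error strictly before the growing tail is unrecoverable.
            if e.lineno is not None and e.lineno < completed_lines:
                return False, f"fatal at line {e.lineno}: {e.msg}"
    return True, "no fatal errors seen"

print(stream_syntax_check(["def f(:\n", "    return 1\n"]))  # aborts on line 1
```

In a real pipeline the generator would call this from the streaming callback and cancel the request on a fatal result.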