Code Output Validation

Techniques for reliably extracting, parsing, and executing LLM-generated code while handling syntax errors, unsafe patterns, and test failures.

Validation pipeline: parse → sandbox → test · AST-based safe analysis · retry on failure

Table of Contents

01  Why LLM code needs validation
02  Extraction: getting clean code from output
03  Syntax validation with AST
04  Safe execution sandboxing
05  Semantic validation with tests
06  Retry loop pattern
07  Security checklist
    Validation Strategy Comparison

SECTION 01

Why LLM code needs validation

LLMs generate plausible-looking code. "Plausible-looking" is not the same as "correct" or "safe". Common failure modes:

- Syntax errors: truncated output, mismatched brackets, missing colons
- Hallucinated APIs: calls to functions, methods, or imports that do not exist
- Unsafe operations: eval/exec, shell commands, unrestricted file or network access
- Logic bugs: code that parses and runs but fails its test cases
- Formatting noise: markdown fences and explanatory prose mixed into the code

A validation pipeline catches these issues before the code reaches production or an end user.

SECTION 02

Extraction: getting clean code from output

LLMs often wrap code in markdown fences, add explanatory prose, or include multiple snippets. Extract reliably:

import re

def extract_code(response: str, language: str = "python") -> str:
    '''Extract the first code block from a markdown-formatted LLM response.'''
    # Try fenced code block first
    pattern = rf"```{language}\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL | re.IGNORECASE)
    if matches:
        return matches[0].strip()

    # Fallback: any fenced block
    pattern = r"```\w*\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL)
    if matches:
        return matches[0].strip()

    # Last resort: entire response (assume it's just code)
    return response.strip()

# Usage
raw = """
Here's the implementation:
```python
def fibonacci(n: int) -> list[int]:
    a, b = 0, 1
    result = []
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result
```
"""
code = extract_code(raw)

SECTION 03

Syntax validation with AST

Before executing anything, parse the code into an Abstract Syntax Tree. This catches syntax errors without running any code:

import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    '''Returns (is_valid, error_message).'''
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as e:
        return False, f"Syntax error on line {e.lineno}: {e.msg}"

# Example
code = "def foo(x)\n    return x * 2"  # missing colon
valid, error = validate_syntax(code)
print(valid, error)
# False   Syntax error on line 1: expected ':'

You can also walk the AST to detect unsafe patterns without executing:

BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}
BANNED_ATTRS = {"system", "popen"}  # os.system, os.popen

def contains_unsafe_patterns(code: str) -> list[str]:
    tree = ast.parse(code)
    issues = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
                issues.append(f"Banned call: {node.func.id}()")
            if isinstance(node.func, ast.Attribute) and node.func.attr in BANNED_ATTRS:
                issues.append(f"Banned attribute: .{node.func.attr}()")
    return issues
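
For example, a snippet that shells out through os.system is flagged without ever being executed:

issues = contains_unsafe_patterns("import os\nos.system('rm -rf /')")
print(issues)  # ['Banned attribute: .system()']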

SECTION 04

Safe execution sandboxing

For controlled execution, use RestrictedPython or a subprocess with resource limits:

import subprocess, sys, tempfile, os

def execute_sandboxed(code: str, timeout: int = 5) -> tuple[str, str, int]:
    '''
    Execute code in a subprocess with timeout.
    Returns (stdout, stderr, returncode).
    '''
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name

    try:
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", "Execution timed out", 1
    finally:
        os.unlink(tmp_path)

stdout, stderr, rc = execute_sandboxed("print(sum(range(10)))")
print(stdout)   # 45
print(rc)       # 0
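
On POSIX systems you can tighten the subprocess further with OS-level resource limits. A minimal sketch using the standard resource module, with illustrative limit values; it replaces the subprocess.run call inside execute_sandboxed above:

import resource

def _limit_resources():
    # Runs in the child just before exec: cap memory at 256 MB and CPU time at 5 s
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2, 256 * 1024**2))
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))

result = subprocess.run(
    [sys.executable, "-I", tmp_path],  # -I: isolated mode, ignores env vars and user site-packages
    capture_output=True,
    text=True,
    timeout=timeout,
    preexec_fn=_limit_resources,       # POSIX only; not available on Windows
)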

For higher security (production LLM code execution), use Docker containers or cloud sandboxes (AWS Lambda, Firecracker) that enforce filesystem and network isolation at the OS level.

SECTION 05

Semantic validation with tests

Syntax is necessary but not sufficient. Validate the behaviour with test cases:

import types

def validate_with_tests(code: str, test_cases: list[dict]) -> tuple[bool, list[str]]:
    '''
    Run generated code against test cases.
    test_cases: [{"input": ..., "expected": ..., "description": "..."}]
    Returns (all_passed, [failure_messages])
    '''
    # Compile the generated code into an isolated namespace.
    # NOTE: exec() runs the code in-process; in production, run this whole
    # step inside the sandboxed subprocess from Section 04.
    namespace = {}
    exec(compile(code, "<generated>", "exec"), namespace)

    # Grab the function the model defined (assumes one top-level function)
    functions = [v for v in namespace.values() if isinstance(v, types.FunctionType)]
    if not functions:
        return False, ["No function defined in generated code"]
    func = functions[0]

    failures = []
    for tc in test_cases:
        try:
            result = func(*tc["input"]) if isinstance(tc["input"], tuple) else func(tc["input"])
            if result != tc["expected"]:
                failures.append(
                    f"{tc['description']}: expected {tc['expected']!r}, got {result!r}"
                )
        except Exception as e:
            failures.append(f"{tc['description']}: raised {type(e).__name__}: {e}")

    return len(failures) == 0, failures
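
A quick check against the fibonacci function extracted in Section 02:

tests = [
    {"input": 5, "expected": [0, 1, 1, 2, 3], "description": "first five numbers"},
    {"input": 0, "expected": [], "description": "empty input"},
]
passed, failures = validate_with_tests(code, tests)
print(passed, failures)  # True []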

SECTION 06

Retry loop pattern

Tie the three validation stages together in a loop that feeds each failure back to the model as a correction prompt:

import anthropic

client = anthropic.Anthropic()

def generate_and_validate(task: str, test_cases: list[dict], max_retries: int = 3) -> str:
    messages = [{"role": "user", "content": f"Write a Python function that: {task}\nReturn ONLY the function code, no explanation."}]

    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=messages
        )
        code = extract_code(response.content[0].text)

        # Stage 1: syntax
        valid, err = validate_syntax(code)
        if not valid:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code has a syntax error: {err}\nPlease fix it and return ONLY the corrected code."})
            continue

        # Stage 2: safety
        issues = contains_unsafe_patterns(code)
        if issues:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code contains unsafe patterns: {issues}\nPlease rewrite without these."})
            continue

        # Stage 3: tests
        passed, failures = validate_with_tests(code, test_cases)
        if not passed:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code fails these test cases:\n" + "\n".join(failures) + "\nPlease fix the implementation."})
            continue

        return code  # All checks passed

    raise ValueError(f"Failed to generate valid code after {max_retries} attempts")
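
Invoked with an illustrative task description and test case:

code = generate_and_validate(
    "returns the first n Fibonacci numbers as a list",
    test_cases=[{"input": 5, "expected": [0, 1, 1, 2, 3], "description": "first five"}],
)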

SECTION 07

Security checklist

Before deploying any LLM code execution pipeline, verify:

- Generated code runs in an isolated process or container, never in the host interpreter
- Timeouts and CPU/memory limits are enforced on every execution
- Network access is disabled unless the task explicitly requires it
- Filesystem access is restricted to a throwaway working directory
- No secrets or credentials are reachable from the execution environment
- Code is AST-screened for banned calls before it ever runs
- Executions and failures are logged for audit

The AST-based checks above are a defence-in-depth layer, not a replacement for OS-level sandboxing.

Validation Strategy Comparison

Code output validation ensures that LLM-generated code meets functional and safety requirements before execution in production environments. The appropriate validation strategy depends on the risk profile of the use case — code run in a sandbox for exploration has different requirements than code automatically deployed to a production pipeline.

| Strategy | Catches | Misses | Latency | Best For |
|---|---|---|---|---|
| Syntax check (AST parse) | Syntax errors | Logic bugs, security issues | ~1ms | All use cases |
| Static analysis (linting) | Style, common bugs | Runtime errors, logic errors | ~10ms | Code quality gates |
| Sandbox execution | Runtime errors, panics | Logic correctness, security | ~500ms | Auto-run pipelines |
| LLM review | Logic bugs, security | Performance, subtle edge cases | ~1–5s | High-stakes code |
| Unit tests (generated) | Functional correctness | Edge cases without coverage | ~1–10s | Library generation |

Layered validation pipelines combine multiple strategies in sequence. A fast syntax check runs first as a gate; code that passes syntax validation is then linted; code passing lint is executed in a sandbox to check for runtime errors. Only code passing all automated checks proceeds to optional LLM review for logic correctness. This layered approach catches the majority of errors with cheap early-stage checks and reserves expensive validation for code that has already passed simpler gates.
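
One way to wire those stages together, reusing the helpers defined above (the lint stage is omitted here; slot in your linter of choice between the safety screen and the sandbox run):

def layered_validate(code: str) -> tuple[bool, str]:
    '''Run cheap checks first and stop at the first failing stage.'''
    # Stage 1: syntax (fast, in-process, no execution)
    ok, err = validate_syntax(code)
    if not ok:
        return False, f"syntax: {err}"

    # Stage 2: static safety screen (still no execution)
    issues = contains_unsafe_patterns(code)
    if issues:
        return False, f"safety: {issues}"

    # Stage 3: sandboxed run to surface runtime errors
    stdout, stderr, rc = execute_sandboxed(code)
    if rc != 0:
        return False, f"runtime: {stderr.strip()}"

    return True, ""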

For security-critical code generation tasks — generating SQL queries, shell scripts, or network request handlers — static analysis should include security-focused rules such as SQL injection detection, unsafe deserialization patterns, and hardcoded credentials. Tools like Bandit (Python), Semgrep, and CodeQL can be integrated into validation pipelines to scan for known vulnerability patterns before code is executed or reviewed by humans.
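
As an illustration, Bandit can be run over extracted code before execution. A sketch, assuming bandit is installed and relying on its JSON report format:

import json, subprocess, tempfile, os

def bandit_scan(code: str) -> list[str]:
    '''Return the issue descriptions Bandit reports for the given code.'''
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run(
            ["bandit", "-q", "-f", "json", tmp_path],
            capture_output=True, text=True,
        )
        report = json.loads(result.stdout or "{}")
        return [issue["issue_text"] for issue in report.get("results", [])]
    finally:
        os.unlink(tmp_path)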

Test generation as a validation strategy asks the LLM to generate unit tests alongside the code it produces, then executes those tests against the code in a sandbox. If the generated code passes its own generated tests, this provides moderate confidence in functional correctness — although the tests themselves may have gaps. A stronger variant asks a separate LLM invocation to critique the tests for coverage and then augment them with edge cases before running the full suite.
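
A sketch of the basic variant, reusing the client, extract_code, and execute_sandboxed helpers from earlier sections (the prompt wording is illustrative):

def validate_with_generated_tests(code: str, task: str) -> bool:
    '''Ask the model for assert-based tests, then run code + tests in the sandbox.'''
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Here is a function that should: {task}\n\n{code}\n\n"
                "Write plain assert-based tests for it (no test framework). "
                "Return ONLY the test code."
            ),
        }],
    )
    tests = extract_code(response.content[0].text)
    # A failed assert raises in the sandbox and yields a non-zero exit code
    _, stderr, rc = execute_sandboxed(code + "\n\n" + tests)
    return rc == 0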

Streaming validation — checking code incrementally as it is generated rather than waiting for the full output — is possible for structural constraints like syntax and style, but not for behavioral correctness that requires executing the complete function. Streaming syntax checks can abort generation early when a syntax error is detected, saving tokens and reducing latency compared to completing the full generation and then rejecting it. This is particularly valuable for long code generation tasks where errors in the function signature or early structure would invalidate the entire output.
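
Incomplete code fails a naive ast.parse simply because it is truncated, so an incremental check must distinguish "broken" from "not finished yet". One heuristic sketch: treat a SyntaxError as fatal only when it points at a line before the final, still-growing one (imperfect, since unclosed brackets are reported at the opening line):

def stream_syntax_ok(buffer: str) -> bool:
    '''Best-effort check on a partial generation; False means abort early.'''
    try:
        ast.parse(buffer)
        return True
    except SyntaxError as e:
        last_line = buffer.count("\n") + 1
        # An error on the last line may just mean the code is still streaming
        return e.lineno is None or e.lineno >= last_line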