Techniques for reliably extracting, parsing, and executing LLM-generated code while handling syntax errors, unsafe patterns, and test failures.
LLMs generate plausible-looking code. "Plausible-looking" is not the same as "correct" or "safe". Common failure modes:

- **Hallucinated APIs** — e.g. `from sklearn.utils import magical_function`, a function that doesn't exist.
- **Unsafe operations** — `eval()`, `exec()`, `subprocess.call(user_input)`, file deletion.

A validation pipeline catches these issues before the code reaches production or an end user.
LLMs often wrap code in markdown fences, add explanatory prose, or include multiple snippets. Extract reliably:
```python
import re

def extract_code(response: str, language: str = "python") -> str:
    '''Extract the first code block from a markdown-formatted LLM response.'''
    # Try a language-tagged fenced code block first
    pattern = rf"```{language}\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL | re.IGNORECASE)
    if matches:
        return matches[0].strip()
    # Fallback: any fenced block
    pattern = r"```\w*\n(.*?)```"
    matches = re.findall(pattern, response, re.DOTALL)
    if matches:
        return matches[0].strip()
    # Last resort: entire response (assume it's just code)
    return response.strip()
```
````python
# Usage
raw = """
Here's the implementation:

```python
def fibonacci(n: int) -> list[int]:
    a, b = 0, 1
    result = []
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result
```
"""
code = extract_code(raw)
````
Before executing anything, parse the code into an Abstract Syntax Tree. This catches syntax errors without running any code:
```python
import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    '''Returns (is_valid, error_message).'''
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as e:
        return False, f"Syntax error on line {e.lineno}: {e.msg}"

# Example
code = "def foo(x)\n    return x * 2"  # missing colon
valid, error = validate_syntax(code)
print(valid, error)
# False Syntax error on line 1: expected ':'
```
You can also walk the AST to detect unsafe patterns without executing:
```python
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}
BANNED_ATTRS = {"system", "popen"}  # os.system, os.popen

def contains_unsafe_patterns(code: str) -> list[str]:
    tree = ast.parse(code)
    issues = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
                issues.append(f"Banned call: {node.func.id}()")
            if isinstance(node.func, ast.Attribute) and node.func.attr in BANNED_ATTRS:
                issues.append(f"Banned attribute: .{node.func.attr}()")
    return issues
```
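Hallucinated or dangerous imports (the first failure mode above) can be flagged with the same AST walk. A minimal sketch — the deny-list here is illustrative, not exhaustive:

```python
import ast

BANNED_MODULES = {"subprocess", "socket", "ctypes"}  # illustrative deny-list

def flag_banned_imports(code: str) -> list[str]:
    '''Report `import X` and `from X import ...` for deny-listed modules.'''
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            issues.extend(
                f"Banned import: {alias.name}"
                for alias in node.names
                if alias.name.split(".")[0] in BANNED_MODULES
            )
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BANNED_MODULES:
                issues.append(f"Banned import: {node.module}")
    return issues

print(flag_banned_imports("import subprocess\nfrom socket import socket"))
# ['Banned import: subprocess', 'Banned import: socket']
```

Checking against a package index (does this module exist at all?) catches hallucinated imports that a deny-list cannot.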
For controlled execution, use RestrictedPython or a subprocess with resource limits:
```python
import subprocess, sys, tempfile, os

def execute_sandboxed(code: str, timeout: int = 5) -> tuple[str, str, int]:
    '''
    Execute code in a subprocess with timeout.
    Returns (stdout, stderr, returncode).
    '''
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", "Execution timed out", 1
    finally:
        os.unlink(tmp_path)

stdout, stderr, rc = execute_sandboxed("print(sum(range(10)))")
print(stdout)  # 45
print(rc)      # 0
```
For higher security (production LLM code execution), use Docker containers or cloud sandboxes (AWS Lambda, Firecracker) that enforce filesystem and network isolation at the OS level.
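As a sketch of what that looks like with the Docker CLI — assuming Docker and the `python:3.12-slim` image are available; the flags and limit values are illustrative, not a hardened configuration:

```python
def docker_command(script_path: str, timeout: int = 5) -> list[str]:
    '''Build a `docker run` invocation that executes a script with no network,
    a read-only root filesystem, and capped memory/CPU. The container is
    removed on exit; only the script itself is mounted in (read-only).'''
    return [
        "docker", "run", "--rm",
        "--network", "none",      # no network access
        "--read-only",            # read-only root filesystem
        "--memory", "256m",       # cap memory
        "--cpus", "1.0",          # cap CPU
        "-v", f"{script_path}:/sandbox/script.py:ro",
        "python:3.12-slim",
        # `timeout` here is the coreutils command inside the Debian-based image
        "timeout", str(timeout), "python", "/sandbox/script.py",
    ]

# Run with: subprocess.run(docker_command("/tmp/generated.py"), capture_output=True)
```

The command-building function is separated from execution so it can be unit-tested without a Docker daemon.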
Syntax is necessary but not sufficient. Validate the behaviour with test cases:
```python
def validate_with_tests(code: str, test_cases: list[dict]) -> tuple[bool, list[str]]:
    '''
    Run generated code against test cases.
    test_cases: [{"input": ..., "expected": ..., "description": "..."}]
    Returns (all_passed, [failure_messages])
    '''
    # Compile the code into an isolated namespace.
    # NOTE: exec runs the code in-process — combine with the sandboxing
    # above before doing this with untrusted code.
    namespace = {}
    exec(compile(code, "<generated>", "exec"), namespace)
    # Pick the first function the code defined. More robust: prompt the
    # model for a fixed name (e.g. 'solution') and look it up directly.
    func_name = [k for k in namespace if callable(namespace[k])][0]
    func = namespace[func_name]
    failures = []
    for tc in test_cases:
        try:
            result = func(*tc["input"]) if isinstance(tc["input"], tuple) else func(tc["input"])
            if result != tc["expected"]:
                failures.append(
                    f"{tc['description']}: expected {tc['expected']!r}, got {result!r}"
                )
        except Exception as e:
            failures.append(f"{tc['description']}: raised {type(e).__name__}: {e}")
    return len(failures) == 0, failures
```
The full loop ties the three stages together, feeding each failure back to the model as a retry prompt:

```python
import anthropic

client = anthropic.Anthropic()

def generate_and_validate(task: str, test_cases: list[dict], max_retries: int = 3) -> str:
    messages = [{
        "role": "user",
        "content": f"Write a Python function that: {task}\nReturn ONLY the function code, no explanation.",
    }]
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=messages,
        )
        code = extract_code(response.content[0].text)

        # Stage 1: syntax
        valid, err = validate_syntax(code)
        if not valid:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code has a syntax error: {err}\nPlease fix it and return ONLY the corrected code."})
            continue

        # Stage 2: safety
        issues = contains_unsafe_patterns(code)
        if issues:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": f"The code contains unsafe patterns: {issues}\nPlease rewrite without these."})
            continue

        # Stage 3: tests
        passed, failures = validate_with_tests(code, test_cases)
        if not passed:
            messages.append({"role": "assistant", "content": response.content[0].text})
            messages.append({"role": "user", "content": "The code fails these test cases:\n" + "\n".join(failures) + "\nPlease fix the implementation."})
            continue

        return code  # All checks passed

    raise ValueError(f"Failed to generate valid code after {max_retries} attempts")
```
Before deploying any LLM code execution pipeline, verify:

- Execution has a hard timeout.
- Memory and CPU are capped via `resource.setrlimit` or container limits.
- Filesystem and network access are restricted.

The AST-based checks above are a defence-in-depth layer, not a replacement for OS-level sandboxing.
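On POSIX systems, per-process caps can be applied to the child via `subprocess`'s `preexec_fn` hook — a sketch; the limit values are illustrative:

```python
import resource
import subprocess
import sys

def limit_resources(cpu_seconds: int = 5, memory_bytes: int = 1024**3):
    '''Return a preexec_fn that caps CPU time and address space.
    It runs in the child just before exec, so the parent is unaffected.'''
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return set_limits

result = subprocess.run(
    [sys.executable, "-c", "print(sum(range(10)))"],
    capture_output=True, text=True, timeout=10,
    preexec_fn=limit_resources(),  # POSIX only; not available on Windows
)
print(result.stdout)  # → 45
```

Hitting `RLIMIT_CPU` kills the child with SIGKILL, and allocations beyond `RLIMIT_AS` raise `MemoryError` inside it — both surface as a nonzero return code rather than hanging the parent.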
Code output validation ensures that LLM-generated code meets functional and safety requirements before execution in production environments. The appropriate validation strategy depends on the risk profile of the use case — code run in a sandbox for exploration has different requirements than code automatically deployed to a production pipeline.
| Strategy | Catches | Misses | Latency | Best For |
|---|---|---|---|---|
| Syntax check (AST parse) | Syntax errors | Logic bugs, security issues | ~1ms | All use cases |
| Static analysis (linting) | Style, common bugs | Runtime errors, logic errors | ~10ms | Code quality gates |
| Sandbox execution | Runtime errors, panics | Logic correctness, security | ~500ms | Auto-run pipelines |
| LLM review | Logic bugs, security | Performance, subtle edge cases | ~1–5s | High-stakes code |
| Unit tests (generated) | Functional correctness | Edge cases without coverage | ~1–10s | Library generation |
Layered validation pipelines combine multiple strategies in sequence. A fast syntax check runs first as a gate; code that passes syntax validation is then linted; code passing lint is executed in a sandbox to check for runtime errors. Only code passing all automated checks proceeds to optional LLM review for logic correctness. This layered approach catches the majority of errors with cheap early-stage checks and reserves expensive validation for code that has already passed simpler gates.
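The ordering can be expressed as a list of (name, check) stages, each returning an error string or `None`, so a failure in a cheap stage short-circuits the expensive ones. A minimal sketch with only a syntax stage wired in — lint, sandbox, and review stages would slot in after it:

```python
import ast
from typing import Callable, Optional

def check_syntax(code: str) -> Optional[str]:
    '''Cheapest gate: AST parse. Returns an error string or None.'''
    try:
        ast.parse(code)
        return None
    except SyntaxError as e:
        return f"line {e.lineno}: {e.msg}"

Stage = tuple[str, Callable[[str], Optional[str]]]

def validate_layered(code: str, stages: list[Stage]) -> tuple[bool, str]:
    '''Run stages cheapest-first; stop at the first failure.'''
    for name, check in stages:
        error = check(code)
        if error:
            return False, f"{name}: {error}"
    return True, "all stages passed"

ok, msg = validate_layered("def f(:", [("syntax", check_syntax)])
print(ok, msg)
```

Because each stage shares the `str -> Optional[str]` signature, a sandbox-execution stage can wrap `execute_sandboxed` from earlier and return its stderr on failure.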
For security-critical code generation tasks — generating SQL queries, shell scripts, or network request handlers — static analysis should include security-focused rules such as SQL injection detection, unsafe deserialization patterns, and hardcoded credentials. Tools like Bandit (Python), Semgrep, and CodeQL can be integrated into validation pipelines to scan for known vulnerability patterns before code is executed or reviewed by humans.
Test generation as a validation strategy asks the LLM to generate unit tests alongside the code it produces, then executes those tests against the code in a sandbox. If the generated code passes its own generated tests, this provides moderate confidence in functional correctness — although the tests themselves may have gaps. A stronger variant asks a separate LLM invocation to critique the tests for coverage and then augment them with edge cases before running the full suite.
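The execution step of that strategy can be sketched as follows — the generated code and its generated tests are concatenated and run in one subprocess, so a failing assert surfaces as a nonzero exit code (the prompting that produces the tests is omitted here):

```python
import os
import subprocess
import sys
import tempfile

def run_self_tests(code: str, test_code: str, timeout: int = 10) -> bool:
    '''Run generated code plus its generated tests in a subprocess.
    Returns True only if every assert passes (exit code 0).'''
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.returncode == 0
    finally:
        os.unlink(path)

code = "def double(x):\n    return 2 * x"
tests = "assert double(2) == 4\nassert double(-1) == -2"
print(run_self_tests(code, tests))  # → True
```

The same runner serves the stronger variant: feed the critiqued-and-augmented test suite in as `test_code`.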
Streaming validation — checking code incrementally as it is generated rather than waiting for the full output — is possible for structural constraints like syntax and style, but not for behavioral correctness that requires executing the complete function. Streaming syntax checks can abort generation early when a syntax error is detected, saving tokens and reducing latency compared to completing the full generation and then rejecting it. This is particularly valuable for long code generation tasks where errors in the function signature or early structure would invalidate the entire output.
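One way to approximate this: re-parse the buffer each time a line completes, and treat a syntax error as fatal only when it points strictly before the still-growing tail, since errors at the very end may just mean the code is not finished yet. This is a rough heuristic of my own construction, not a complete streaming parser:

```python
import ast

def stream_syntax_check(chunks) -> tuple[bool, str]:
    '''Incrementally check streamed source, aborting on syntax errors that
    further output cannot fix.'''
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if "\n" not in chunk:
            continue  # only re-check once at least one new line has completed
        try:
            ast.parse(buffer)
        except SyntaxError as e:
            completed_lines = buffer.count("\n")
            # An error strictly before the growing tail is unrecoverable.
            if e.lineno is not None and e.lineno < completed_lines:
                return False, f"fatal at line {e.lineno}: {e.msg}"
    return True, "no fatal errors seen"

print(stream_syntax_check(["def f(:\n", "    return 1\n"]))  # aborts on line 1
```

In a real pipeline the generator would call this from the streaming callback and cancel the request on a fatal result.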