System Design

Prompt Regression

Run your golden test suite on every prompt change before shipping. Catch quality regressions before they reach users. The CI/CD of LLM systems.

Before every PR: gate on eval score
Automated CI: no human needed
Catch regressions: before production


SECTION 01

What is prompt regression testing

Prompt regression testing is the practice of automatically running your golden test suite on every prompt change before merging to main. Just as unit tests catch code regressions, prompt regression tests catch quality degradations — cases where a prompt "improvement" inadvertently breaks previously working behavior. Without it, you're deploying blind: your new prompt might score better on the 5 examples you tested manually but worse on the 50 edge cases you forgot about.
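
A golden set for this purpose is often stored as JSONL, one test case per line. The field names below (`id`, `input`, `required_facts`, `forbidden`) are one plausible schema, not a standard — a minimal sketch:

```python
import json

# Hypothetical golden-set entries: each line of the JSONL file is one test
# case with an input, facts the answer must contain, and strings it must not.
golden_cases = [
    {
        "id": "refund-policy-01",
        "input": "What is your refund window?",
        "required_facts": ["30 days"],
        "forbidden": ["60 days", "no refunds"],
    },
    {
        "id": "greeting-01",
        "input": "Hello!",
        "required_facts": [],
        "forbidden": ["error"],
    },
]

with open("golden_sample.jsonl", "w") as f:
    for case in golden_cases:
        f.write(json.dumps(case) + "\n")

# Reading it back mirrors what a regression runner would do
with open("golden_sample.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
print(len(loaded))  # 2
```

Keeping one case per line makes diffs reviewable: adding a test is a one-line change in the PR.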

SECTION 02

CI pipeline setup

import asyncio, json, os
from pathlib import Path

async def run_regression_suite(prompt_file: str, golden_set_file: str) -> dict:
    import openai
    client = openai.AsyncOpenAI()

    with open(prompt_file) as f:
        system_prompt = f.read()

    with open(golden_set_file) as f:
        examples = [json.loads(line) for line in f if line.strip()]

    results = []
    for ex in examples:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": ex["input"]},
            ],
            temperature=0.0,  # deterministic for regression testing
            max_tokens=512,
        )
        output = resp.choices[0].message.content

        # Hard assertions
        facts_present = all(fact.lower() in output.lower() for fact in ex.get("required_facts", []))
        forbidden_absent = not any(f.lower() in output.lower() for f in ex.get("forbidden", []))

        results.append({
            "id": ex["id"],
            "passed": facts_present and forbidden_absent,
            "output": output,
        })

    n_passed = sum(r["passed"] for r in results)
    return {
        "total": len(results),
        "passed": n_passed,
        "pass_rate": n_passed / len(results),
        "failed_ids": [r["id"] for r in results if not r["passed"]],
    }

if __name__ == "__main__":
    result = asyncio.run(run_regression_suite(
        prompt_file="prompts/system_prompt.txt",
        golden_set_file="evals/golden/main.jsonl",
    ))
    print(json.dumps(result, indent=2))
    # Exit with a non-zero code if the pass rate drops below the threshold,
    # so a CI job running this script fails the build
    if result["pass_rate"] < 0.90:
        print(f"REGRESSION: pass rate {result['pass_rate']:.1%} < 90% threshold")
        raise SystemExit(1)

SECTION 03

Regression detection logic

def check_for_regression(
    current_results: dict,
    baseline_results: dict,
    max_allowed_drop: float = 0.05,  # max 5% drop in pass rate
    hard_minimum: float = 0.85,       # absolute floor
) -> tuple[bool, str]:
    current_rate = current_results["pass_rate"]
    baseline_rate = baseline_results["pass_rate"]
    drop = baseline_rate - current_rate

    if current_rate < hard_minimum:
        return False, f"Pass rate {current_rate:.1%} below hard minimum {hard_minimum:.1%}"

    if drop > max_allowed_drop:
        return False, (
            f"Regression detected: {baseline_rate:.1%} -> {current_rate:.1%} "
            f"(drop of {drop:.1%} exceeds max {max_allowed_drop:.1%})"
        )

    # Check for category regressions even if overall is fine
    for cat, stats in current_results.get("by_category", {}).items():
        baseline_cat = baseline_results.get("by_category", {}).get(cat, {})
        if baseline_cat:
            cat_drop = baseline_cat.get("pass_rate", 0) - stats.get("pass_rate", 0)
            if cat_drop > 0.10:  # 10% drop in any single category
                return False, f"Category regression in '{cat}': {cat_drop:.1%} drop"

    return True, f"No regression: {current_rate:.1%} (baseline: {baseline_rate:.1%})"

SECTION 04

Handling non-determinism

LLMs are stochastic — the same prompt can produce different outputs. For regression testing, use temperature=0 wherever possible to make tests reproducible. When temperature=0 isn't appropriate (creative tasks), run each test 3–5 times and use the majority verdict. Some evaluation frameworks (DeepEval, Promptfoo) handle this automatically. Report mean and standard deviation across runs, and only flag a regression when the change exceeds 2 standard deviations from the baseline distribution.
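
The repeated-run approach can be sketched as two small helpers — a majority verdict per test case, and a 2-sigma comparison of suite-level pass rates across runs (the run counts and rates below are illustrative):

```python
import statistics

def stable_verdict(pass_flags: list[bool]) -> bool:
    """Majority verdict across repeated runs of one test case."""
    return sum(pass_flags) > len(pass_flags) / 2

def is_significant_drop(current_rates: list[float], baseline_rates: list[float]) -> bool:
    """Flag a regression only when the mean pass rate falls more than two
    standard deviations below the baseline distribution's mean."""
    baseline_mean = statistics.mean(baseline_rates)
    baseline_std = statistics.stdev(baseline_rates)
    return statistics.mean(current_rates) < baseline_mean - 2 * baseline_std

# e.g. 5 baseline runs of the full suite vs 5 runs with the new prompt
baseline = [0.92, 0.90, 0.91, 0.93, 0.92]
current = [0.84, 0.85, 0.83, 0.86, 0.84]
print(stable_verdict([True, True, False]))     # True: 2 of 3 runs passed
print(is_significant_drop(current, baseline))  # True: drop exceeds 2 sigma
```

Running each suite 3–5 times costs more API calls, so teams often reserve repeated runs for tests that have proven noisy.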

SECTION 05

GitHub Actions integration

# .github/workflows/prompt-regression.yml
name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'       # only run when prompts change
      - 'evals/**'

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install openai pytest

      - name: Run regression suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python evals/run_regression.py \
            --prompt prompts/system_prompt.txt \
            --golden evals/golden/main.jsonl \
            --baseline evals/baselines/main_baseline.json \
            --threshold 0.90 \
            --output regression_results.json

      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('regression_results.json', 'utf8'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '## Regression Results\n' +
                    '- Pass rate: ' + (results.pass_rate * 100).toFixed(1) + '%\n' +
                    '- Status: ' + (results.passed_gate ? 'PASS' : 'FAIL'),
            });

SECTION 06

Alert thresholds

Different tasks warrant different thresholds: a safety-critical or customer-facing assistant might gate at a 95–98% pass rate with near-zero tolerated drop, while an internal drafting tool can live with 85% and a looser gate. Set your thresholds based on the cost of a regression in production, not on what's technically achievable.
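
One way to encode this is a per-task gate table. The task names and numbers below are illustrative assumptions, not recommendations:

```python
# Illustrative per-task gates -- tune each entry to the production
# cost of a regression for that task, not to what the model can hit.
THRESHOLDS = {
    "medical_triage":   {"min_pass_rate": 0.98, "max_drop": 0.01},
    "customer_support": {"min_pass_rate": 0.92, "max_drop": 0.05},
    "internal_drafts":  {"min_pass_rate": 0.85, "max_drop": 0.10},
}

def gate(task: str, pass_rate: float, baseline_rate: float) -> bool:
    """Pass the gate only if above the absolute floor AND within the
    allowed drop from baseline."""
    t = THRESHOLDS[task]
    return pass_rate >= t["min_pass_rate"] and (baseline_rate - pass_rate) <= t["max_drop"]

print(gate("customer_support", 0.93, 0.95))  # True: above floor, drop within 5%
print(gate("medical_triage", 0.95, 0.99))    # False: below the 98% floor
```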

SECTION 07

Gotchas


Prompt regression testing prevents quality degradation when prompts, models, or pipeline configurations change. By maintaining a curated dataset of representative inputs with expected outputs or quality criteria, teams can automatically detect regressions before they reach production — the same principle as unit and integration testing in software engineering, applied to LLM behavior.

Test Type            Assertion Method          Coverage                     Maintenance
Exact match          String equality           Narrow (brittle)             Low
Substring check      Contains expected text    Moderate                     Low
Schema validation    JSON/structure check      Good for structured output   Low
LLM-as-judge         LLM scores output         Broad                        Medium
Human eval baseline  Human approval rate       Broadest                     High

Regression test datasets should evolve alongside the application. When a production failure occurs — the model mishandles a user input in an unexpected way — the triggering example should be added to the regression suite immediately after the fix is deployed. This "fail once, test always" discipline builds a regression dataset that is directly grounded in real failure modes rather than hypothetical test scenarios. Over 6–12 months, this practice produces a highly representative test suite that catches the majority of practically relevant regressions.
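
The "fail once, test always" step can be as simple as appending a case to the golden JSONL after each incident. The schema below mirrors the `required_facts`/`forbidden` fields used earlier in this article; the helper name and file path are assumptions:

```python
import json
from datetime import date

def add_failure_to_golden_set(golden_path, failure_input, required_facts, forbidden, note):
    """Append a production failure to the golden set ("fail once, test always")."""
    entry = {
        "id": f"prod-failure-{date.today().isoformat()}",
        "input": failure_input,
        "required_facts": required_facts,
        "forbidden": forbidden,
        "source": "production_incident",  # lets coverage reports separate real vs synthetic cases
        "note": note,
    }
    # Append-only: existing cases are never modified, so history stays auditable
    with open(golden_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = add_failure_to_golden_set(
    "golden_demo.jsonl",
    failure_input="Cancel my order and refund me",
    required_facts=["refund"],
    forbidden=["cannot help"],
    note="Model refused a valid refund request",
)
```

Tagging the entry with its source makes it easy to report, a year in, what fraction of the suite is grounded in real incidents.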

CI/CD integration for prompt regression testing runs the full test suite on every pull request that modifies prompts, model configurations, or retrieval pipeline parameters. Failing tests block the merge until the regression is investigated and either fixed or explicitly acknowledged with a test expectation update. Tracking pass rate trends over time surfaces slow degradation patterns — where no single change causes a test failure but a series of small changes gradually shifts behavior — which are otherwise invisible until they accumulate into a user-visible quality problem.
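
Slow degradation can be caught with a trivial trend check over the recorded pass-rate history — comparing the latest run to the earliest, not just to the previous one. A minimal sketch (the threshold and history values are illustrative):

```python
def cumulative_drift(pass_rates: list[float], max_total_drop: float = 0.05) -> bool:
    """Return True when the suite has drifted down by more than
    max_total_drop since the earliest recorded run -- the slow-degradation
    pattern that no single per-PR gate would catch."""
    return (pass_rates[0] - pass_rates[-1]) > max_total_drop

# Each step-to-step drop is small enough to pass a per-PR gate,
# but the cumulative drop is 6 points.
history = [0.94, 0.93, 0.93, 0.92, 0.91, 0.90, 0.88]
print(cumulative_drift(history))  # True
```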

Test case diversity is more important than test case volume for prompt regression suites. One hundred tests that all exercise the same input pattern provide little protection against regressions on different query types. A smaller but diverse set of 30–50 tests spanning different query lengths, complexity levels, domain topics, edge cases, and known past failure modes provides better regression coverage. A coverage analysis that maps each test to the behavioral dimension it exercises helps identify gaps in the test suite before those gaps become production failures.
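
A coverage analysis can be done by tagging each test with the dimension it exercises and counting. The tags and dimension names below are hypothetical; a real suite would store them as a field in the golden JSONL:

```python
from collections import Counter

# Hypothetical mapping from test id to the behavioral dimension it exercises
test_dimensions = {
    "refund-01": "policy",
    "refund-02": "policy",
    "long-query-01": "long_input",
    "injection-01": "adversarial",
    "greeting-01": "smalltalk",
}

coverage = Counter(test_dimensions.values())
expected_dimensions = {"policy", "long_input", "adversarial", "smalltalk", "multilingual"}
gaps = expected_dimensions - set(coverage)

print(dict(coverage))
print(sorted(gaps))  # ['multilingual'] -- a dimension with zero tests
```

The gap report turns "our suite feels thin" into a concrete to-do list: write tests for the dimensions with zero (or one) entries first.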

Non-determinism in LLM outputs complicates regression test assertions. A test that checks for exact string equality will fail on correct outputs that are phrased differently from the expected answer. Using multiple acceptable answer variants, semantic similarity thresholds, or LLM-as-judge evaluation with lenient scoring addresses non-determinism without making tests so loose that they fail to catch real regressions. Temperature 0 (deterministic greedy decoding) can be used in test environments to reduce variance, though this may miss regressions that only manifest at production temperature settings.
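
The multiple-acceptable-variants approach is the cheapest of these to implement — a sketch:

```python
def matches_any_variant(output: str, variants: list[str]) -> bool:
    """Pass if the output contains any acceptable answer variant,
    tolerating paraphrase without accepting arbitrary text."""
    out = output.lower()
    return any(v.lower() in out for v in variants)

# Three phrasings of the same correct fact
variants = ["30 days", "thirty days", "one month"]
print(matches_any_variant("Refunds are accepted within thirty days.", variants))  # True
print(matches_any_variant("Refunds are accepted within 60 days.", variants))      # False
```

Variant lists handle paraphrase at the word level; for structurally different but equally correct answers, a semantic-similarity threshold or an LLM judge is still needed.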

Prompt regression suites accumulate over time and require periodic maintenance to remain relevant. Tests written for a previous model version may not meaningfully test the current system if the underlying behavior has changed fundamentally. Quarterly reviews that identify and retire tests with 100% pass rates over 6+ months (stable, well-covered behaviors) and add tests for newly discovered edge cases ensure the suite remains challenging and representative rather than becoming a historical artifact that confirms old capabilities without testing current risks.
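
Identifying retirement candidates is mechanical once per-test pass history is recorded. A sketch, assuming a dict of recent run verdicts per test (the run count threshold is an assumption):

```python
def retirement_candidates(pass_history: dict[str, list[bool]], min_runs: int = 50) -> list[str]:
    """Tests that have passed every one of at least min_runs recent runs
    are candidates for retirement at the quarterly review."""
    return [
        test_id for test_id, runs in pass_history.items()
        if len(runs) >= min_runs and all(runs)
    ]

history = {
    "greeting-01": [True] * 80,          # stable for months: candidate to retire
    "refund-02": [True] * 79 + [False],  # still catching failures: keep
    "new-edge-01": [True] * 10,          # too little history: keep
}
print(retirement_candidates(history))  # ['greeting-01']
```

Retired tests are best archived rather than deleted, so they can be revived when a model upgrade reopens old failure modes.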

Flaky tests in prompt regression suites — tests that pass or fail non-deterministically — erode confidence in the test suite and create alert fatigue when teams learn to ignore intermittent failures. Common causes of flakiness include high-temperature sampling that produces different outputs on each run, date-sensitive prompts that produce different answers as the current date changes, and LLM-as-judge evaluators with high variance on borderline outputs. Fixing flaky tests by using deterministic sampling, date-stable prompts, and calibrated judge thresholds is worth the investment to maintain the suite's reliability as a quality signal.
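
Flaky tests are easy to surface from the same per-test run history: any test whose verdict is mixed across recent runs is neither reliably passing nor reliably failing. A minimal sketch:

```python
def flaky_tests(pass_history: dict[str, list[bool]]) -> list[str]:
    """Tests with inconsistent verdicts across recent runs --
    some passes and some failures -- flagged for investigation."""
    return [t for t, runs in pass_history.items() if 0 < sum(runs) < len(runs)]

history = {
    "date-sensitive-01": [True, False, True, False],  # flaky: fix before trusting
    "refund-01": [True, True, True, True],            # consistently passing
    "broken-01": [False, False, False],               # consistently failing: a real regression
}
print(flaky_tests(history))  # ['date-sensitive-01']
```

Triaging the flaky list against the causes above (temperature, dates, judge variance) is usually enough to pick the right fix.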