Run your golden test suite on every prompt change before shipping. Catch quality regressions before they reach users. The CI/CD of LLM systems.
Prompt regression testing is the practice of automatically running your golden test suite on every prompt change before merging to main. Just as unit tests catch code regressions, prompt regression tests catch quality degradations — cases where a prompt "improvement" inadvertently breaks previously working behavior. Without it, you're deploying blind: your new prompt might score better on the 5 examples you tested manually but worse on the 50 edge cases you forgot about.
```python
import asyncio
import json

import openai


async def run_regression_suite(prompt_file: str, golden_set_file: str) -> dict:
    client = openai.AsyncOpenAI()

    with open(prompt_file) as f:
        system_prompt = f.read()
    with open(golden_set_file) as f:
        examples = [json.loads(line) for line in f]

    results = []
    for ex in examples:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": ex["input"]},
            ],
            temperature=0.0,  # deterministic for regression testing
            max_tokens=512,
        )
        output = resp.choices[0].message.content

        # Hard assertions (both sides lowercased so the comparison is case-insensitive)
        facts_present = all(
            fact.lower() in output.lower() for fact in ex.get("required_facts", [])
        )
        forbidden_absent = not any(
            f.lower() in output.lower() for f in ex.get("forbidden", [])
        )
        results.append({
            "id": ex["id"],
            "passed": facts_present and forbidden_absent,
            "output": output,
        })

    n_passed = sum(r["passed"] for r in results)
    return {
        "total": len(results),
        "passed": n_passed,
        "pass_rate": n_passed / len(results),
        "failed_ids": [r["id"] for r in results if not r["passed"]],
    }


if __name__ == "__main__":
    result = asyncio.run(run_regression_suite(
        prompt_file="prompts/system_prompt.txt",
        golden_set_file="evals/golden/main.jsonl",
    ))
    print(json.dumps(result, indent=2))

    # Exit with a non-zero code if the pass rate drops below the threshold
    if result["pass_rate"] < 0.90:
        print(f"REGRESSION: pass rate {result['pass_rate']:.1%} < 90% threshold")
        raise SystemExit(1)
```
```python
def check_for_regression(
    current_results: dict,
    baseline_results: dict,
    max_allowed_drop: float = 0.05,  # max 5% drop in pass rate
    hard_minimum: float = 0.85,      # absolute floor
) -> tuple[bool, str]:
    current_rate = current_results["pass_rate"]
    baseline_rate = baseline_results["pass_rate"]
    drop = baseline_rate - current_rate

    if current_rate < hard_minimum:
        return False, f"Pass rate {current_rate:.1%} below hard minimum {hard_minimum:.1%}"

    if drop > max_allowed_drop:
        return False, (
            f"Regression detected: {baseline_rate:.1%} -> {current_rate:.1%} "
            f"(drop of {drop:.1%} exceeds max {max_allowed_drop:.1%})"
        )

    # Check for category regressions even if the overall rate is fine
    for cat, stats in current_results.get("by_category", {}).items():
        baseline_cat = baseline_results.get("by_category", {}).get(cat, {})
        if baseline_cat:
            cat_drop = baseline_cat.get("pass_rate", 0) - stats.get("pass_rate", 0)
            if cat_drop > 0.10:  # 10% drop in any single category
                return False, f"Category regression in '{cat}': {cat_drop:.1%} drop"

    return True, f"No regression: {current_rate:.1%} (baseline: {baseline_rate:.1%})"
```
LLMs are stochastic — the same prompt can produce different outputs. For regression testing, use temperature=0 wherever possible to make tests reproducible. When temperature=0 isn't appropriate (creative tasks), run each test 3–5 times and use the majority verdict. Some evaluation frameworks (DeepEval, Promptfoo) handle this automatically. Report mean and standard deviation across runs, and only flag a regression when the change exceeds 2 standard deviations from the baseline distribution.
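The repeat-and-aggregate approach can be sketched as follows. This is a minimal illustration, not a framework API: `run_once` stands in for any zero-argument callable that executes one evaluation of a test case and returns pass/fail, and the helper names are hypothetical.

```python
import statistics


def run_with_repeats(run_once, n_runs: int = 5) -> dict:
    """Run one test case several times and aggregate the verdicts.

    `run_once` is any zero-argument callable returning True (pass) or
    False (fail) -- a placeholder for a single evaluation of the case.
    """
    verdicts = [bool(run_once()) for _ in range(n_runs)]
    pass_rate = sum(verdicts) / n_runs
    return {
        "passed": pass_rate >= 0.5,  # majority verdict
        "pass_rate": pass_rate,
        "stdev": statistics.pstdev(float(v) for v in verdicts),
    }


def is_regression(current_rate: float, baseline_mean: float, baseline_stdev: float) -> bool:
    # Flag only drops larger than 2 standard deviations of the baseline distribution.
    return (baseline_mean - current_rate) > 2 * baseline_stdev
```

With a baseline mean of 90% and a standard deviation of 5%, a run at 85% would not be flagged, while a run at 75% would.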
```yaml
# .github/workflows/prompt-regression.yml
name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'  # only run when prompts change
      - 'evals/**'

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install openai pytest

      - name: Run regression suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python evals/run_regression.py \
            --prompt prompts/system_prompt.txt \
            --golden evals/golden/main.jsonl \
            --baseline evals/baselines/main_baseline.json \
            --threshold 0.90 \
            --output regression_results.json

      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('regression_results.json'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '## Regression Results\n' +
                '- Pass rate: ' + (results.pass_rate * 100).toFixed(1) + '%\n' +
                '- Status: ' + (results.passed_gate ? 'PASS' : 'FAIL'),
            });
```
Different tasks warrant different thresholds. Set yours based on the cost of a regression in production, not on what's technically achievable: a wrong answer from a medical or financial assistant costs far more than a flat one from a brainstorming tool, and the gates should reflect that asymmetry.
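One lightweight way to encode this is a per-task gate table consumed by the regression checker. The task names and numbers below are hypothetical placeholders, not recommendations:

```python
# Hypothetical per-task gates: stricter where a regression is costlier.
THRESHOLDS = {
    "medical_triage":   {"hard_minimum": 0.98, "max_drop": 0.01},
    "customer_support": {"hard_minimum": 0.90, "max_drop": 0.05},
    "creative_writing": {"hard_minimum": 0.75, "max_drop": 0.10},
}


def gate_for(task: str) -> dict:
    # Fall back to a conservative default for tasks without an explicit gate.
    return THRESHOLDS.get(task, {"hard_minimum": 0.95, "max_drop": 0.02})
```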
Prompt regression testing prevents quality degradation when prompts, models, or pipeline configurations change. By maintaining a curated dataset of representative inputs with expected outputs or quality criteria, teams can automatically detect regressions before they reach production — the same principle as unit and integration testing in software engineering, applied to LLM behavior.
| Test Type | Assertion Method | Coverage | Maintenance |
|---|---|---|---|
| Exact match | String equality | Narrow (brittle) | Low |
| Substring check | Contains expected text | Moderate | Low |
| Schema validation | JSON/structure check | Good for structured output | Low |
| LLM-as-judge | LLM scores output | Broad | Medium (judge calibration) |
| Human eval baseline | Human approval rate | Broadest | High |
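The two cheapest rows of the table can be sketched in a few lines each. These helper names are illustrative, not from any particular framework:

```python
import json


def substring_check(output: str, expected_fragments: list[str]) -> bool:
    # Passes if every expected fragment appears, case-insensitively.
    return all(frag.lower() in output.lower() for frag in expected_fragments)


def schema_check(output: str, required_keys: list[str]) -> bool:
    # Passes if the output parses as a JSON object containing the required keys.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```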
Regression test datasets should evolve alongside the application. When a production failure occurs — the model mishandles a user input in an unexpected way — the triggering example should be added to the regression suite immediately after the fix is deployed. This "fail once, test always" discipline builds a regression dataset that is directly grounded in real failure modes rather than hypothetical test scenarios. Over 6–12 months, this practice produces a highly representative test suite that catches the majority of practically relevant regressions.
CI/CD integration for prompt regression testing runs the full test suite on every pull request that modifies prompts, model configurations, or retrieval pipeline parameters. Failing tests block the merge until the regression is investigated and either fixed or explicitly acknowledged with a test expectation update. Tracking pass rate trends over time surfaces slow degradation patterns — where no single change causes a test failure but a series of small changes gradually shifts behavior — which are otherwise invisible until they accumulate into a user-visible quality problem.
Test case diversity is more important than test case volume for prompt regression suites. One hundred tests that all exercise the same input pattern provide little protection against regressions on different query types. A smaller but diverse set of 30–50 tests spanning different query lengths, complexity levels, domain topics, edge cases, and known past failure modes provides better regression coverage. A coverage analysis that maps each test to the behavioral dimension it exercises helps identify gaps in the test suite before those gaps become production failures.
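A coverage analysis of this kind can be a short script over tagged test cases. This sketch assumes each golden-set record carries a `tags` list naming the behavioral dimensions it exercises, which is a convention you would have to adopt, not a standard field:

```python
from collections import Counter


def coverage_report(test_cases: list[dict], dimensions: list[str]) -> dict:
    """Count tests per behavioral dimension and list dimensions with no coverage."""
    counts = Counter()
    for case in test_cases:
        for tag in case.get("tags", []):
            counts[tag] += 1
    gaps = [d for d in dimensions if counts[d] == 0]
    return {"counts": dict(counts), "gaps": gaps}
```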
Non-determinism in LLM outputs complicates regression test assertions. A test that checks for exact string equality will fail on correct outputs that are phrased differently from the expected answer. Using multiple acceptable answer variants, semantic similarity thresholds, or LLM-as-judge evaluation with lenient scoring addresses non-determinism without making tests so loose that they fail to catch real regressions. Temperature 0 (deterministic greedy decoding) can be used in test environments to reduce variance, though this may miss regressions that only manifest at production temperature settings.
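The multiple-variants approach is the simplest of these to implement: normalize case and whitespace, then accept the output if it matches any listed phrasing. A minimal sketch (the function name is illustrative):

```python
def matches_any_variant(output: str, variants: list[str]) -> bool:
    # Collapse whitespace and lowercase before comparing against each variant,
    # so formatting differences alone never fail the test.
    norm = " ".join(output.lower().split())
    return any(" ".join(v.lower().split()) == norm for v in variants)
```

Semantic-similarity thresholds extend the same idea when paraphrases are too numerous to enumerate, at the cost of an embedding call per assertion.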
Prompt regression suites accumulate over time and require periodic maintenance to remain relevant. Tests written for a previous model version may not meaningfully test the current system if the underlying behavior has changed fundamentally. Quarterly reviews that identify and retire tests with 100% pass rates over 6+ months (stable, well-covered behaviors) and add tests for newly discovered edge cases ensure the suite remains challenging and representative rather than becoming a historical artifact that confirms old capabilities without testing current risks.
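The retirement half of that quarterly review is mechanical if you keep per-test pass/fail history. This sketch assumes an in-house record mapping test ids to accumulated CI verdicts; the threshold of 50 runs is an arbitrary stand-in for "6+ months of history":

```python
def retirement_candidates(history: dict[str, list[bool]], min_runs: int = 50) -> list[str]:
    """Return ids of tests that have passed on every recorded run.

    `history` maps test id -> chronological list of pass/fail verdicts.
    Tests with fewer than `min_runs` recorded runs are kept regardless.
    """
    return [
        tid for tid, runs in history.items()
        if len(runs) >= min_runs and all(runs)
    ]
```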
Flaky tests in prompt regression suites — tests that pass or fail non-deterministically — erode confidence in the test suite and create alert fatigue when teams learn to ignore intermittent failures. Common causes of flakiness include high-temperature sampling that produces different outputs on each run, date-sensitive prompts that produce different answers as the current date changes, and LLM-as-judge evaluators with high variance on borderline outputs. Fixing flaky tests by using deterministic sampling, date-stable prompts, and calibrated judge thresholds is worth the investment to maintain the suite's reliability as a quality signal.
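Flakiness can be detected the same way it is fixed: run the test repeatedly and check whether the verdicts disagree. A minimal sketch, where `run_once` again stands in for one evaluation of the test case:

```python
def detect_flaky(run_once, n_runs: int = 10) -> bool:
    """A test is flaky if repeated runs produce both passing and failing verdicts."""
    verdicts = {bool(run_once()) for _ in range(n_runs)}
    return len(verdicts) > 1
```

Running this sweep periodically over the whole suite, and quarantining any test it flags until the underlying cause is fixed, keeps intermittent failures from training the team to ignore red builds.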