System Design

System-Level Eval

End-to-end evaluation of multi-component AI systems — measuring pipeline quality, not just individual model outputs — to catch emergent failure modes that component evals miss.

Components typically evaluated: 5–15
Eval coverage: full pipeline
Cadence: per PR + nightly

SECTION 01

Component vs System Eval

A retrieval component scoring 90% precision and a generation model scoring 85% quality don't guarantee an 80%+ end-to-end system. Errors compound: bad retrieval feeds bad context to generation; correct generation on wrong context still fails. System eval measures the only thing that matters: does the full pipeline produce the right answer for the user?
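The compounding can be made concrete with back-of-envelope arithmetic (illustrative numbers, assuming component failures are independent):

```python
# Illustrative numbers, assuming independent failures across components.
retrieval_precision = 0.90  # share of queries where the right doc is retrieved
generation_quality = 0.85   # share of answers correct *given* good context

# Under independence, end-to-end success is the product of component rates.
end_to_end = retrieval_precision * generation_quality
print(f"{end_to_end:.3f}")  # 0.765: below an 80% target despite two strong components
```

In practice failures are often correlated, so the real end-to-end rate can land above or below this product, which is exactly why it has to be measured directly.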

SECTION 02

Designing System Test Cases

Use a golden set of 100–500 (query, expected_output) pairs covering your top user intents. Include: adversarial queries, out-of-scope requests, ambiguous inputs, multi-turn dialogues, edge cases from production failures. Score with: exact match (structured outputs), LLM judge (free-form), task completion rate (for agentic systems), and human review (10% random sample).
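A minimal sketch of a golden-set case plus an exact-match scorer for structured outputs; the `GoldenCase` dataclass and its field names are assumptions, not a prescribed schema:

```python
import json
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query: str
    expected: str
    scorer: str = "exact"  # "exact" | "llm_judge" | "task_completion"

def score_exact(case: GoldenCase, output: str) -> bool:
    # For structured outputs, compare parsed JSON rather than raw strings,
    # so key order and whitespace don't cause spurious failures.
    try:
        return json.loads(output) == json.loads(case.expected)
    except json.JSONDecodeError:
        return output.strip() == case.expected.strip()

case = GoldenCase(query="refund policy for EU orders",
                  expected='{"policy": "30-day refund"}')
print(score_exact(case, '{"policy":"30-day refund"}'))  # True: same JSON value
```

Free-form cases would route to an LLM judge instead of `score_exact`; the `scorer` field is one way to dispatch per case.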

SECTION 03

Trace-Based Evaluation

Capture full execution traces for every eval run. Each trace records inputs and outputs at each step: retriever (query, docs, scores), reranker (input docs, output docs), generator (context, prompt, response). Trace evaluation lets you pinpoint where quality broke down — retrieval miss vs generation error.

import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalTrace:
    query: str
    steps: list = field(default_factory=list)
    final_output: str = ""
    scores: dict = field(default_factory=dict)

    def add_step(self, name: str, inputs: Any, outputs: Any, latency_ms: float):
        # Record one pipeline step (retriever, reranker, generator, ...)
        self.steps.append({
            "name": name, "inputs": inputs,
            "outputs": outputs, "latency_ms": latency_ms
        })

    def to_json(self) -> str:
        # Serialise the full trace for storage and offline analysis
        return json.dumps({"query": self.query, "steps": self.steps,
                           "final_output": self.final_output, "scores": self.scores})

SECTION 04

Failure Mode Taxonomy

Categorise failures: (1) Retrieval miss — correct answer not in retrieved docs. (2) Context truncation — answer was retrieved but cut off in prompt. (3) Hallucination — model ignores context and fabricates. (4) Format failure — correct content but wrong structure (expected JSON, got prose). (5) Latency SLA breach — correct answer but too slow. Tagging failures by type guides where to invest improvement effort.
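Tagging and aggregation can be as simple as a counter over labelled failures; the tags below are illustrative and would come from a classifier or manual review:

```python
from collections import Counter

# Illustrative: each failed eval case carries one taxonomy tag.
failures = [
    {"id": "q12", "tag": "retrieval_miss"},
    {"id": "q31", "tag": "hallucination"},
    {"id": "q47", "tag": "retrieval_miss"},
    {"id": "q58", "tag": "format_failure"},
]

# Aggregate by category to see where improvement effort should go.
by_type = Counter(f["tag"] for f in failures)
for tag, n in by_type.most_common():
    print(f"{tag}: {n}")
# retrieval_miss dominates in this sample, so retrieval is the first place to invest.
```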

SECTION 05

CI Integration

Run system eval as a required CI check: block merges if top-line pass rate drops >2pp, or if any failure category increases >5pp. Use a fast subset (25–50 cases) for PR checks, full suite nightly. Cache model responses when testing non-model changes to speed up runs.
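A minimal sketch of that merge gate; the thresholds (2pp top-line, 5pp per category) come from the text, while the function and field names are assumptions:

```python
def should_block_merge(baseline: dict, candidate: dict) -> bool:
    # Block if the overall pass rate drops by more than 2 percentage points...
    if baseline["pass_rate"] - candidate["pass_rate"] > 0.02:
        return True
    # ...or if any failure category grows by more than 5 percentage points.
    for category, base_rate in baseline["failure_rates"].items():
        if candidate["failure_rates"].get(category, 0.0) - base_rate > 0.05:
            return True
    return False

baseline = {"pass_rate": 0.91, "failure_rates": {"hallucination": 0.03}}
candidate = {"pass_rate": 0.90, "failure_rates": {"hallucination": 0.09}}
print(should_block_merge(baseline, candidate))  # True: hallucination up 6pp
```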

SECTION 06

Metrics Pyramid

Top: business metric (user CSAT, task completion). Middle: system metric (end-to-end accuracy, hallucination rate). Bottom: component metrics (retrieval recall, generator fluency). Optimise bottom metrics only when they correlate with top — otherwise you're measuring the wrong thing.
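One way to check that correlation before investing in a component metric is a quick Pearson calculation across past eval runs; the numbers below are illustrative, not measured:

```python
# Per-run component metric (retrieval recall) vs system metric (e2e accuracy).
retrieval_recall = [0.70, 0.75, 0.80, 0.85, 0.90]
e2e_accuracy     = [0.62, 0.66, 0.71, 0.74, 0.79]

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, no external dependencies.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(retrieval_recall, e2e_accuracy)
print(f"r = {r:.2f}")  # close to 1.0 here, so recall is worth optimising
```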

End-to-End Test Automation

System evaluation requires comprehensive test case design covering happy paths, edge cases, and failure modes. Test coverage for LLM systems typically includes input variants (language, length, toxicity), system state variations (cold start, cache hit), and multi-turn interactions. Automation frameworks execute thousands of test cases daily, tracking regression through version control.

# System evaluation test harness
class SystemEvaluation:
    def __init__(self, model_endpoint):
        self.endpoint = model_endpoint
    
    def run_test_suite(self, test_cases):
        results = []
        for test in test_cases:
            try:
                response = self.endpoint.query(
                    test['prompt'],
                    temperature=test.get('temperature', 0.7),
                    max_tokens=test.get('max_tokens', 1000)
                )
                
                # Multi-criteria evaluation
                passed = all([
                    self._check_response_length(response, test),
                    self._check_safety(response, test),
                    self._check_factuality(response, test),
                    self._check_output_format(response, test)
                ])
                
                results.append({
                    'test_id': test['id'],
                    'passed': passed,
                    'latency_ms': response.latency,
                    'tokens_used': response.token_count
                })
            except Exception as e:
                results.append({'test_id': test['id'], 'error': str(e)})
        
        return results
| Test Category    | Test Cases | Frequency      | Typical Pass Rate |
|------------------|------------|----------------|-------------------|
| Functionality    | 50–200     | Every commit   | >99%              |
| Regression       | 100–500    | Nightly        | >95%              |
| Edge Cases       | 20–100     | Weekly         | 80–90%            |
| Safety/Jailbreak | 50–200     | Before release | >98%              |

# Comprehensive failure mode taxonomy
FAILURE_TAXONOMY = {
    "hallucination": {
        "description": "Factually incorrect or fabricated information",
        "detection": "Compare to knowledge base or external source",
        "severity": "critical"
    },
    "refusal": {
        "description": "Inappropriate refusal to answer legitimate requests",
        "detection": "Manual review of false positives",
        "severity": "medium"
    },
    "latency": {
        "description": "Response takes >5 seconds",
        "detection": "Automated timeout tracking",
        "severity": "medium"
    },
    "format_violation": {
        "description": "Output doesn't match required schema",
        "detection": "JSON schema validation",
        "severity": "high"
    }
}

Continuous Integration and Deployment Gating

CI/CD pipelines gate model deployments on evaluation metrics. Before production release, models must pass 99%+ of functionality tests and 95%+ of the maintained regression suite, with a safety violation rate under 2%. Staged rollouts with canary deployments to 5–10% of traffic enable real-time monitoring before full release.

Comprehensive system evaluation requires multi-layer test coverage spanning functionality, performance, safety, and user experience. Functional tests verify correct behavior: prompt → response matching expected patterns, specified format schemas, correct tool usage. Performance tests establish baselines: p95 response latency under 500 ms, token generation rate above 50 tokens/second, memory usage under 8 GB under concurrency. Safety tests ensure harm prevention: jailbreak resistance (custom attacks plus a known-exploit database), hallucination rate under 2%, refusal false-positive rate under 5%. User-experience tests measure practical quality: task completion rates in simulated user scenarios, information-retrieval success (answer relevance), dialogue coherence across multi-turn conversations.

Regression suites maintain coverage across model versions: keep 100+ golden tests (hand-curated, high-quality examples) that must never regress. Automated flakiness detection identifies tests that fail randomly due to model stochasticity and excludes them from critical gates. Continuous-integration gates prevent degradation: block deployment if any metric regresses more than 5% from baseline. For LLM systems specifically, evaluation must incorporate domain expertise: medical outputs require clinician review, code outputs need automated testing, legal documents need compliance verification.

Test case design for LLM systems must account for model stochasticity and distribution shift. Golden tests are hand-curated, high-quality examples of core system capabilities: they should never regress and form the smoke-test suite run before any deployment. Regression tests catch known failure modes: historical examples where the model failed, kept in the suite so they don't reappear. Robustness tests probe edge cases: malformed input, boundary conditions, adversarial examples. Interaction tests verify multi-turn correctness: dialogue coherence, context retention across turns, proper information recall. Format tests validate output structure: required fields present, values in valid ranges, JSON/XML schema compliance. Safety tests prevent harmful outputs: prompt-injection attempts, jailbreak attempts, content-policy violations.

Each category has clear pass/fail criteria: format tests are binary, while quality tests often use rubrics (score 1–5). Automated grading uses classifiers trained on human labels; code tests execute and check correctness, factuality tests query knowledge bases, coherence tests use cross-encoder similarity scores. Coverage metrics track test-case count per category, pass rate per category, and regression detection rate. Industrial systems maintain 500+ test cases per product and run the full suite before each deployment (under an hour with parallelisation). Test generation can be automated: use LLMs to generate candidate cases from model documentation, with human curators reviewing and refining them. Continuous testing means running the full suite daily, maintaining a regression database, and alerting on unexpected failures.
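Flakiness handling for stochastic models can be sketched as a repeated-run check that excludes intermittently failing tests from hard gates; `run_test` below is a stand-in for invoking the real system:

```python
import random

def is_flaky(run_test, n_runs: int = 5) -> bool:
    # A test is flaky if identical runs produce a mix of passes and failures.
    results = [run_test() for _ in range(n_runs)]
    return 0 < sum(results) < n_runs

random.seed(0)
stable = lambda: True                  # deterministic test: always passes
flaky = lambda: random.random() > 0.4  # stochastic test: passes ~60% of the time

print(is_flaky(stable))  # False
print(is_flaky(flaky))   # True with this seed
```

Tests flagged this way can still run nightly for tracking; they just stop blocking merges.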

CI/CD integration for LLM systems creates deployment gates based on evaluation metrics. A standard gate configuration: functionality tests pass 99%+, the regression suite passes 95%+, safety tests pass 99.5%+, and performance meets the latency SLA (p95 under 500 ms for typical systems). Canary deployment proceeds in stages: 1% of traffic (1,000–5,000 users) for 12 hours, 5% for 12 hours, 25% for 24 hours, then 100%. At each stage, track business metrics (engagement, retention, satisfaction), performance metrics (latency, throughput, error rate), and safety metrics (content-policy violations, user reports). Automated rollback triggers if any metric regresses more than 5% from baseline; manual approval is required to proceed when any metric shows a concerning trend. Canary analysis compares metrics between the canary and stable groups, controlling for time-of-day and day-of-week effects with covariate balancing.

Common test failures and resolutions: hallucination spike (reduce temperature, retrain), refusal-rate increase (tune instruction tuning, adjust the safety threshold), latency regression (optimise inference, investigate hardware issues). Post-deployment, track all metrics continuously, alert on significant drift, and maintain a dashboard of production system health. Incident response covers documenting failures, root-cause analysis, preventive measures, and updating the test suite to catch similar issues.
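The automated rollback trigger described above reduces to a relative-regression check against the stable baseline; metric names and the helper below are assumptions for illustration:

```python
def should_rollback(stable: dict, canary: dict, tolerance: float = 0.05) -> bool:
    # Revert if any tracked metric regresses >5% relative to stable.
    for metric, base in stable.items():
        # Assumes "higher is better" metrics; error rates would invert the sign.
        if base > 0 and (base - canary[metric]) / base > tolerance:
            return True
    return False

stable = {"task_completion": 0.82, "csat": 4.3}
canary = {"task_completion": 0.75, "csat": 4.2}
print(should_rollback(stable, canary))  # True: completion dropped ~8.5%
```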

Failure analysis and root-cause diagnosis in LLM systems requires systematic investigation of user reports and errors. Automated clustering groups similar failures: syntactic similarity (edit distance) for format errors, semantic similarity (embeddings) for hallucinations. High-frequency clusters receive priority: if 100 users report the model saying "I am not permitted to..." on legitimate requests, that points to overly strict safety filters. Root-cause analysis then investigates training data (the model never saw this topic), model capacity (too small to handle the nuance), prompt engineering (ambiguous instructions), and safety filters (too aggressive). Debugging tools include an introspection mode showing reasoning steps and confidence scores, gradual decoding that tracks each token generation, and attention visualisation showing which input parts influenced the output.

Common failure patterns: (1) information cutoff (the model has no training data post-2023), (2) numerical-reasoning failures (arithmetic, counting, physics), (3) long-context degradation (performance drops on 10K+ token contexts), (4) instruction misinterpretation (ignoring formatting constraints). Prevention measures: synthetic test cases targeting known failure modes, prompt-engineering guides for users (how to get better outputs), and model updates addressing systematic issues. Post-deployment monitoring: sample 1% of outputs for manual review, flag concerning patterns (a new failure type, a spike in error rate), and maintain the failure taxonomy. Feedback loops: high-frequency failures trigger model improvements, documentation updates (warning users about limitations), and filter updates (preventing known jailbreaks).
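Syntactic clustering of failure reports can be sketched with stdlib edit similarity; the greedy single-link grouping and the 0.8 threshold below are illustrative choices, not a prescribed algorithm:

```python
from difflib import SequenceMatcher

def cluster_failures(messages, threshold: float = 0.8):
    # Greedy clustering: join a message to the first cluster whose
    # representative is edit-similar enough, else start a new cluster.
    clusters: list[list[str]] = []
    for msg in messages:
        for cluster in clusters:
            if SequenceMatcher(None, msg, cluster[0]).ratio() >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    # Largest clusters first: these are the failures to triage.
    return sorted(clusters, key=len, reverse=True)

reports = [
    "I am not permitted to answer that request.",
    "I am not permitted to answer this request.",
    "Expected JSON, got prose.",
    "I am not permitted to answer that question.",
]
clusters = cluster_failures(reports)
print(len(clusters[0]))  # 3: the refusal cluster surfaces as the top priority
```

For hallucinations, the same loop would swap `SequenceMatcher` for cosine similarity over embeddings.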

| Eval level       | Scope               | Frequency      | Latency |
|------------------|---------------------|----------------|---------|
| Unit (component) | Single LLM call     | Every commit   | Seconds |
| Integration      | Multi-step pipeline | Every PR merge | Minutes |
| End-to-end       | Full user flow      | Pre-release    | Hours   |
| Regression suite | Known failure cases | Every deploy   | Minutes |