Online Evaluation

Why Online Eval?
Signal Sources
Lightweight Judges
Sampling Strategies
Alerting & Dashboards
Practical Patterns

SECTION 01

Why Online Eval?

Offline evals catch known failure modes; online eval catches unknown ones in production traffic. Every request is a potential signal: did the user copy the answer? Did they thumbs-up or follow up with 'that's wrong'? Did the latency spike correlate with quality drops? Online eval turns your production system into a continuous evaluation harness.

SECTION 02

Signal Sources

Implicit signals come from user behaviour: regeneration clicks, copy-paste actions, session abandonment, follow-up corrections ('actually…'), and downstream task completion. Explicit signals are thumbs up/down, star ratings, or CSAT surveys. Automated signals come from lightweight classifiers run on every response: a toxicity filter, a relevance scorer, a hallucination detector. Each signal type has different coverage and noise levels; combine them for robust monitoring.

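Example implicit signal collector:
import time

from fastapi import Request

# NOTE: `db` is assumed to be an application-provided async storage client exposing
# store_response, get_response and store_signal; it is not defined in this snippet.

# Heuristic weight per user action, in [-1, 1]; tune these against human labels.
ACTION_SCORES = {"copy": 0.8, "regenerate": -0.5, "thumbs_up": 1.0, "thumbs_down": -1.0, "abandon": -0.3}


async def track_response(request: Request, response_id: str, content: str):
    # Store the response so later user actions can be correlated with it
    await db.store_response(response_id, {
        "content": content,
        "timestamp": time.time(),
        "session_id": request.headers.get("X-Session-Id"),
    })


async def track_user_action(response_id: str, action: str):
    # action: 'copy', 'regenerate', 'thumbs_up', 'thumbs_down', 'abandon'
    resp = await db.get_response(response_id)
    if resp is None:  # assumes get_response returns None for an unknown id
        return
    # Unrecognised actions are stored with a neutral score of 0
    await db.store_signal(response_id, action, ACTION_SCORES.get(action, 0))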

SECTION 03

Lightweight Judges

An LLM-as-judge that runs on every request must be cheap: use a small model (GPT-4o-mini, Haiku) with a tight prompt. Score 1–3 on a single axis per call. Run asynchronously so latency isn't affected.

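import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
# NOTE: `db` is the same storage client used by the signal collector;
# it is assumed to be provided by the application.


async def judge_relevance(query: str, response: str) -> float:
    # Ask a small model for a single 1-3 relevance rating
    result = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate relevance of this response to the query on a scale 1-3.\n"
                f"Query: {query}\nResponse: {response[:500]}\n"
                "Reply with just the number."
            )
        }],
        max_tokens=5,
        temperature=0,
    )
    try:
        rating = float(result.choices[0].message.content.strip())
    except (AttributeError, ValueError):
        return 0.5  # fallback when the reply is empty or not a number
    # Clamp to the 1-3 scale, then normalise to roughly 0.33-1.0
    return min(max(rating, 1.0), 3.0) / 3.0


async def eval_async(query: str, response: str, response_id: str):
    score = await judge_relevance(query, response)
    await db.store_signal(response_id, "judge_relevance", score)


def schedule_eval(query: str, response: str, response_id: str) -> asyncio.Task:
    # Run the judge in the background so the serving path does not wait on it;
    # the caller should keep a reference to the task until it finishes.
    return asyncio.create_task(eval_async(query, response, response_id))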

SECTION 04

Sampling Strategies

Evaluating every single request with an LLM judge is costly. Use stratified sampling: always evaluate flagged requests (toxicity triggered, very long, low-confidence), randomly sample 5–10% of normal traffic, and oversample new prompt versions or A/B test variants. Reservoir sampling ensures you don't bias toward early-session requests.
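
The sampling policy described here can be sketched as a per-request decision function plus a fixed-size reservoir. This is a minimal sketch, not a tuned implementation: the `RequestMeta` fields, the 5% base rate, and the 50% rate for new variants are illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class RequestMeta:
    """Hypothetical per-request metadata; the field names are assumptions."""
    flagged: bool = False       # e.g. toxicity filter fired, very long, low confidence
    new_variant: bool = False   # served by a new prompt version or A/B variant


def should_evaluate(meta: RequestMeta, rng: random.Random,
                    base_rate: float = 0.05, variant_rate: float = 0.5) -> bool:
    """Decide whether this request is sent to the LLM judge."""
    if meta.flagged:
        return True  # flagged requests are always evaluated
    rate = variant_rate if meta.new_variant else base_rate
    return rng.random() < rate


class Reservoir:
    """Algorithm R: keeps a uniform random sample of k items from a stream."""

    def __init__(self, k: int, rng: random.Random):
        self.k = k
        self.rng = rng
        self.items = []
        self.seen = 0

    def offer(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.k:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.k:
                self.items[j] = item
```

Keeping one reservoir per session, or per time bucket, gives every request in that scope the same chance of being kept regardless of when it arrived.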

SECTION 05

Alerting & Dashboards

Track rolling averages over 1-hour and 24-hour windows. Alert when: judge score drops >10% vs baseline, thumbs-down rate exceeds 15%, or implicit abandon rate spikes. Use a time-series DB (InfluxDB, Prometheus) and a dashboard (Grafana) with separate panels for each signal type. Correlate quality signals with deployment events — a sudden drop after a deploy is a rollback trigger.
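
A rolling window and the ">10% drop vs baseline" rule might look like the sketch below. In practice the time-series database would compute the window; the in-memory `RollingMean` class and the `judge_score_alert` helper are hypothetical names used only for illustration.

```python
from collections import deque


class RollingMean:
    """Mean of (timestamp, value) samples from the last `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.samples = deque()
        self.total = 0.0

    def add(self, ts: float, value: float) -> None:
        self.samples.append((ts, value))
        self.total += value
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop samples that are a full window old or older
        while self.samples and self.samples[0][0] <= now - self.window_s:
            _, old = self.samples.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.samples) if self.samples else None


def judge_score_alert(current: float, baseline: float, max_drop: float = 0.10) -> bool:
    """True when the current mean is more than `max_drop` (relative) below baseline."""
    return current < baseline * (1.0 - max_drop)
```

One instance with `window_s=3600` and another with `window_s=86400` would cover the 1-hour and 24-hour windows.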

SECTION 06

Practical Patterns

Decouple eval from serving: write to a queue (Kafka, SQS), consume asynchronously. Store raw responses for retrospective eval with newer judges. Use shadow mode: run a new judge in parallel without acting on results, validate it matches human labels before trusting it for alerting. Periodically send 1% of production traffic to human reviewers for ground-truth calibration.
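
The decoupling pattern can be illustrated in-process with `asyncio.Queue` standing in for Kafka or SQS. This is only a sketch of the shape: `fake_judge`, `serve`, and `eval_worker` are hypothetical names, and a real deployment would put a durable queue between separate services.

```python
import asyncio


async def fake_judge(query: str, response: str) -> float:
    """Stand-in for a real judge call; always returns the same score."""
    await asyncio.sleep(0)
    return 1.0


async def eval_worker(queue: asyncio.Queue, results: dict) -> None:
    """Consume eval jobs until cancelled; failures never reach the serving path."""
    while True:
        response_id, query, response = await queue.get()
        try:
            results[response_id] = await fake_judge(query, response)
        except Exception:
            results[response_id] = None  # record the failure and keep consuming
        finally:
            queue.task_done()


async def serve(queue: asyncio.Queue, response_id: str, query: str) -> str:
    """Serving path: produce the answer, enqueue the eval job, return immediately."""
    response = f"answer to {query}"  # placeholder for the real model call
    try:
        queue.put_nowait((response_id, query, response))
    except asyncio.QueueFull:
        pass  # drop the eval job rather than block the user
    return response


async def main() -> dict:
    queue = asyncio.Queue(maxsize=1000)
    results = {}
    worker = asyncio.create_task(eval_worker(queue, results))
    await serve(queue, "r1", "q1")
    await serve(queue, "r2", "q2")
    await queue.join()  # demo only; the serving path never waits on eval
    worker.cancel()
    return results
```

Dropping jobs when the queue is full is a deliberate choice in this sketch: losing an eval sample is cheaper than adding latency to a user request.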

Real-Time Feedback Integration

Online evaluation systems continuously collect implicit and explicit signals from users. Implicit signals (latency, retry rate, click-through) provide volume at scale, while explicit ratings (thumbs up/down, detailed surveys) offer high-quality labels on smaller subsets. A multi-armed bandit can use either kind of signal as its reward when allocating traffic between variants, trading exploration against exploitation.

Statistical Significance Testing

Online experiments require proper statistical rigor. Detecting a 1-2% improvement typically requires 100K-1M samples per variant, depending on the baseline rate and the metric's variance. The sequential probability ratio test (SPRT) allows early stopping while holding the false positive rate at a chosen level such as 5%, which shortens iteration cycles.
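
As an illustration of early stopping, here is a minimal sketch of Wald's SPRT for a stream of 0/1 outcomes, testing H0: p = p0 against H1: p = p1. It is a one-sample version chosen for brevity; comparing two live variants needs a two-sample formulation. The function name and the default error rates are assumptions, and Wald's thresholds control the error rates only approximately.

```python
import math


def sprt_bernoulli(observations, p0: float, p1: float,
                   alpha: float = 0.05, beta: float = 0.20):
    """Wald's SPRT for H0: p = p0 against H1: p = p1 on a 0/1 stream.

    Returns (decision, n_used), where decision is "accept_h1", "accept_h0",
    or "continue" if the stream ended before a boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n
```

With `p0=0.5` and `p1=0.6`, an unbroken run of successes crosses the upper boundary after 16 observations, far fewer than a fixed-size test would need.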

Implementing online evaluation means balancing real-time feedback against statistical rigor. Feedback ingestion handles high-volume signals (millions per day) through asynchronous pipelines built on message queues (Kafka, RabbitMQ). Real-time aggregation in an analytical or time-series database (ClickHouse, TimescaleDB) lets dashboards refresh within seconds. Multi-armed bandit algorithms (Thompson sampling, upper confidence bound) manage the exploration-exploitation tradeoff adaptively; a simpler fixed policy sends about 95% of traffic to the current best variant and reserves 5% for challengers. Detecting a 1% improvement typically requires 500K-1M impressions per variant, which a high-traffic system can collect in 1-3 days. Significance testing must account for multiple comparisons (for example, with false discovery rate control) when many variants are tested at once. Canary deployments send 1-5% of traffic to a new model and monitor metrics for about 24 hours before full rollout, with rollback mechanisms for instant reversion if metrics degrade. For low-traffic products with fewer than 10K daily users, batch evaluation and weekly experiments provide better statistical power than continuous online testing.

More advanced architectures use bandits and adaptive sampling. Thompson sampling maintains a Bayesian posterior over each variant's performance, draws one sample per variant from those posteriors, and serves the variant with the highest draw, which balances exploration (gathering information) against exploitation (using the best known option). Upper confidence bound (UCB) algorithms apply optimism under uncertainty: they serve the variant whose confidence interval has the highest upper bound. Contextual bandits extend this to personalization by tracking variant performance per user segment (device type, location, language) and allocating traffic adaptively within each segment. Regret formalizes the cost of exploration: standard multi-armed bandit algorithms incur O(log n) regret over n decisions, while linear contextual bandits incur regret on the order of d * sqrt(n) for d-dimensional contexts. Implementation details matter: batched updates (for example, 10K samples at a time) are easier to operate than per-sample updates and are often more stable. Power calculations determine the required sample size: detecting a 1% lift requires roughly 500K samples per variant at 80% power and a 5% significance level, with the exact number depending on the baseline rate. Sequential tests such as the SPRT (Sequential Probability Ratio Test) stop experiments early while controlling the false positive rate. Bandits can also outperform fixed-split A/B tests by moving traffic from losers to winners earlier; gains on the order of 5-15% are reported, though this varies by application.
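
For a binary reward such as thumbs-up, Thompson sampling reduces to a Beta-Bernoulli update. The sketch below assumes a uniform Beta(1, 1) prior and per-sample updates; the class name and the `simulate` helper are hypothetical and exist only to show the traffic shift.

```python
import random


class BetaBernoulliThompson:
    """Thompson sampling over variants with 0/1 rewards and Beta(1, 1) priors."""

    def __init__(self, n_arms: int, rng: random.Random):
        self.rng = rng
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def select(self) -> int:
        # Draw one sample from each arm's posterior and serve the highest draw
        draws = [self.rng.betavariate(1 + s, 1 + f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm: int, reward: bool) -> None:
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, n_rounds: int, seed: int = 0):
    """Run the bandit against fixed success rates; returns pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliThompson(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = bandit.select()
        reward = rng.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls
```

Calling `simulate([0.2, 0.8], 2000)` should show most pulls going to the second arm, which is the reallocation from losers to winners described above.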

Integrating online evaluation with the ML system creates a feedback loop for continuous improvement. A model versioning system manages canary deployments: a new model serves 1% of traffic initially and is promoted through staged rollouts (5%, 25%, 100%) as long as metrics hold up. Automated rollback triggers on key metric regressions: if latency increases by more than 10% or the error rate exceeds 2%, revert to the previous version immediately. Learning from online signals involves aggregating them, filtering noise, and separating causal effects from correlations. Confounders must be controlled: time-of-day effects, seasonal patterns, external events. A/A tests (identical variants) measure the noise floor: at a 5% significance level, about 5% of metrics will show a significant difference by chance alone. Bonferroni correction for multiple comparisons reduces false positives at the cost of more false negatives. Online evaluation complements offline metrics: offline metrics measure accuracy on held-out data, while online metrics measure business outcomes (engagement, satisfaction, retention). A gap between the two often reveals a model issue: high offline accuracy with low online engagement suggests that the model optimizes for the wrong objective or does not generalize to the production distribution.
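
The tradeoff between Bonferroni and false discovery rate control is easiest to see side by side. The sketch below implements Bonferroni and the Benjamini-Hochberg procedure on a list of p-values; the function names are illustrative.

```python
def bonferroni(p_values, alpha: float = 0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * alpha
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected
```

For the p-values `[0.001, 0.012, 0.025, 0.045, 0.5]`, Bonferroni rejects only the first hypothesis while Benjamini-Hochberg rejects the first three: fewer false negatives, in exchange for controlling the expected share of false discoveries rather than the chance of any.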

Statistical power analysis determines the sample size needed for reliable conclusions. Detecting a 1% lift requires roughly 500K impressions per variant (80% power, 5% significance level). At 1,000 samples per day, collecting 500K samples takes 500 days for a single variant and about 1,000 days for a two-variant test, far too long for rapid iteration. Practical options: power the experiment for the primary metric only, concentrate traffic on the top-N variants so each one accumulates samples faster, or use Bayesian methods, which can incorporate prior information when samples are scarce. Synthetic control methods use historical data to estimate the counterfactual, comparing the current variant against a model of "what would have happened without the change"; explaining variance through historical patterns can improve power substantially, by 50-200% in favorable cases. Stratified analysis segments users (new vs returning, device type, geography) and runs separate analyses to see whether a variant helps some segments more than others. Such heterogeneous treatment effects make it possible to target improvements at receptive segments. Time-series analysis accounts for time-of-day effects, day-of-week effects (Monday differs from Friday, weekends from weekdays), and seasonal effects; detrending these known patterns improves power. Multi-metric analysis needs proper correction: each metric has a 5% false positive rate by chance, so the risk compounds across metrics unless a Bonferroni or FDR correction is applied. A practical framework: a primary metric (blocks deployment), secondary metrics (informational), and guardrail metrics (must not regress by more than 2%).
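
The sample-size figures can be checked with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a binary metric, a two-sided test, and a lift expressed relative to the baseline rate; the function names and the 25% baseline in the usage note are illustrative assumptions.

```python
import math
from statistics import NormalDist


def samples_per_variant(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant: int, daily_samples: int, n_variants: int = 2) -> int:
    """Days needed when daily traffic is split evenly across the variants."""
    return math.ceil(n_per_variant * n_variants / daily_samples)
```

With a 25% baseline and a 1% relative lift, `samples_per_variant(0.25, 0.01)` comes out a little under 500K, in line with the figure quoted here; `days_to_run(500_000, 1_000)` gives 1,000 days for a two-variant test. Lower baselines or smaller lifts push the requirement up quickly.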

Real-time metrics computation at scale requires specialized infrastructure. Stream processors such as Kafka Streams or Flink, reading from Kafka topics, compute metrics continuously over time windows of a minute, an hour, or a day. Alert thresholds are set dynamically from recent history rather than as static values; a common rule alerts when a metric falls below the recent mean minus 3 standard deviations. Multiple window sizes catch different anomalies: 5-minute windows catch spikes, hourly windows catch sustained degradation. Dashboards show time-series charts, confidence intervals, and statistical test results. Automated diagnosis correlates a metric change with a recent deployment or traffic spike. To reduce alert fatigue, route alerts to the right owner, deduplicate repeats, and define escalation paths. Incident response integration creates tickets automatically and notifies on-call engineers. Post-mortem analysis compares metrics before and after the incident and identifies improvements. Business metrics such as revenue take priority; technical metrics such as latency serve as diagnostics.
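
The "recent mean minus 3 standard deviations" rule is small enough to sketch directly. The function names are hypothetical, and the history is assumed to be a list of recent window values for a metric where lower is worse, such as a judge score.

```python
from statistics import mean, stdev


def dynamic_lower_threshold(history, k: float = 3.0) -> float:
    """Alert threshold: recent mean minus k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two points to estimate spread")
    return mean(history) - k * stdev(history)


def is_anomalous(value: float, history, k: float = 3.0) -> bool:
    """True when the new value falls below the dynamic threshold."""
    return value < dynamic_lower_threshold(history, k)
```

Running the same check over a 5-minute history and an hourly history gives the two sensitivities mentioned above: the short window reacts to spikes, the long one to sustained degradation.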

Signal Type	Volume/day	Latency	Noise Level
Page Load Time	1M+	Immediate	Low
User Thumbs Up/Down	10K	Immediate	Medium
Task Completion Rate	100K	Hours	Medium
Expert Rating (Batch)	100-1K	Days	Low

Signal type	Coverage	Latency	Example
Implicit (click, copy)	High (~100%)	Real-time	User copied response text
Explicit (thumbs up/down)	Low (1–5%)	Real-time	User rated response
LLM judge (sampled)	Configurable	Minutes	Automated faithfulness score
Human review (sampled)	Very low (<1%)	Hours–days	Expert quality review