DeepEval

Why DeepEval
Core metrics
Writing eval tests
RAG evaluation pipeline
Custom metrics
CI/CD integration
Gotchas

SECTION 01

Why DeepEval

Evaluating LLM outputs is hard because there's rarely a ground-truth string to compare against. DeepEval solves this by using LLMs as judges — measuring properties like factual accuracy, contextual relevance, and hallucination through carefully engineered evaluation prompts.

It follows a pytest-like pattern: write test cases, define pass/fail thresholds, and run evals as part of your CI/CD pipeline. When a model update causes answer quality to regress, the eval suite catches it before it reaches production.

pip install deepeval
deepeval login  # Optional: connect to Confident AI dashboard

SECTION 02

Core metrics

DeepEval provides metrics for common LLM failure modes:

AnswerRelevancyMetric: Does the answer actually address the question?
FaithfulnessMetric: Does the answer stay within the provided context (no hallucination)?
ContextualPrecisionMetric: Are retrieved chunks relevant to the question?
ContextualRecallMetric: Does the retrieved context contain enough information to answer?
HallucinationMetric: Does the answer contain facts not present in the context?
BiasMetric: Is the response politically or socially biased?
ToxicityMetric: Does the response contain harmful language?
GEval: Custom evaluation criteria defined in natural language.

All metrics use LLM-as-judge under the hood (default: GPT-4o), returning a float score 0–1 with an explanation string.

SECTION 03

Writing eval tests

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_answer_quality():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",  # optional ground truth
        retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
    )
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini"),
        FaithfulnessMetric(threshold=0.8, model="gpt-4o-mini"),
    ])

# Run with: deepeval test run test_rag.py
# Or with pytest: pytest test_rag.py -v

Tests fail if any metric score falls below its threshold. The failure message includes the metric's reasoning, helping you understand why the model underperformed.

SECTION 04

RAG evaluation pipeline

from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval import evaluate

# Create test cases from your golden dataset
dataset = EvaluationDataset()
for item in golden_dataset:
    # Run your RAG pipeline to get actual outputs + retrieved context
    retrieved_chunks, answer = rag_pipeline(item["question"])
    dataset.add_test_case(LLMTestCase(
        input=item["question"],
        actual_output=answer,
        expected_output=item["expected_answer"],
        retrieval_context=retrieved_chunks,
    ))

# Define metrics
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    ContextualPrecisionMetric(threshold=0.6),
    ContextualRecallMetric(threshold=0.7),
]

# Evaluate all test cases
results = evaluate(dataset, metrics)
print(f"Overall pass rate: {results.overall_pass_rate:.1%}")

SECTION 05

Custom metrics

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval: define evaluation criteria in natural language
code_quality_metric = GEval(
    name="CodeCorrectness",
    criteria="Determine if the generated code is syntactically correct and implements the described functionality without bugs.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,        # the problem description
        LLMTestCaseParams.ACTUAL_OUTPUT, # the generated code
    ],
    threshold=0.7,
    model="gpt-4o",
)

# Custom metric class
from deepeval.metrics import BaseMetric

class SQLValidityMetric(BaseMetric):
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.name = "SQL Validity"

    def measure(self, test_case: LLMTestCase) -> float:
        sql = test_case.actual_output
        try:
            import sqlparse
            sqlparse.parse(sql)[0]  # raises if invalid
            self.score = 1.0
            self.reason = "Valid SQL syntax"
        except Exception as e:
            self.score = 0.0
            self.reason = f"Invalid SQL: {e}"
        return self.score

    def is_successful(self) -> bool:
        return self.score >= self.threshold

SECTION 06

CI/CD integration

# .github/workflows/eval.yml
name: LLM Eval Suite
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install deepeval openai
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: |
          deepeval test run tests/eval_suite.py             --exit-on-first-failure             --max-concurrent 5

With Confident AI (DeepEval's hosted dashboard), results are stored and compared across runs automatically. You can set regression alerts: if overall pass rate drops by >5% compared to the previous run, the CI fails.

SECTION 07

Gotchas

Evaluation cost: DeepEval uses GPT-4o-mini or GPT-4o as the judge. Evaluating 100 test cases with 4 metrics = ~400 LLM calls. At $0.15/1M input tokens (GPT-4o-mini), a 100-case suite with 500-token contexts costs ~$0.30. Still, budget this into your CI costs.

Judge model bias: LLM judges have their own biases (verbose answers score higher, etc.). Calibrate by running the same test case multiple times — variance >0.2 suggests the metric is unreliable for that type of question.

Metric interdependence: FaithfulnessMetric measures whether the answer is supported by context. AnswerRelevancyMetric measures whether the answer addresses the question. A model can score high on faithfulness but low on relevancy (accurately copying irrelevant context). Use both.

Async execution: Set max_concurrent to control parallel judge calls. Too high and you hit rate limits; too low and evals are slow. Start at 5–10 for GPT-4o-mini.

DeepEval Metric Suite Overview

DeepEval is a Python-based LLM evaluation framework that provides a comprehensive set of pre-built metrics for RAG, agent, and general LLM application testing. It integrates with pytest, enabling LLM quality checks as part of standard test suites with pass/fail thresholds and detailed failure reporting.

Metric	Evaluates	Threshold Type	LLM Judge
AnswerRelevancy	Answer addresses the input	Score 0–1	Yes
Faithfulness	Answer grounded in context	Score 0–1	Yes
ContextualPrecision	Retrieved context relevance	Score 0–1	Yes
Hallucination	Factual accuracy	Score 0–1	Yes
BiasMetric	Opinion/bias presence	Score 0–1	Yes
ToxicityMetric	Harmful content	Score 0–1	Via classifier

DeepEval's pytest integration allows embedding LLM quality checks directly into CI/CD pipelines using familiar Python testing patterns. A test case defines the input, actual output, and optional expected output and retrieval context; metric assertions specify the minimum acceptable score for each quality dimension. When a metric falls below threshold, the test fails with a human-readable explanation of what the judging LLM found problematic, making it straightforward to diagnose and fix the underlying issue rather than just knowing a score threshold was breached.

DeepEval Confident AI is the hosted companion service that provides a web dashboard for visualizing evaluation results, running batch evaluations, and managing evaluation datasets without requiring local infrastructure. For teams without dedicated MLOps resources, the hosted option enables production quality monitoring with minimal setup — instrumentation is added to the application, evaluation runs are submitted to the hosted service, and quality trends are visible in the dashboard without maintaining evaluation infrastructure.

DeepEval's test case synthesis feature generates additional test cases from existing examples by applying perturbation techniques: paraphrasing the input, introducing typos, changing numerical values, negating claims, or adding irrelevant context. This synthetic augmentation expands coverage of the evaluation dataset without requiring additional human annotation effort, helping identify prompt robustness issues where small input variations cause significant quality degradation. Evaluating on perturbed inputs alongside original inputs provides a more conservative and reliable quality estimate than evaluation on clean, ideally-phrased inputs alone.

DeepEval's G-Eval framework provides a flexible metric definition system based on evaluation criteria expressed in natural language. Rather than implementing a specialized feedback function for each quality dimension, developers write evaluation criteria as plain English statements: "The response should acknowledge uncertainty when the question cannot be answered confidently from the provided context." The G-Eval framework constructs an evaluation prompt from these criteria and uses an LLM judge to score responses against them, enabling rapid definition of domain-specific metrics without writing scoring logic from scratch.

Red-teaming integration in DeepEval automatically generates adversarial inputs targeting specific vulnerability categories: prompt injection, jailbreak attempts, data leakage, hallucination triggers, and PII exposure. Running the red-team suite before production deployment provides a systematic safety baseline and documents known attack vectors that the system handles correctly. Unlike manual red-teaming that depends on human creativity, automated red-teaming scales to thousands of attack variants and can be run on every code change to detect newly introduced vulnerabilities before they reach production.

DeepEval's evaluation dataset versioning tracks which version of a dataset was used for each evaluation run, enabling meaningful comparisons across time. When dataset examples are added, modified, or removed — because better examples were found, old examples became irrelevant, or quality thresholds were recalibrated — the versioning ensures that score changes are attributable to model improvements rather than dataset composition changes. This audit trail is particularly important for regulatory compliance contexts where evaluation methodology must be documented and traceable.