DeepEval is an open-source LLM evaluation framework with 14+ built-in metrics (answer relevancy, faithfulness, hallucination, bias, toxicity), a pytest-like test runner, and a hosted evaluation dashboard (Confident AI).
Evaluating LLM outputs is hard because there's rarely a ground-truth string to compare against. DeepEval solves this by using LLMs as judges — measuring properties like factual accuracy, contextual relevance, and hallucination through carefully engineered evaluation prompts.
It follows a pytest-like pattern: write test cases, define pass/fail thresholds, and run evals as part of your CI/CD pipeline. When a model update causes answer quality to regress, the eval suite catches it before it reaches production.
pip install deepeval
deepeval login # Optional: connect to Confident AI dashboard
DeepEval provides metrics for common LLM failure modes:
All metrics use LLM-as-judge under the hood (default: GPT-4o), returning a float score 0–1 with an explanation string.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
def test_rag_answer_quality():
test_case = LLMTestCase(
input="What is the capital of France?",
actual_output="Paris is the capital of France.",
expected_output="Paris", # optional ground truth
retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)
assert_test(test_case, [
AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini"),
FaithfulnessMetric(threshold=0.8, model="gpt-4o-mini"),
])
# Run with: deepeval test run test_rag.py
# Or with pytest: pytest test_rag.py -v
Tests fail if any metric score falls below its threshold. The failure message includes the metric's reasoning, helping you understand why the model underperformed.
from deepeval.metrics import (
AnswerRelevancyMetric,
FaithfulnessMetric,
ContextualPrecisionMetric,
ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval import evaluate
# Create test cases from your golden dataset
dataset = EvaluationDataset()
for item in golden_dataset:
# Run your RAG pipeline to get actual outputs + retrieved context
retrieved_chunks, answer = rag_pipeline(item["question"])
dataset.add_test_case(LLMTestCase(
input=item["question"],
actual_output=answer,
expected_output=item["expected_answer"],
retrieval_context=retrieved_chunks,
))
# Define metrics
metrics = [
AnswerRelevancyMetric(threshold=0.7),
FaithfulnessMetric(threshold=0.8),
ContextualPrecisionMetric(threshold=0.6),
ContextualRecallMetric(threshold=0.7),
]
# Evaluate all test cases
results = evaluate(dataset, metrics)
print(f"Overall pass rate: {results.overall_pass_rate:.1%}")
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
# GEval: define evaluation criteria in natural language
code_quality_metric = GEval(
name="CodeCorrectness",
criteria="Determine if the generated code is syntactically correct and implements the described functionality without bugs.",
evaluation_params=[
LLMTestCaseParams.INPUT, # the problem description
LLMTestCaseParams.ACTUAL_OUTPUT, # the generated code
],
threshold=0.7,
model="gpt-4o",
)
# Custom metric class
from deepeval.metrics import BaseMetric
class SQLValidityMetric(BaseMetric):
def __init__(self, threshold=0.5):
self.threshold = threshold
self.name = "SQL Validity"
def measure(self, test_case: LLMTestCase) -> float:
sql = test_case.actual_output
try:
import sqlparse
sqlparse.parse(sql)[0] # raises if invalid
self.score = 1.0
self.reason = "Valid SQL syntax"
except Exception as e:
self.score = 0.0
self.reason = f"Invalid SQL: {e}"
return self.score
def is_successful(self) -> bool:
return self.score >= self.threshold
# .github/workflows/eval.yml
name: LLM Eval Suite
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install dependencies
run: pip install deepeval openai
- name: Run eval suite
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
run: |
deepeval test run tests/eval_suite.py --exit-on-first-failure --max-concurrent 5
With Confident AI (DeepEval's hosted dashboard), results are stored and compared across runs automatically. You can set regression alerts: if overall pass rate drops by >5% compared to the previous run, the CI fails.
Evaluation cost: DeepEval uses GPT-4o-mini or GPT-4o as the judge. Evaluating 100 test cases with 4 metrics = ~400 LLM calls. At $0.15/1M input tokens (GPT-4o-mini), a 100-case suite with 500-token contexts costs ~$0.30. Still, budget this into your CI costs.
Judge model bias: LLM judges have their own biases (verbose answers score higher, etc.). Calibrate by running the same test case multiple times — variance >0.2 suggests the metric is unreliable for that type of question.
Metric interdependence: FaithfulnessMetric measures whether the answer is supported by context. AnswerRelevancyMetric measures whether the answer addresses the question. A model can score high on faithfulness but low on relevancy (accurately copying irrelevant context). Use both.
Async execution: Set max_concurrent to control parallel judge calls. Too high and you hit rate limits; too low and evals are slow. Start at 5–10 for GPT-4o-mini.
DeepEval is a Python-based LLM evaluation framework that provides a comprehensive set of pre-built metrics for RAG, agent, and general LLM application testing. It integrates with pytest, enabling LLM quality checks as part of standard test suites with pass/fail thresholds and detailed failure reporting.
| Metric | Evaluates | Threshold Type | LLM Judge |
|---|---|---|---|
| AnswerRelevancy | Answer addresses the input | Score 0–1 | Yes |
| Faithfulness | Answer grounded in context | Score 0–1 | Yes |
| ContextualPrecision | Retrieved context relevance | Score 0–1 | Yes |
| Hallucination | Factual accuracy | Score 0–1 | Yes |
| BiasMetric | Opinion/bias presence | Score 0–1 | Yes |
| ToxicityMetric | Harmful content | Score 0–1 | Via classifier |
DeepEval's pytest integration allows embedding LLM quality checks directly into CI/CD pipelines using familiar Python testing patterns. A test case defines the input, actual output, and optional expected output and retrieval context; metric assertions specify the minimum acceptable score for each quality dimension. When a metric falls below threshold, the test fails with a human-readable explanation of what the judging LLM found problematic, making it straightforward to diagnose and fix the underlying issue rather than just knowing a score threshold was breached.
DeepEval Confident AI is the hosted companion service that provides a web dashboard for visualizing evaluation results, running batch evaluations, and managing evaluation datasets without requiring local infrastructure. For teams without dedicated MLOps resources, the hosted option enables production quality monitoring with minimal setup — instrumentation is added to the application, evaluation runs are submitted to the hosted service, and quality trends are visible in the dashboard without maintaining evaluation infrastructure.
DeepEval's test case synthesis feature generates additional test cases from existing examples by applying perturbation techniques: paraphrasing the input, introducing typos, changing numerical values, negating claims, or adding irrelevant context. This synthetic augmentation expands coverage of the evaluation dataset without requiring additional human annotation effort, helping identify prompt robustness issues where small input variations cause significant quality degradation. Evaluating on perturbed inputs alongside original inputs provides a more conservative and reliable quality estimate than evaluation on clean, ideally-phrased inputs alone.
DeepEval's G-Eval framework provides a flexible metric definition system based on evaluation criteria expressed in natural language. Rather than implementing a specialized feedback function for each quality dimension, developers write evaluation criteria as plain English statements: "The response should acknowledge uncertainty when the question cannot be answered confidently from the provided context." The G-Eval framework constructs an evaluation prompt from these criteria and uses an LLM judge to score responses against them, enabling rapid definition of domain-specific metrics without writing scoring logic from scratch.
Red-teaming integration in DeepEval automatically generates adversarial inputs targeting specific vulnerability categories: prompt injection, jailbreak attempts, data leakage, hallucination triggers, and PII exposure. Running the red-team suite before production deployment provides a systematic safety baseline and documents known attack vectors that the system handles correctly. Unlike manual red-teaming that depends on human creativity, automated red-teaming scales to thousands of attack variants and can be run on every code change to detect newly introduced vulnerabilities before they reach production.
DeepEval's evaluation dataset versioning tracks which version of a dataset was used for each evaluation run, enabling meaningful comparisons across time. When dataset examples are added, modified, or removed — because better examples were found, old examples became irrelevant, or quality thresholds were recalibrated — the versioning ensures that score changes are attributable to model improvements rather than dataset composition changes. This audit trail is particularly important for regulatory compliance contexts where evaluation methodology must be documented and traceable.