01 — Framework
RAG Evaluation Dimensions
RAG systems have multiple failure points: the retriever might return irrelevant chunks, the generator might hallucinate, or the combination might be incoherent. You need metrics at three levels: retrieval, generation, and end-to-end.
Three Evaluation Levels
- Retrieval quality: Are the chunks returned by the retriever actually relevant? Measured by precision, recall, MRR, NDCG.
- Generation quality: Is the LLM answer faithful to the retrieved context and factually correct? Measured by faithfulness, relevance, ROUGE, BERTScore.
- End-to-end: Does the full pipeline produce useful outputs? Measured by task-specific metrics and human judgment.
Most teams optimize retrieval heavily but neglect generation metrics. A perfect retriever with a hallucinating LLM is still broken. Measure all three.
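The retrieval metrics above are simple enough to compute yourself before reaching for a framework. A minimal sketch with binary relevance labels (chunk IDs here are toy values for illustration):

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none found)."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0

retrieved = ["c3", "c1", "c7", "c2"]  # retriever output, best first
relevant = {"c1", "c2"}               # gold labels for this query

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 1.0
print(mrr(retrieved, relevant))                  # 0.5 (first hit at rank 2)
```

In practice these run per query and are averaged over the evaluation set.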
💡
Automate what you can; human-evaluate what matters. Automated metrics are fast and reproducible but can mislead. Reserve human evaluation for the final 10% of improvements and quarterly validation.
02 — Metrics
Core Metrics Explained
RAGAS (RAG Assessment) defines five core metrics that work across retrieval and generation. These are automated metrics computed using LLM scoring.
| Metric | Measures | Range | Computation |
| --- | --- | --- | --- |
| Context Precision | % of retrieved chunks actually relevant | 0–1 | LLM ranks chunks; precision@k |
| Context Recall | % of needed info present in chunks | 0–1 | LLM checks if chunks contain answer |
| Faithfulness | Answer supported by context | 0–1 | LLM extracts claims; checks support |
| Answer Relevance | Answer addresses the question | 0–1 | LLM scores question–answer alignment |
| Answer Correctness | Answer is factually accurate | 0–1 | LLM or human comparison to golden answer |
Metric Interpretation
- Context Precision (0.8+): roughly 80% of retrieved chunks are useful; good retrieval quality.
- Context Recall (0.7+): roughly 70% of needed information is present; below this, consider denser chunks or better retrieval.
- Faithfulness (0.8+): answers are mostly grounded in context, though some hallucination may remain.
- Answer Relevance (0.85+): answers address questions directly.
- Answer Correctness: requires a gold standard; the hardest metric to automate.
03 — Framework
RAGAS Framework
RAGAS (Retrieval-Augmented Generation Assessment) is the standard evaluation framework. It provides automated LLM-based metrics and synthetic test set generation. Install via pip install ragas.
RAGAS Evaluation Pipeline
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,  # note: named answer_relevancy in ragas, not answer_relevance
)
from datasets import Dataset
from langchain_openai import ChatOpenAI

# Your RAG outputs
rag_results = {
    "question": [q1, q2, ...],
    "answer": [a1, a2, ...],
    "contexts": [[c1, c2, ...], ...],  # retrieved chunks per question
    "ground_truth": [gt1, gt2, ...],   # optional; needed for context_recall
}
dataset = Dataset.from_dict(rag_results)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    llm=ChatOpenAI(model="gpt-4"),
)
print(results)
# Illustrative output:
# {'context_precision': 0.82, 'context_recall': 0.75,
#  'faithfulness': 0.88, 'answer_relevancy': 0.91}
Synthetic Test Set Generation
RAGAS can generate test questions and golden answers from your documents:
# Module paths and method names vary across ragas versions; this follows 0.1.x.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain.document_loaders import DirectoryLoader

# Load your documents
loader = DirectoryLoader("./docs")
documents = loader.load()

# Generate test set
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,  # number of questions to generate
    distributions={
        simple: 0.5,
        reasoning: 0.3,
        multi_context: 0.2,
    },
)

# Export via a pandas DataFrame
df = testset.to_pandas()
df.to_csv("test_set.csv")
df.to_json("test_set.json")
⚠️
Synthetic test sets bias toward easy questions. LLMs tend to generate questions they can answer. Supplement with human-written questions, especially for edge cases.
04 — Component Metrics
Component-Level Evaluation
Beyond RAGAS, evaluate each component separately. Retriever and generator have distinct failure modes.
Retriever Metrics
| Metric | Definition | Good Value |
| --- | --- | --- |
| MRR | Mean Reciprocal Rank: position of first relevant chunk | 0.8+ |
| NDCG@10 | Normalized Discounted Cumulative Gain | 0.75+ |
| Recall@k | % of gold chunks in top k results | Recall@5 > 0.6 |
| Precision@k | % of returned chunks that are relevant | Precision@5 > 0.5 |
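NDCG rewards placing the most relevant chunks first by discounting relevance logarithmically with rank. A minimal sketch using graded relevance labels (the example labels are made up):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for relevance labels listed in retrieval rank order."""
    def dcg(rels):
        # rank i (0-based) is discounted by log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# relevance of each returned chunk, in rank order (1 = relevant, 0 = not)
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=10), 3))  # 0.92: second hit at rank 3
```

A perfect ordering scores 1.0; pushing relevant chunks down the list lowers the score.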
Generator Metrics
| Metric | Measures | Method |
| --- | --- | --- |
| ROUGE | Lexical overlap with reference | F1 score on n-grams |
| BERTScore | Semantic similarity | Contextual embeddings |
| LLM-Judge | Quality per criterion | Pairwise comparison |
| Human Eval | Correctness, usefulness | Annotation by domain experts |
from rouge import Rouge
from bert_score import score as bert_score

# ROUGE (lexical overlap)
rouge = Rouge()
scores = rouge.get_scores(generated_answer, reference_answer)
print(scores[0]["rouge-1"]["f"])  # ROUGE-1 F1

# BERTScore (semantic similarity)
P, R, F1 = bert_score(
    [generated_answer],
    [reference_answer],
    lang="en",
)
print(f"BERTScore F1: {F1.item():.3f}")
05 — Alternatives
TruLens & DeepEval
RAGAS is great for retrieval-augmented tasks. For broader application monitoring, TruLens and DeepEval offer frameworks beyond RAG.
TruLens RAG Triad
TruLens proposes the RAG Triad: answer relevance, context relevance, and groundedness. Think of it as a simplified alternative to RAGAS.
| TruLens Metric | What it measures | vs RAGAS |
| --- | --- | --- |
| Answer Relevance | Does the answer address the question? | Same as RAGAS answer relevance |
| Context Relevance | Is the context relevant to the question? | Similar to context_precision |
| Groundedness | Is the answer grounded in the context? | Same as RAGAS faithfulness |
# Note: the trulens_eval API changes between versions; this sketch follows the
# 0.x Feedback API. Selector setup (.on(...)) depends on your app's structure;
# see the TruLens docs for wiring feedback to specific chain components.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Define feedback functions for the RAG triad
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons)
f_answer_relevance = Feedback(provider.relevance).on_input_output()
f_context_relevance = Feedback(provider.context_relevance).on_input()

# Wrap your RAG chain
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_app",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

# Run and record feedback
with tru_chain:
    response = rag_chain(question)

Tru().run_dashboard()  # view results in the local dashboard
DeepEval
DeepEval is lightweight and modular. Good for teams wanting minimal dependencies.
# Metric class names follow deepeval's current API (FaithfulnessMetric,
# AnswerRelevancyMetric); older versions used different names.
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Build test cases
test_cases = [
    LLMTestCase(input=q, actual_output=a, retrieval_context=c)
    for q, a, c in zip(questions, answers, contexts)
]
metrics = [
    FaithfulnessMetric(threshold=0.8),
    AnswerRelevancyMetric(threshold=0.85),
]
evaluate(test_cases, metrics)  # prints per-metric pass/fail results
06 — Data
Building Evaluation Datasets
Good evaluation datasets have golden Q&A pairs, retrieval groundtruth, and coverage of edge cases. Build incrementally as you find failures in production.
Dataset Structure
{
  "question": "What is the difference between RAGAS and TruLens?",
  "answer": "RAGAS is more comprehensive with 5 metrics...",
  "contexts": [
    "RAGAS measures context precision and recall...",
    "TruLens uses the RAG triad framework..."
  ],
  "ground_truth": "RAGAS has more granular metrics for retrieval",
  "retrieval_groundtruth": [0, 1],  // indices of the relevant chunks
  "difficulty": "medium",
  "category": "comparison"
}
Building Strategy
- Start with 50 golden examples: High-quality, hand-verified Q&A pairs from your domain.
- Expand with synthetic data: Use RAGAS to generate 200–500 test questions from your corpus.
- Add production failures: Monitor live queries; add ones where RAG fails to evaluation set.
- Cover edge cases: Multi-hop reasoning, temporal constraints, negations, out-of-domain questions.
💡
Use stratified sampling. Ensure evaluation set reflects category/difficulty distribution of real queries. Random sampling misses rare but important cases.
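A stratified sampler is a few lines of standard library code. A sketch, assuming each example carries a `category` key (the field name and target distribution are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(examples, target_dist, n, seed=0):
    """Sample up to n eval examples matching a target category distribution.

    examples: list of dicts, each with a 'category' key
    target_dist: e.g. {'simple': 0.5, 'comparison': 0.5}
    """
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    sample = []
    for cat, frac in target_dist.items():
        pool = by_cat.get(cat, [])
        take = min(round(n * frac), len(pool))  # don't oversample small pools
        sample.extend(rng.sample(pool, take))
    return sample

examples = (
    [{"category": "simple", "question": f"q{i}"} for i in range(40)]
    + [{"category": "comparison", "question": f"q{i}"} for i in range(10)]
)
sample = stratified_sample(examples, {"simple": 0.5, "comparison": 0.5}, n=20)
print(len(sample))  # 20, split 10/10 despite the skewed pool
```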
07 — Automation
CI/CD for RAG Evals
Integrate evaluation into your deployment pipeline. Run evals on every PR; track metrics over time; alert on regressions.
GitHub Actions Example
name: RAG Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install ragas datasets langchain
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check for regressions
        run: |
          python scripts/check_regressions.py \
            --baseline results/main.json \
            --current results/current.json \
            --threshold 0.02  # 2% drop allowed
      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ RAG evaluation shows regression. Check results.'
            })
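The regression-check script referenced in the workflow is straightforward to write. A hypothetical sketch of `scripts/check_regressions.py`, assuming both JSON files map metric names to scores (e.g. `{"faithfulness": 0.88, ...}`):

```python
import json

def find_regressions(baseline: dict, current: dict, threshold: float) -> dict:
    """Metrics whose score dropped by more than `threshold` vs the baseline."""
    return {
        m: (baseline[m], current.get(m, 0.0))
        for m in baseline
        if baseline[m] - current.get(m, 0.0) > threshold
    }

def check(baseline_path: str, current_path: str, threshold: float = 0.02) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    regressions = find_regressions(baseline, current, threshold)
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION {metric}: {old:.3f} -> {new:.3f}")
    return 1 if regressions else 0  # non-zero exit code fails the CI step
```

An argparse wrapper calling `sys.exit(check(...))` would map the `--baseline`, `--current`, and `--threshold` flags from the workflow onto this function.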
Metrics Dashboard
Use tools like Arize, Weights & Biases, or LangSmith to track eval metrics over time. Watch for:
- Context precision drift: If retriever quality degrades, update embeddings or re-index.
- Faithfulness drop: May indicate model is hallucinating more; retune temperature or add guardrails.
- Answer relevance variance: Questions becoming harder? Check user query distribution changes.
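A lightweight way to watch for drift without a full dashboard is to compare a recent window of eval runs against the earlier history. A sketch (window size and tolerance are illustrative defaults):

```python
from statistics import mean

def detect_drift(history, window=5, tolerance=0.03):
    """Flag drift when the mean of the last `window` runs falls more than
    `tolerance` below the mean of the earlier runs."""
    if len(history) < 2 * window:
        return False  # not enough runs to compare
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return baseline - recent > tolerance

# e.g. context precision from the last ten nightly eval runs
runs = [0.82, 0.83, 0.81, 0.82, 0.84, 0.80, 0.78, 0.77, 0.76, 0.75]
print(detect_drift(runs))  # True: the metric is trending down
```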
09 — Further Reading
References
Academic Papers
- Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Deutsch, D. et al. (2021). Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. TACL. arXiv:2010.00490.
- Zhang, T. et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR. arXiv:1904.09675.
Documentation & Guides
Practitioner Writing