01 — Framework
RAG Evaluation Dimensions
RAG systems have multiple failure points: the retriever might return irrelevant chunks, the generator might hallucinate, or the combination might be incoherent. You need metrics at three levels: retrieval, generation, and end-to-end.
Three Evaluation Levels
- Retrieval quality: Are the chunks returned by the retriever actually relevant? Measured by precision, recall, MRR, NDCG.
- Generation quality: Is the LLM answer faithful to the retrieved context and factually correct? Measured by faithfulness, relevance, ROUGE, BERTScore.
- End-to-end: Does the full pipeline produce useful outputs? Measured by task-specific metrics and human judgment.
Most teams optimize retrieval heavily but neglect generation metrics. A perfect retriever with a hallucinating LLM is still broken. Measure all three.
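The retrieval metrics above are simple enough to compute yourself before reaching for a framework. A minimal sketch with binary relevance labels (chunk IDs here are toy values for illustration):

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none found)."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0

retrieved = ["c3", "c1", "c7", "c2"]  # retriever output, best first
relevant = {"c1", "c2"}               # gold labels for this query

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 1.0
print(mrr(retrieved, relevant))                  # 0.5 (first hit at rank 2)
```

In practice these run per query and are averaged over the evaluation set.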
💡
Automate what you can; human-evaluate what matters. Automated metrics are fast and reproducible but can mislead. Reserve human evaluation for the final 10% of improvements and quarterly validation.
02 — Metrics
Core Metrics Explained
RAGAS (RAG Assessment) defines five core metrics that work across retrieval and generation. These are automated metrics computed using LLM scoring.
| Metric | Measures | Range | Computation |
| --- | --- | --- | --- |
| Context Precision | % of retrieved chunks actually relevant | 0–1 | LLM ranks chunks; precision@k |
| Context Recall | % of needed info present in chunks | 0–1 | LLM checks if chunks contain answer |
| Faithfulness | Answer supported by context | 0–1 | LLM extracts claims; checks support |
| Answer Relevance | Answer addresses the question | 0–1 | LLM scores question–answer alignment |
| Answer Correctness | Answer is factually accurate | 0–1 | LLM or human comparison to golden answer |
Metric Interpretation
- Context Precision (0.8+): roughly 80% of retrieved chunks are useful; good retrieval quality.
- Context Recall (0.7+): roughly 70% of needed information is present; below this, consider denser chunks or better retrieval.
- Faithfulness (0.8+): answers are mostly grounded in context, though some hallucination may remain.
- Answer Relevance (0.85+): answers address questions directly.
- Answer Correctness: requires a gold standard; the hardest metric to automate.
03 — Framework
RAGAS Framework
RAGAS (Retrieval-Augmented Generation Assessment) is the standard evaluation framework. It provides automated LLM-based metrics and synthetic test set generation. Install via pip install ragas.
RAGAS Evaluation Pipeline
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,  # note: named answer_relevancy in ragas, not answer_relevance
)
from datasets import Dataset
from langchain_openai import ChatOpenAI

# Your RAG outputs
rag_results = {
    "question": [q1, q2, ...],
    "answer": [a1, a2, ...],
    "contexts": [[c1, c2, ...], ...],  # retrieved chunks per question
    "ground_truth": [gt1, gt2, ...],   # optional; needed for context_recall
}
dataset = Dataset.from_dict(rag_results)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    llm=ChatOpenAI(model="gpt-4"),
)
print(results)
# Illustrative output:
# {'context_precision': 0.82, 'context_recall': 0.75,
#  'faithfulness': 0.88, 'answer_relevancy': 0.91}
Synthetic Test Set Generation
RAGAS can generate test questions and golden answers from your documents:
# Module paths and method names vary across ragas versions; this follows 0.1.x.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain.document_loaders import DirectoryLoader

# Load your documents
loader = DirectoryLoader("./docs")
documents = loader.load()

# Generate test set
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,  # number of questions to generate
    distributions={
        simple: 0.5,
        reasoning: 0.3,
        multi_context: 0.2,
    },
)

# Export via a pandas DataFrame
df = testset.to_pandas()
df.to_csv("test_set.csv")
df.to_json("test_set.json")
⚠️
Synthetic test sets bias toward easy questions. LLMs tend to generate questions they can answer. Supplement with human-written questions, especially for edge cases.
04 — Component Metrics
Component-Level Evaluation
Beyond RAGAS, evaluate each component separately. Retriever and generator have distinct failure modes.
Retriever Metrics
| Metric | Definition | Good Value |
| --- | --- | --- |
| MRR | Mean Reciprocal Rank: position of first relevant chunk | 0.8+ |
| NDCG@10 | Normalized Discounted Cumulative Gain | 0.75+ |
| Recall@k | % of gold chunks in top k results | Recall@5 > 0.6 |
| Precision@k | % of returned chunks that are relevant | Precision@5 > 0.5 |
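NDCG rewards placing the most relevant chunks first by discounting relevance logarithmically with rank. A minimal sketch using graded relevance labels (the example labels are made up):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for relevance labels listed in retrieval rank order."""
    def dcg(rels):
        # rank i (0-based) is discounted by log2(i + 2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# relevance of each returned chunk, in rank order (1 = relevant, 0 = not)
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=10), 3))  # 0.92: second hit at rank 3
```

A perfect ordering scores 1.0; pushing relevant chunks down the list lowers the score.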
Generator Metrics
| Metric | Measures | Method |
| --- | --- | --- |
| ROUGE | Lexical overlap with reference | F1 score on n-grams |
| BERTScore | Semantic similarity | Contextual embeddings |
| LLM-Judge | Quality per criterion | Pairwise comparison |
| Human Eval | Correctness, usefulness | Annotation by domain experts |
from rouge import Rouge
from bert_score import score as bert_score

# ROUGE (lexical overlap)
rouge = Rouge()
scores = rouge.get_scores(generated_answer, reference_answer)
print(scores[0]["rouge-1"]["f"])  # ROUGE-1 F1

# BERTScore (semantic similarity)
P, R, F1 = bert_score(
    [generated_answer],
    [reference_answer],
    lang="en",
)
print(f"BERTScore F1: {F1.item():.3f}")
05 — Alternatives
TruLens & DeepEval
RAGAS is great for retrieval-augmented tasks. For broader application monitoring, TruLens and DeepEval offer frameworks beyond RAG.
TruLens RAG Triad
TruLens proposes the RAG Triad: answer relevance, context relevance, and groundedness. Think of it as a simplified alternative to RAGAS.
| TruLens Metric | What it measures | vs RAGAS |
| --- | --- | --- |
| Answer Relevance | Does the answer address the question? | Same as RAGAS answer relevance |
| Context Relevance | Is the context relevant to the question? | Similar to context_precision |
| Groundedness | Is the answer grounded in the context? | Same as RAGAS faithfulness |
# Note: the trulens_eval API changes between versions; this sketch follows the
# 0.x Feedback API. Selector setup (.on(...)) depends on your app's structure;
# see the TruLens docs for wiring feedback to specific chain components.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Define feedback functions for the RAG triad
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons)
f_answer_relevance = Feedback(provider.relevance).on_input_output()
f_context_relevance = Feedback(provider.context_relevance).on_input()

# Wrap your RAG chain
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_app",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

# Run and record feedback
with tru_chain:
    response = rag_chain(question)

Tru().run_dashboard()  # view results in the local dashboard
DeepEval
DeepEval is lightweight and modular. Good for teams wanting minimal dependencies.
# Metric class names follow deepeval's current API (FaithfulnessMetric,
# AnswerRelevancyMetric); older versions used different names.
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Build test cases
test_cases = [
    LLMTestCase(input=q, actual_output=a, retrieval_context=c)
    for q, a, c in zip(questions, answers, contexts)
]
metrics = [
    FaithfulnessMetric(threshold=0.8),
    AnswerRelevancyMetric(threshold=0.85),
]
evaluate(test_cases, metrics)  # prints per-metric pass/fail results
06 — Data
Building Evaluation Datasets
Good evaluation datasets have golden Q&A pairs, retrieval groundtruth, and coverage of edge cases. Build incrementally as you find failures in production.
Dataset Structure
{
  "question": "What is the difference between RAGAS and TruLens?",
  "answer": "RAGAS is more comprehensive with 5 metrics...",
  "contexts": [
    "RAGAS measures context precision and recall...",
    "TruLens uses the RAG triad framework..."
  ],
  "ground_truth": "RAGAS has more granular metrics for retrieval",
  "retrieval_groundtruth": [0, 1],  // indices of the relevant chunks
  "difficulty": "medium",
  "category": "comparison"
}
Building Strategy
- Start with 50 golden examples: High-quality, hand-verified Q&A pairs from your domain.
- Expand with synthetic data: Use RAGAS to generate 200–500 test questions from your corpus.
- Add production failures: Monitor live queries; add ones where RAG fails to evaluation set.
- Cover edge cases: Multi-hop reasoning, temporal constraints, negations, out-of-domain questions.
💡
Use stratified sampling. Ensure evaluation set reflects category/difficulty distribution of real queries. Random sampling misses rare but important cases.
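A stratified sampler is a few lines of standard library code. A sketch, assuming each example carries a `category` key (the field name and target distribution are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(examples, target_dist, n, seed=0):
    """Sample up to n eval examples matching a target category distribution.

    examples: list of dicts, each with a 'category' key
    target_dist: e.g. {'simple': 0.5, 'comparison': 0.5}
    """
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    sample = []
    for cat, frac in target_dist.items():
        pool = by_cat.get(cat, [])
        take = min(round(n * frac), len(pool))  # don't oversample small pools
        sample.extend(rng.sample(pool, take))
    return sample

examples = (
    [{"category": "simple", "question": f"q{i}"} for i in range(40)]
    + [{"category": "comparison", "question": f"q{i}"} for i in range(10)]
)
sample = stratified_sample(examples, {"simple": 0.5, "comparison": 0.5}, n=20)
print(len(sample))  # 20, split 10/10 despite the skewed pool
```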
07 — Automation
CI/CD for RAG Evals
Integrate evaluation into your deployment pipeline. Run evals on every PR; track metrics over time; alert on regressions.
GitHub Actions Example
name: RAG Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install ragas datasets langchain
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check for regressions
        run: |
          python scripts/check_regressions.py \
            --baseline results/main.json \
            --current results/current.json \
            --threshold 0.02  # 2% drop allowed
      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ RAG evaluation shows regression. Check results.'
            })
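The regression-check script referenced in the workflow is straightforward to write. A hypothetical sketch of `scripts/check_regressions.py`, assuming both JSON files map metric names to scores (e.g. `{"faithfulness": 0.88, ...}`):

```python
import json

def find_regressions(baseline: dict, current: dict, threshold: float) -> dict:
    """Metrics whose score dropped by more than `threshold` vs the baseline."""
    return {
        m: (baseline[m], current.get(m, 0.0))
        for m in baseline
        if baseline[m] - current.get(m, 0.0) > threshold
    }

def check(baseline_path: str, current_path: str, threshold: float = 0.02) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    regressions = find_regressions(baseline, current, threshold)
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION {metric}: {old:.3f} -> {new:.3f}")
    return 1 if regressions else 0  # non-zero exit code fails the CI step
```

An argparse wrapper calling `sys.exit(check(...))` would map the `--baseline`, `--current`, and `--threshold` flags from the workflow onto this function.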
Metrics Dashboard
Use tools like Arize, Weights & Biases, or LangSmith to track eval metrics over time. Watch for:
- Context precision drift: If retriever quality degrades, update embeddings or re-index.
- Faithfulness drop: May indicate model is hallucinating more; retune temperature or add guardrails.
- Answer relevance variance: Questions becoming harder? Check user query distribution changes.
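A lightweight way to watch for drift without a full dashboard is to compare a recent window of eval runs against the earlier history. A sketch (window size and tolerance are illustrative defaults):

```python
from statistics import mean

def detect_drift(history, window=5, tolerance=0.03):
    """Flag drift when the mean of the last `window` runs falls more than
    `tolerance` below the mean of the earlier runs."""
    if len(history) < 2 * window:
        return False  # not enough runs to compare
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return baseline - recent > tolerance

# e.g. context precision from the last ten nightly eval runs
runs = [0.82, 0.83, 0.81, 0.82, 0.84, 0.80, 0.78, 0.77, 0.76, 0.75]
print(detect_drift(runs))  # True: the metric is trending down
```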
09 — Further Reading
References
Academic Papers
- Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Deutsch, D. et al. (2021). Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. TACL. arXiv:2010.00490.
- Zhang, T. et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR. arXiv:1904.09675.
Documentation & Guides
Practitioner Writing