
RAG Evaluation

RAGAS, TruLens, and DeepEval — measuring retrieval quality, answer faithfulness, and end-to-end RAG performance.

At a glance: 5 core metrics · automated + human evaluation loop · the TruLens RAG Triad
Contents
  1. Evaluation dimensions
  2. Core metrics
  3. RAGAS framework
  4. Component evaluation
  5. TruLens & DeepEval
  6. Building eval datasets
  7. CI/CD for RAG
  8. Tools & references
01 — Framework

RAG Evaluation Dimensions

RAG systems have multiple failure points: the retriever might return irrelevant chunks, the generator might hallucinate, or the combination might be incoherent. You need metrics at three levels: retrieval, generation, and end-to-end.

Three Evaluation Levels

Most teams optimize retrieval heavily but neglect generation metrics. A perfect retriever with a hallucinating LLM is still broken. Measure all three.

💡 Automate what you can; human-evaluate what matters. Automated metrics are fast and reproducible but can mislead. Reserve human evaluation for the final 10% of improvements and quarterly validation.
02 — Metrics

Core Metrics Explained

RAGAS (RAG Assessment) defines five core metrics that work across retrieval and generation. These are automated metrics computed using LLM scoring.

| Metric | Measures | Range | Computation |
| --- | --- | --- | --- |
| Context Precision | % of retrieved chunks actually relevant | 0–1 | LLM ranks chunks; precision@k |
| Context Recall | % of needed info present in the chunks | 0–1 | LLM checks whether chunks contain the answer |
| Faithfulness | Answer supported by context | 0–1 | LLM extracts claims; checks support |
| Answer Relevance | Answer addresses the question | 0–1 | LLM scores question–answer alignment |
| Answer Correctness | Answer is factually accurate | 0–1 | LLM or human comparison to a golden answer |

Metric Interpretation

Context Precision (0.8+): at least 80% of retrieved chunks are useful; retrieval quality is good.
Context Recall (0.7+): at least 70% of the needed information is present; below this, consider denser chunks or better retrieval.
Faithfulness (0.8+): answers are mostly grounded in the context, though some hallucination may remain.
Answer Relevance (0.85+): answers address the questions directly.
Answer Correctness: requires a gold standard; the hardest metric to automate.
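To make the faithfulness computation concrete, here is a minimal sketch of the claim-checking loop: claims are extracted from the answer, each one is checked against the retrieved context, and the score is the supported fraction. The `is_supported` callable stands in for the LLM judge call, and `faithfulness_score` is an illustrative helper, not a RAGAS API.

```python
from typing import Callable, List

def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims supported by the retrieved context.

    `is_supported(claim, context)` is a stand-in for the LLM judge
    call; in RAGAS this check is performed by the evaluator LLM.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)
```

With a trivial substring judge, an answer making two claims of which one appears in the context scores 0.5; a real judge would use entailment-style LLM prompting instead.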

03 — Framework

RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is the standard evaluation framework. It provides automated LLM-based metrics and synthetic test set generation. Install via pip install ragas.

RAGAS Evaluation Pipeline

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,  # note: ragas names this metric answer_relevancy
)

# Your RAG outputs
rag_results = {
    "question": [q1, q2, ...],
    "answer": [a1, a2, ...],
    "contexts": [[c1, c2, ...], ...],  # retrieved chunks per question
    "ground_truth": [gt1, gt2, ...],   # optional
}
dataset = Dataset.from_dict(rag_results)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    llm=ChatOpenAI(model="gpt-4"),
)
print(results)
# Example output:
# {'context_precision': 0.82, 'context_recall': 0.75,
#  'faithfulness': 0.88, 'answer_relevancy': 0.91}
```

Synthetic Test Set Generation

RAGAS can generate test questions and golden answers from your documents:

```python
from langchain.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Load your documents
loader = DirectoryLoader("./docs")
documents = loader.load()

# Generate test set
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,  # number of questions to generate
    distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
)

# Export
df = testset.to_pandas()
df.to_csv("test_set.csv", index=False)
df.to_json("test_set.json", orient="records")
```

(The API shown follows ragas 0.1.x; the testset module has been reorganized in later releases.)
⚠️ Synthetic test sets bias toward easy questions. LLMs tend to generate questions they can answer. Supplement with human-written questions, especially for edge cases.
04 — Component Metrics

Component-Level Evaluation

Beyond RAGAS, evaluate each component separately. Retriever and generator have distinct failure modes.

Retriever Metrics

| Metric | Definition | Good Value |
| --- | --- | --- |
| MRR | Mean Reciprocal Rank: rank of the first relevant result | 0.8+ |
| NDCG@10 | Normalized Discounted Cumulative Gain | 0.75+ |
| Recall@k | % of gold chunks in the top-k results | Recall@5 > 0.6 |
| Precision@k | % of returned chunks that are relevant | Precision@5 > 0.5 |
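The rank metrics above are simple enough to compute directly from ranked chunk IDs and a gold set; a minimal sketch (function names are illustrative):

```python
from typing import List, Set

def mrr(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Share of gold chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k results that are gold chunks."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)
```

Averaging these per-query values over the whole evaluation set gives the dataset-level numbers in the table.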

Generator Metrics

| Metric | Measures | Method |
| --- | --- | --- |
| ROUGE | Lexical overlap with a reference | F1 score on n-grams |
| BERTScore | Semantic similarity | Contextual embeddings |
| LLM-Judge | Quality per criterion | Pairwise comparison |
| Human Eval | Correctness, usefulness | Annotation by domain experts |
```python
from rouge import Rouge
from bert_score import score as bert_score

# ROUGE
rouge = Rouge()
scores = rouge.get_scores(generated_answer, reference_answer)
print(scores[0]["rouge-1"])  # ROUGE-1 precision/recall/F1

# BERTScore
P, R, F1 = bert_score(
    [generated_answer],
    [reference_answer],
    lang="en",
)
print(f"BERTScore F1: {F1.item():.3f}")
```
05 — Alternatives

TruLens & DeepEval

RAGAS focuses on retrieval-augmented tasks. TruLens and DeepEval extend evaluation to broader application monitoring and general LLM testing.

TruLens RAG Triad

TruLens proposes the RAG Triad: answer relevance, context relevance, and groundedness. Think of it as a simplified alternative to RAGAS.

| TruLens Metric | What it measures | vs RAGAS |
| --- | --- | --- |
| Answer Relevance | Does the answer address the question? | Same as RAGAS answer relevance |
| Context Relevance | Is the context relevant to the question? | Similar to context_precision |
| Groundedness | Is the answer grounded in the context? | Same as RAGAS faithfulness |
```python
import numpy as np
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.app import App
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

tru = Tru()
provider = OpenAI()

# Selector for the retrieved context inside the wrapped chain
context = App.select_context(rag_chain)

# Define the three triad feedback functions
f_answer_relevance = Feedback(provider.relevance).on_input_output()
f_context_relevance = (
    Feedback(provider.qs_relevance).on_input().on(context).aggregate(np.mean)
)
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Wrap your RAG chain
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_app",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

# Run inside the recorder, then inspect results in the dashboard
with tru_chain:
    response = rag_chain.invoke(question)

tru.run_dashboard()
```

(The API shown follows trulens_eval 0.x; newer TruLens releases rename some of these entry points.)

DeepEval

DeepEval is lightweight and modular. Good for teams wanting minimal dependencies.

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Build test cases from your RAG outputs
test_cases = [
    LLMTestCase(
        input=q,
        actual_output=a,
        retrieval_context=c,  # list of retrieved chunks
    )
    for q, a, c in zip(questions, answers, contexts)
]

metrics = [
    FaithfulnessMetric(threshold=0.8),
    AnswerRelevancyMetric(threshold=0.85),
]

# evaluate() prints a per-test, per-metric pass/fail report
results = evaluate(test_cases, metrics)
```
06 — Data

Building Evaluation Datasets

Good evaluation datasets contain golden Q&A pairs, retrieval ground truth, and coverage of edge cases. Build them incrementally as you find failures in production.

Dataset Structure

```json
{
  "question": "What is the difference between RAGAS and TruLens?",
  "answer": "RAGAS is more comprehensive with 5 metrics...",
  "contexts": [
    "RAGAS measures context precision and recall...",
    "TruLens uses the RAG triad framework..."
  ],
  "ground_truth": "RAGAS has more granular metrics for retrieval",
  "retrieval_groundtruth": [0, 1],
  "difficulty": "medium",
  "category": "comparison"
}
```

Here retrieval_groundtruth lists the indices of the retrieved chunks that are actually relevant.

Building Strategy

💡 Use stratified sampling. Ensure evaluation set reflects category/difficulty distribution of real queries. Random sampling misses rare but important cases.
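The stratified-sampling tip can be sketched in a few lines; `stratified_sample` is an illustrative helper that keeps each category's share of the evaluation set proportional to its share of real queries:

```python
import random
from collections import defaultdict
from typing import Dict, List

def stratified_sample(
    cases: List[dict], key: str, n: int, seed: int = 0
) -> List[dict]:
    """Sample n eval cases while preserving the distribution of `key`
    (e.g. "category" or "difficulty") observed in production queries."""
    rng = random.Random(seed)
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for case in cases:
        buckets[case[key]].append(case)
    sample: List[dict] = []
    for items in buckets.values():
        # Each bucket gets at least one slot so rare categories survive
        quota = max(1, round(n * len(items) / len(cases)))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample[:n]
```

The `max(1, ...)` floor is the point: plain random sampling would routinely return zero cases from a category that is 2% of traffic but disproportionately important.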
07 — Automation

CI/CD for RAG Evals

Integrate evaluation into your deployment pipeline. Run evals on every PR; track metrics over time; alert on regressions.

GitHub Actions Example

```yaml
name: RAG Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install ragas datasets langchain
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check for regressions
        run: |
          python scripts/check_regressions.py \
            --baseline results/main.json \
            --current results/current.json \
            --threshold 0.02  # 2% drop allowed
      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ RAG evaluation shows regression. Check results.'
            })
```
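The core of the check_regressions.py script referenced in the workflow might look like this sketch (the function and its return shape are illustrative):

```python
def check_regressions(
    baseline: dict, current: dict, threshold: float = 0.02
) -> list:
    """Return (metric, baseline, current) tuples for every metric whose
    score dropped by more than `threshold` relative to the baseline run."""
    regressions = []
    for metric, base_score in baseline.items():
        cur_score = current.get(metric)
        if cur_score is not None and base_score - cur_score > threshold:
            regressions.append((metric, base_score, cur_score))
    return regressions

# In CI you would load results/main.json and results/current.json with
# json.load, print any regressions, and sys.exit(1) so the job fails.
```

Metrics present only in the current run are ignored here; only drops against the baseline fail the build, so adding a new metric never blocks a PR.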

Metrics Dashboard

Use tools like Arize, Weights & Biards, or LangSmith to track eval metrics over time. Watch for gradual metric drift as documents and traffic change, sudden drops after deploys or prompt changes, and regressions concentrated in specific query categories.
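A hosted dashboard is ideal, but even an append-only JSONL history file is enough to start tracking trends; `log_eval_run` here is an illustrative helper, not part of any of the tools above:

```python
import json
import time
from pathlib import Path

def log_eval_run(metrics: dict, path: str = "eval_history.jsonl") -> None:
    """Append one evaluation run to a JSONL history file so metric
    trends can be plotted or alerted on later."""
    record = {"timestamp": time.time(), **metrics}
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Each CI run appends one line; a regression check or plotting script can then read the file back with one `json.loads` per line.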

08 — Ecosystem

Evaluation Tools

RAGAS (RAG Evaluation): standard RAG evaluation framework; 5 core metrics, synthetic test set generation, LLM-based scoring.
TruLens (Application Monitoring): lightweight feedback framework; RAG Triad metrics, explainability, production monitoring dashboard.
DeepEval (LLM Testing): modular evaluation library; minimal dependencies, supports RAG and general LLM metrics.
LangSmith (Debugging & Monitoring): LangChain's evaluation platform; trace runs, evaluate on datasets, track metrics in production.
UpTrain (Data Quality): open-source data quality framework; checks for data leakage, semantic drift, factuality.
BERTScore (Generation Metrics): semantic similarity scoring; contextual embeddings, supports many languages.
Arize Phoenix (LLM Observability): production monitoring; traces, embeddings, retrieval evaluation, drift detection.
Continuous Eval (Workflow Integration): drop-in evaluation for any pipeline; supports batch and streaming eval modes.