RAG Evaluation

RAGAS

A reference-free RAG evaluation framework that measures Faithfulness, Answer Relevance, Context Precision, and Context Recall, using LLM judges instead of human annotations to score your pipeline at scale.

4 core metrics · Reference-free evaluation · LLM-judged quality


SECTION 01

What RAGAS measures

Traditional NLP evaluation (BLEU, ROUGE) compares generated text to a reference answer — which requires expensive human-written ground truth for every test case. RAGAS flips this: it uses an LLM judge to evaluate quality dimensions directly, requiring only questions and the retrieved context + generated answer your pipeline produced.

This means you can evaluate a RAG pipeline with just a list of realistic questions. No need to write expected answers by hand. The LLM judge (GPT-4, Claude, etc.) scores each response on the dimensions that actually matter for RAG: did the answer faithfully use the retrieved context? Was the right context retrieved? Does the answer actually address the question?

RAGAS works at two levels: component evaluation (how good is your retriever? how good is your generator?) and end-to-end evaluation (how good is the full pipeline for real user questions?).

SECTION 02

The 4 core metrics explained

Faithfulness: Does the answer make only claims that are supported by the retrieved context? Measures hallucination. Score 0–1. An answer that invents facts not in the context scores low, even if those facts happen to be true.

Answer Relevance: Does the answer actually address the question asked? Measures if the answer is on-topic. Score 0–1. An answer that's factually correct but addresses a different question scores low.

Context Precision: Are the retrieved chunks actually relevant to the question? Measures retrieval quality. Score 0–1. If you retrieve 5 chunks but only 1 is relevant, precision is low even if that 1 chunk is enough to answer.

Context Recall (requires reference answers): Did the retrieval find all the information needed to answer? Measures retrieval coverage. Needs ground-truth answers to compare against. Score 0–1.

The three reference-free metrics (Faithfulness + Answer Relevance + Context Precision) form the "RAG Triad" you can run with zero annotation cost.
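The arithmetic behind the first two reference-free scores is simple once the judge has produced its verdicts. In RAGAS an LLM extracts claims from the answer and judges each one against the context; in this sketch the judge verdicts are hard-coded booleans so only the scoring math is shown (and note the real Context Precision uses a rank-weighted mean, not this simplest form):

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the judge found
    supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Fraction of retrieved chunks the judge marked relevant.
    (Simplified: RAGAS weights relevant chunks by rank.)"""
    if not chunk_relevant:
        return 0.0
    return sum(chunk_relevant) / len(chunk_relevant)

# An answer with 3 claims, one of them unsupported (hallucinated):
print(faithfulness_score([True, True, False]))  # 2/3 ≈ 0.67
# 5 retrieved chunks, only 1 relevant:
print(context_precision_score([True, False, False, False, False]))  # 0.2
```

This also makes the Faithfulness caveat from Section 02 concrete: a claim that is true but absent from the context still counts as unsupported.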

SECTION 03

Setting up RAGAS

pip install ragas langchain-anthropic datasets

import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# RAGAS expects a HuggingFace Dataset with these columns:
# - question: str
# - answer: str       (generated by your RAG pipeline)
# - contexts: list[str]  (chunks retrieved by your retriever)
# - ground_truth: str  (optional — needed for context_recall)

data = {
    "question": [
        "What is the capital of France?",
        "How does attention work in transformers?",
    ],
    "answer": [
        "The capital of France is Paris.",
        "Attention computes a weighted sum of values based on query-key similarity.",
    ],
    "contexts": [
        ["France is a country in Western Europe. Its capital city is Paris."],
        ["The attention mechanism allows models to focus on relevant parts of the input by computing similarity scores between queries and keys, then using these scores as weights over values."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Attention uses queries, keys, and values to compute weighted sums.",
    ],
}
dataset = Dataset.from_dict(data)

SECTION 04

Evaluating a RAG pipeline end-to-end

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic

# Use Claude as the judge LLM
judge_llm = LangchainLLMWrapper(ChatAnthropic(
    model="claude-haiku-4-5-20251001",
    temperature=0
))

# Evaluate — this calls the judge LLM once per metric per sample
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=judge_llm,
)
print(results)
# {'faithfulness': 0.97, 'answer_relevancy': 0.89, 'context_precision': 0.83}

# Get per-sample scores as a DataFrame
df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision"]])

# Find your worst-performing questions
worst = df.nsmallest(5, "faithfulness")
print(worst[["question", "faithfulness", "answer_relevancy"]])

For real pipelines, generate the answer and contexts columns by running your actual RAG pipeline on each question before calling evaluate(). This lets you test any retriever/generator combination.
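That wiring can be sketched as a loop over your question set. Here retrieve() and generate() are hypothetical placeholders standing in for your actual retriever and generator:

```python
# Build evaluation rows by running a (hypothetical) RAG pipeline.
# Swap retrieve() and generate() for your real components.

def retrieve(question: str) -> list[str]:
    # Placeholder retriever: returns canned chunks.
    return [f"Context chunk about: {question}"]

def generate(question: str, contexts: list[str]) -> str:
    # Placeholder generator: returns a canned answer.
    return f"Answer to '{question}' based on {len(contexts)} chunk(s)."

questions = ["What is the capital of France?"]

rows = {"question": [], "answer": [], "contexts": []}
for q in questions:
    ctx = retrieve(q)
    rows["question"].append(q)
    rows["contexts"].append(ctx)
    rows["answer"].append(generate(q, ctx))

# dataset = Dataset.from_dict(rows)  # then pass to evaluate() as above
```

Because the rows capture whatever your pipeline produced, swapping in a different retriever or generator requires no changes to the evaluation code itself.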

SECTION 05

Interpreting and acting on scores

Scores under 0.7 signal a serious problem. Scores 0.7–0.85 are typical for early-stage pipelines with room to improve. Above 0.9 across all metrics is production-ready for most use cases.

Low Faithfulness → your generator is hallucinating. Fix: add explicit instructions ("answer only using the provided context"), use a less creative model, or lower temperature. Low faithfulness despite good context means the LLM is ignoring retrieved evidence.

Low Answer Relevancy → your generated answers are off-topic or verbose. Fix: tighten your system prompt, add examples of good answers, reduce max_tokens to prevent rambling.

Low Context Precision → your retriever returns noisy chunks. Fix: increase similarity threshold, use a better embedding model, try hybrid retrieval (BM25 + vector), or add a re-ranker.

Always segment your evaluation dataset by question type (factual lookup vs. reasoning vs. comparison). A single average score hides category-specific failures.
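Segmentation is a few lines once each sample carries a category label. A minimal sketch with illustrative scores (the category field and values are made up for the example):

```python
# Per-category averages from per-sample scores, so a single aggregate
# doesn't hide category-specific failures.
from collections import defaultdict

samples = [
    {"category": "factual",   "faithfulness": 0.95},
    {"category": "factual",   "faithfulness": 0.92},
    {"category": "reasoning", "faithfulness": 0.55},
    {"category": "reasoning", "faithfulness": 0.60},
]

by_cat = defaultdict(list)
for s in samples:
    by_cat[s["category"]].append(s["faithfulness"])

for cat, scores in by_cat.items():
    print(cat, round(sum(scores) / len(scores), 3))
# factual 0.935, reasoning 0.575 -- yet the overall average is 0.755,
# which would look merely "early-stage" rather than broken.
```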

SECTION 06

Custom metrics

from dataclasses import dataclass

from ragas.metrics.base import MetricWithLLM

# Example: measure if the answer cites specific data (custom requirement).
# Note: the MetricWithLLM interface has changed across ragas versions;
# adapt the base class and method signature to the version you have installed.
@dataclass
class CitationPresence(MetricWithLLM):
    name: str = "citation_presence"

    async def _ascore(self, row: dict, callbacks) -> float:
        answer = row["answer"]

        prompt = (
            "Does this answer include specific numbers, dates, or citations?\n"
            f"Answer: {answer}\n\n"
            "Return only: 1 (yes) or 0 (no)"
        )

        response = await self.llm.agenerate([prompt])
        text = response.generations[0][0].text.strip()
        return 1.0 if "1" in text else 0.0

citation_metric = CitationPresence()

# Use with evaluate()
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, citation_metric],
    llm=judge_llm,
)

SECTION 07

Gotchas

The judge LLM's quality sets the ceiling. If you use a weak judge (GPT-3.5, a small local model), scores become unreliable — the judge can't distinguish subtle faithfulness violations. Use Claude 3 Haiku or GPT-4o-mini minimum; use Claude 3.5 Sonnet for high-stakes evaluations.

Faithfulness doesn't catch all hallucinations. RAGAS checks if claims in the answer are supported by retrieved context. It doesn't check if the retrieved context itself is accurate. A confident wrong answer backed by wrong retrieved chunks gets a high faithfulness score.

Context Recall requires good ground truth. Writing accurate reference answers is the bottleneck. Bad ground truth (incomplete or ambiguous) produces misleading recall scores. Use domain experts to write at least 50–100 reference answers for your most important question types.

Evaluation cost adds up. Each metric calls the judge LLM once per sample. 1000 questions × 3 metrics = 3000 LLM calls. Budget for this in your CI/CD pipeline and use cheaper judge models for fast feedback loops.
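The budgeting arithmetic is worth automating so cost is visible before a run starts. A back-of-envelope sketch, assuming one judge call per metric per sample; the token count and price per million tokens are placeholders, not current rates:

```python
def eval_llm_calls(n_samples: int, n_metrics: int) -> int:
    """One judge-LLM call per metric per sample."""
    return n_samples * n_metrics

def eval_cost_usd(n_samples: int, n_metrics: int,
                  avg_tokens_per_call: int = 1500,
                  usd_per_million_tokens: float = 1.0) -> float:
    """Rough cost estimate; plug in your judge model's real pricing."""
    calls = eval_llm_calls(n_samples, n_metrics)
    return calls * avg_tokens_per_call * usd_per_million_tokens / 1_000_000

print(eval_llm_calls(1000, 3))  # 3000 calls
print(eval_cost_usd(1000, 3))   # 4.5 (USD, at the placeholder rate)
```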

RAGAS Metrics Reference

RAGAS (Retrieval Augmented Generation Assessment) provides a suite of reference-free metrics for evaluating RAG pipelines end-to-end. Unlike traditional NLP evaluation that requires gold-standard answers, RAGAS uses the LLM itself to judge response quality across multiple dimensions, making it practical for production evaluation where ground truth is unavailable.

Metric             | Measures                             | Range | Low Score Means
Faithfulness       | Facts in answer supported by context | 0–1   | Hallucination present
Answer Relevance   | Answer addresses the question        | 0–1   | Off-topic or incomplete
Context Precision  | Retrieved chunks are relevant        | 0–1   | Noisy retrieval
Context Recall     | All relevant info was retrieved      | 0–1   | Missing critical chunks
Answer Correctness | Factual accuracy vs. ground truth    | 0–1   | Wrong facts (needs GT)

Faithfulness is typically the most actionable metric for improving RAG systems. A low faithfulness score indicates that the LLM is generating claims not supported by the retrieved context — hallucinating facts rather than grounding its answer in the retrieved documents. Common causes include a context window too small to hold all retrieved chunks, retrieved chunks that partially address the question but leave gaps the model fills with parametric knowledge, and aggressive summarization that compresses away key supporting details.

Running RAGAS evaluations in a CI pipeline requires a curated test dataset of question-context-answer triples. Even without gold-standard answers, question-context pairs drawn from representative production queries provide meaningful signal. Tracking the score distribution over time — rather than a single aggregate number — reveals whether quality improvements on one metric come at the cost of another, which is common when tuning retrieval parameters.

Calibrating RAGAS evaluator prompts for your domain is an often-overlooked step that significantly affects score reliability. The default faithfulness evaluator prompt asks a general-purpose LLM to judge whether each claim in the answer is supported by the context. For highly technical domains — medical, legal, financial — the evaluator LLM may incorrectly judge domain-specific terminology as unsupported if it lacks domain knowledge. Using a stronger evaluator model or adding domain context to the evaluator prompt improves calibration and reduces false faithfulness failures.

RAGAS batch evaluation is typically run asynchronously to avoid blocking application response time. A common architecture uses a sampling approach: a random sample of production traces (1–5% of traffic) is sent to the evaluation pipeline in the background, results are stored in a time-series database, and alert rules trigger when any metric drops below threshold over a rolling window. This continuous evaluation loop catches quality regressions within hours of a bad deployment without requiring dedicated evaluation infrastructure running at full production volume.
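The alert rule in that loop reduces to a rolling mean over recent sampled scores. A minimal sketch, with illustrative window size and threshold (a real deployment would read scores from the time-series store rather than a list):

```python
# Alert when the rolling mean of a metric drops below threshold.
from collections import deque

class RollingAlert:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one sampled score; return True when the window is full
        and its mean has fallen below the threshold."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return len(self.scores) == self.scores.maxlen and mean < self.threshold

alert = RollingAlert(threshold=0.8, window=3)
stream = [0.95, 0.9, 0.92, 0.6, 0.55, 0.58]  # regression mid-stream
fired = [alert.observe(s) for s in stream]
print(fired)  # [False, False, False, False, True, True]
```

Requiring a full window before firing is what keeps a single bad sample from paging anyone; the trade-off is that detection lags by roughly the window length.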

RAGAS scores are most meaningful when compared to a baseline established at the start of a project rather than interpreted as absolute quality measures. A faithfulness score of 0.85 on a technical documentation RAG system may represent excellent performance for the domain, while the same score on a general-purpose FAQ bot might indicate significant room for improvement. Establishing domain-specific quality thresholds through human evaluation calibration is essential before using RAGAS scores to make deployment decisions.