Eval Frameworks

TruLens

TruLens is an LLM evaluation and observability library by TruEra. Instrument any LLM app with the TruChain or TruLlama wrappers, evaluate RAG quality with the RAG Triad (context relevance + groundedness + answer relevance), and view results in a local Streamlit dashboard.


SECTION 01

The RAG Triad

TruLens is best known for the RAG Triad — three metrics that together characterise RAG pipeline quality. The insight: RAG failures come from three distinct sources, each requiring a different measurement. Irrelevant retrieval is caught by context relevance, ungrounded generation by groundedness, and off-target answers by answer relevance.

All three need to be high for a healthy RAG system. A common failure mode: high answer relevance but low groundedness — the model gives relevant-sounding answers by hallucinating rather than grounding in retrieved facts.
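To make that failure-mode mapping concrete, here is a minimal pure-Python sketch. The score values and the `diagnose` helper are illustrative only, not part of the TruLens API:

```python
# Hypothetical per-record triad scores for the failure mode described above:
# relevant-sounding but hallucinated.
record = {"context_relevance": 0.9, "groundedness": 0.2, "answer_relevance": 0.95}

def diagnose(scores: dict, threshold: float = 0.5) -> list:
    """Map each low triad score to the failure source it isolates."""
    causes = {
        "context_relevance": "retrieval returned off-topic chunks",
        "groundedness": "answer hallucinates beyond the retrieved context",
        "answer_relevance": "answer does not address the question",
    }
    return [msg for metric, msg in causes.items() if scores[metric] < threshold]

print(diagnose(record))  # → ['answer hallucinates beyond the retrieved context']
```

An empty diagnosis list across a test set is the healthy case; a consistently non-empty entry points at which stage of the pipeline to fix first.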

SECTION 02

Instrumenting a LangChain app

from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI
import numpy as np

session = TruSession()
session.reset_database()

# Define feedback functions using the RAG Triad
provider = TruOpenAI(model_engine="gpt-4o-mini")

f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()          # question
    .on(TruChain.select_context(rag_chain))  # retrieved context (pass the chain itself)
    .aggregate(np.mean)  # average over retrieved chunks
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruChain.select_context(rag_chain))
    .on_output()         # final answer
    .aggregate(np.mean)
)
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Wrap your LangChain app
tru_rag = TruChain(
    rag_chain,
    app_name="my-rag",
    app_version="v1",
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

# Run queries — TruLens captures traces and runs evals
with tru_rag as recording:
    response = rag_chain.invoke({"question": "What is RAG?"})

SECTION 03

Instrumenting LlamaIndex

from trulens.apps.llamaindex import TruLlama
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI
import numpy as np

session = TruSession()
provider = TruOpenAI()

# Same feedback functions — TruLlama selects LlamaIndex-specific context
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)  # LlamaIndex node text
    .aggregate(np.mean)
)

tru_query_engine = TruLlama(
    query_engine,
    app_name="llamaindex-rag",
    app_version="v1",
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

# Batch evaluation over test set
with tru_query_engine as recording:
    for q in test_questions:
        query_engine.query(q)  # invoke the original engine inside the recorder

# View results
session.get_leaderboard(app_ids=["llamaindex-rag"])

SECTION 04

Feedback functions

TruLens feedback functions are Python callables that take some subset of (input, output, context) and return a float score in [0, 1]. Built-in providers include OpenAI, Hugging Face (for open-source judge models), Bedrock, and LiteLLM (which routes to Anthropic and many other model APIs).

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o-mini")

# Built-in feedback functions
provider.relevance           # answer relevance to the question
provider.context_relevance   # context relevance to the question
provider.groundedness_measure_with_cot_reasons  # answer grounded in context
provider.coherence           # logical coherence of the response
provider.harmfulness         # safety check (higher score = more harmful)
provider.conciseness         # penalise verbosity

# Custom feedback function: 1.0 if the generated SQL parses, else 0.0
def sql_is_valid(query: str, sql_output: str) -> float:
    import sqlparse
    try:
        parsed = sqlparse.parse(sql_output)
        return 1.0 if parsed else 0.0
    except Exception:
        return 0.0

f_sql = Feedback(sql_is_valid, name="SQL Validity").on_input_output()
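Custom feedback can also return a (score, metadata) tuple so per-record reasoning shows up alongside the score, in the style of the built-in _with_cot_reasons variants. A minimal sketch; the `keyword_coverage` heuristic is hypothetical, not a TruLens built-in:

```python
# Custom feedback with reasons: returns (score in [0, 1], metadata dict).
def keyword_coverage(question: str, answer: str) -> tuple:
    """Fraction of question keywords that appear in the answer."""
    stopwords = {"what", "is", "the", "a", "an", "how", "of", "to", "in"}
    keywords = [w for w in question.lower().rstrip("?").split() if w not in stopwords]
    hits = [w for w in keywords if w in answer.lower()]
    score = len(hits) / len(keywords) if keywords else 0.0
    return score, {"reason": f"matched {hits} out of {keywords}"}

score, meta = keyword_coverage("What is retrieval augmented generation?",
                               "Retrieval augmented generation grounds answers.")
print(score)  # → 1.0
```

Wrapping it follows the same pattern as above: `Feedback(keyword_coverage, name="Keyword Coverage").on_input_output()`.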

SECTION 05

Running evaluations

from trulens.core import TruSession

session = TruSession()

# Get results for a specific app. Feedback scores arrive as columns on the
# records DataFrame; the second return value lists the feedback column names.
records, feedback_cols = session.get_records_and_feedback(app_ids=["my-rag"])

# Show performance summary (the leaderboard is indexed by app name and version)
leaderboard = session.get_leaderboard()
print(leaderboard[["Context Relevance", "Groundedness", "Answer Relevance", "latency", "total_cost"]])

# Drill into likely hallucinations
low_groundedness = records[records["Groundedness"] < 0.5]
print(f"Records with likely hallucination: {len(low_groundedness)}")
print(low_groundedness[["input", "output", "Groundedness"]].head(10))
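Aggregate scores can also drive a simple regression gate in CI: flag a candidate version whose scores drop more than a tolerance below the baseline. A sketch with hard-coded scores (in practice these would come from the leaderboard):

```python
# Hypothetical aggregate scores for two tracked app versions.
baseline  = {"Groundedness": 0.82, "Answer Relevance": 0.90, "Context Relevance": 0.78}
candidate = {"Groundedness": 0.64, "Answer Relevance": 0.93, "Context Relevance": 0.80}

def regressions(base: dict, cand: dict, tolerance: float = 0.05) -> list:
    """Metrics where the candidate fell more than `tolerance` below baseline."""
    return [m for m in base if cand[m] < base[m] - tolerance]

print(regressions(baseline, candidate))  # → ['Groundedness']
```

A non-empty result fails the build, forcing a look at the regressed metric before the candidate ships.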

SECTION 06

The TruLens dashboard

from trulens.dashboard import run_dashboard

# Launch the Streamlit dashboard locally
run_dashboard(session)
# Opens http://localhost:8501

The TruLens dashboard shows: a leaderboard comparing app versions across all metrics, individual trace inspection (click any record to see retrieved context, answer, and per-metric scores with reasoning), latency distributions, and cost breakdowns.

The leaderboard is especially useful when comparing RAG configurations — different embedding models, chunk sizes, or retrievers — against each other on the same test set. Sort by "Groundedness" to find which configuration hallucinates least.

SECTION 07

Gotchas

Database persistence: By default, TruLens uses a local SQLite database. For shared team access or persistence across sessions, use PostgreSQL: TruSession(database_url="postgresql://...").

Feedback computation timing: By default, feedback is computed asynchronously after the app call. For small test runs, call session.wait_for_feedback_results() before reading scores.

Context selection: The select_context or select_source_nodes selectors are framework-specific. If your RAG app has a non-standard structure, use Select.RecordCalls to manually navigate the call tree and select the context variable.

Cost of evaluation: Running the full RAG Triad on 100 records with GPT-4o-mini takes ~300 LLM calls and costs around $0.10–$0.30. Budget this when designing CI eval suites.
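The arithmetic behind that estimate, as a sketch. Token counts per judge call are rough assumptions; the prices are GPT-4o-mini's published per-million-token rates at the time of writing:

```python
records, triad_metrics = 100, 3
calls = records * triad_metrics            # 300 judge calls for the full triad
in_tokens, out_tokens = 1_500, 200         # assumed prompt/completion sizes per call
price_in, price_out = 0.15, 0.60           # USD per million tokens (gpt-4o-mini)
cost = calls * (in_tokens * price_in + out_tokens * price_out) / 1_000_000
print(calls, round(cost, 2))  # → 300 0.1
```

Longer retrieved contexts push the prompt-token assumption up quickly, which is how the same run reaches the upper end of the range.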

TruLens Evaluation Metrics Reference

TruLens provides an evaluation and observability layer for LLM applications that tracks inputs, outputs, and intermediate steps while computing automated quality metrics. It supports both RAG pipeline evaluation and general LLM application monitoring through a unified feedback function interface.

Feedback Function  | Measures                    | Requires Ground Truth | LLM Judge Needed
Groundedness       | Answer supported by context | No                    | Yes
Context Relevance  | Retrieved docs match query  | No                    | Yes
Answer Relevance   | Answer addresses query      | No                    | Yes
Correctness        | Factual accuracy            | Yes (reference)       | Optional
Toxicity           | Harmful content presence    | No                    | Via classifier

TruLens uses a provider-agnostic feedback function architecture where the same evaluation logic can be backed by different LLM providers. The Groundedness feedback function, for example, can use OpenAI, Anthropic, or a local model as the judging LLM, making it flexible for deployments with specific provider constraints or cost requirements. Evaluating with a different provider than the one used for generation also reduces the risk of self-serving evaluation where a model is judged by a model with the same biases.
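The provider-swap idea can be sketched without TruLens at all: the same groundedness logic parameterised over an injected judge callable. Both judges below are stand-ins, not real provider clients:

```python
from typing import Callable

Judge = Callable[[str], float]  # takes a judging prompt, returns a score in [0, 1]

def groundedness(answer: str, context: str, judge: Judge) -> float:
    """Same evaluation logic regardless of which provider backs `judge`."""
    prompt = (f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
              "Score 0-1: is every claim in the answer supported by the context?")
    return judge(prompt)

# Stand-in judges simulating two different providers.
strict_judge: Judge = lambda prompt: 0.2
lenient_judge: Judge = lambda prompt: 0.9

print(groundedness("RAG is X.", "RAG is Y.", strict_judge))   # → 0.2
print(groundedness("RAG is X.", "RAG is Y.", lenient_judge))  # → 0.9
```

Swapping `judge` is the whole provider change; the prompt construction and scoring contract stay fixed.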

TruLens's leaderboard interface aggregates evaluation results across multiple experiment runs, making it easy to compare prompt variants, model choices, and retrieval configurations side by side. Each tracked application version receives aggregate scores for each feedback function, enabling data-driven decisions about which configuration to promote to production. The dashboard also surfaces individual low-scoring examples for each metric, enabling human review of specific failure cases without sifting through logs manually.

TruLens's RAG Triad packages the three most important RAG quality metrics — groundedness, context relevance, and answer relevance — as a cohesive evaluation suite. Running all three metrics on every RAG response provides a comprehensive quality picture: context relevance catches retrieval problems, groundedness catches hallucination problems, and answer relevance catches relevance and completeness problems. Tracking all three simultaneously reveals when optimizing for one metric degrades another, which is common when tuning retrieval parameters.

TruLens' asynchronous evaluation mode runs feedback functions after responses are returned to users, avoiding the latency overhead of synchronous evaluation in the critical path. The response is returned immediately while evaluation results are computed in background workers and associated with the logged trace. This architecture enables evaluation coverage of 100% of production traffic without adding any user-perceived latency. When evaluating with a slower or more expensive judge model, the asynchronous approach is essential for maintaining production latency SLOs.
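The asynchronous pattern can be sketched with a background worker: the request path only enqueues a trace, and scoring happens off the critical path. The `judge` here is a trivial stand-in for a real LLM call:

```python
import queue
import threading

eval_queue: queue.Queue = queue.Queue()
results = []

def judge(trace: dict) -> float:
    # Stand-in for a slow LLM judge; scores 1.0 iff an answer is present.
    return 1.0 if trace["answer"] else 0.0

def worker() -> None:
    while True:
        trace = eval_queue.get()
        if trace is None:          # sentinel: shut down the worker
            break
        results.append({"id": trace["id"], "score": judge(trace)})

t = threading.Thread(target=worker, daemon=True)
t.start()

def handle_request(request_id: int, answer: str) -> str:
    # Return to the user immediately; evaluation runs in the background.
    eval_queue.put({"id": request_id, "answer": answer})
    return answer

handle_request(1, "RAG grounds answers in retrieved context.")
eval_queue.put(None)               # flush for this demo
t.join()
print(results)  # → [{'id': 1, 'score': 1.0}]
```

The user-facing call never waits on `judge`; only the background worker pays its latency.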

Custom feedback functions in TruLens extend the pre-built metrics with application-specific quality dimensions. A feedback function is a Python function that accepts the application inputs, outputs, and optional ground truth, returning a score between 0 and 1 and an optional explanation string. Custom functions can call any external service — a domain-specific classifier, a business rules engine, a human annotation API — making TruLens's evaluation framework extensible to quality dimensions that are specific to an application's domain and cannot be captured by generic LLM-as-judge prompts.

Multi-modal evaluation support in TruLens extends groundedness and relevance assessment to image and multimodal LLM pipelines. For vision RAG systems that retrieve image documents alongside text and generate multimodal answers, TruLens can evaluate whether the answer's visual claims are grounded in the retrieved images using a multimodal LLM judge. This extension of the RAG Triad to multimodal pipelines is increasingly important as production applications incorporate image retrieval, document scanning, and chart analysis into their RAG architectures.