Experiment Tracking

Langfuse

Langfuse is an open-source LLM observability and evaluation platform. Trace LLM calls, score outputs, run evals, manage prompts, and monitor production costs — self-hostable with a clean SDK for Python and TypeScript.

Open-source · Self-hostable · Traces and scores as core primitives · Built-in prompt management


SECTION 01

What Langfuse tracks

Langfuse is an observability platform built specifically for LLM applications. Unlike general APM tools, it understands the structure of LLM calls: prompts, completions, token counts, model names, costs, and latency — all as first-class data.

The core data model: traces (a single user request, containing one or more spans) and generations (individual LLM API calls within a trace). A RAG pipeline trace might contain spans for retrieval, re-ranking, and generation — each tracked separately with their own latency and metadata.

Langfuse adds scores on top: you can attach numerical or categorical scores to any trace or generation, from human feedback, automated evals, or LLM-as-judge, and analyse them over time in the dashboard.
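When the decorator API doesn't fit, the same hierarchy can be built with the low-level client. A minimal sketch against the Langfuse v2 Python SDK (`client.trace`, `trace.span`, `trace.generation`); the function name and payloads are illustrative:

```python
def log_rag_trace(client, question, docs, answer, usage):
    """Record one RAG request as a trace containing a retrieval span
    and a generation. `client` is a configured langfuse.Langfuse instance."""
    trace = client.trace(name="rag-request", input={"question": question})

    # Retrieval step as a span with its own timing and payload
    span = trace.span(name="retrieval", input={"query": question})
    span.end(output={"n_docs": len(docs)})

    # LLM call as a generation: model, usage, and cost are first-class fields
    gen = trace.generation(
        name="answer",
        model="gpt-4o-mini",   # illustrative model name
        input=[{"role": "user", "content": question}],
        usage=usage,           # e.g. {"input": 150, "output": 80}
    )
    gen.end(output=answer)
    return trace
```

Ending each observation explicitly gives accurate per-step latency in the dashboard.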

SECTION 02

Instrumenting with the Python SDK

from langfuse.decorators import observe, langfuse_context

# The decorator API reads credentials from the environment:
# LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST
# (defaults to https://cloud.langfuse.com; set it for self-hosted instances).

@observe()  # automatically creates a trace
def rag_pipeline(question: str) -> str:
    # Each decorated function becomes a span in the trace
    context = retrieve_context(question)
    answer = call_llm(question, context)
    return answer

@observe(as_type="generation", name="llm-call")  # log as a generation, not a plain span
def call_llm(question: str, context: str) -> str:
    # Attach model and usage details to the current generation
    langfuse_context.update_current_observation(
        input={"question": question, "context": context[:200]},
        model="gpt-4o-mini",
        usage={"input": 150, "output": 80},
    )
    response = openai_client.chat.completions.create(...)
    return response.choices[0].message.content

# Call the pipeline — trace is automatically sent to Langfuse
result = rag_pipeline("What is RAG?")
langfuse_context.flush()  # ensure all buffered events are sent before exit
SECTION 03

Scoring and evaluation

from langfuse import Langfuse

langfuse = Langfuse(...)

# Score a trace manually (e.g. from human review)
langfuse.score(
    trace_id="trace-abc123",
    name="human-rating",
    value=4.0,          # numeric score
    comment="Good answer but missing one detail",
)

# Automated LLM-as-judge scoring
def score_trace_with_llm(trace_id: str, output: str, expected: str):
    prompt = f"""Rate this answer 0-10 for correctness.
Expected: {expected}
Actual: {output}"""
    score = float(openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip())
    langfuse.score(trace_id=trace_id, name="llm-judge-correctness", value=score)

# Batch evaluation over recent traces
traces = langfuse.fetch_traces(limit=100, tags=["production"]).data
for trace in traces:
    score_trace_with_llm(trace.id, trace.output, expected_outputs[trace.id])
SECTION 04

Prompt management

from langfuse import Langfuse

langfuse = Langfuse(...)

# Create a prompt in the Langfuse UI or via SDK
langfuse.create_prompt(
    name="rag-answer-prompt",
    prompt="""Answer the question based only on the context below.

Context: {{context}}

Question: {{question}}

Answer:""",
    labels=["production"],
    config={"model": "gpt-4o-mini", "temperature": 0.1},
)

# Fetch and use the prompt — SDK caches with TTL
prompt = langfuse.get_prompt("rag-answer-prompt", label="production")
formatted = prompt.compile(context=retrieved_docs, question=user_question)

response = openai_client.chat.completions.create(
    model=prompt.config["model"],
    messages=[{"role": "user", "content": formatted}],
    temperature=prompt.config["temperature"],
)

# Langfuse links the generation to the prompt version automatically

Prompt versioning means you can A/B test prompt changes and see exactly which prompt version each trace used.
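One way to run such an A/B test is to promote the two candidate versions under your own deployment labels and pick one per request. A sketch against the SDK's `get_prompt`; the labels `prod-a` and `prod-b` are hypothetical names you would assign yourself, not built-ins:

```python
import random

def pick_prompt_variant(client, name, labels=("prod-a", "prod-b"), rng=random):
    """Randomly select one of two labeled prompt versions for an A/B test.

    `client.get_prompt` is the Langfuse SDK call; each returned prompt
    carries its version, so traces record which variant served the request.
    """
    label = rng.choice(list(labels))
    prompt = client.get_prompt(name, label=label)
    return label, prompt
```

Because Langfuse links each generation to the prompt version used, scores can later be grouped by variant in the dashboard.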

SECTION 05

Cost tracking

Langfuse automatically calculates costs if you pass model name and token counts with each generation. It maintains a model cost table (updated for major providers) and shows cumulative cost per trace, per user, and per project over time.

# Log token usage and cost data
generation = langfuse.generation(
    trace_id=trace.id,
    name="final-answer",
    model="gpt-4o-mini",
    model_parameters={"temperature": 0.1, "max_tokens": 500},
    input=[{"role": "user", "content": prompt}],
    output=response_text,
    usage={
        "input": completion.usage.prompt_tokens,
        "output": completion.usage.completion_tokens,
        "total": completion.usage.total_tokens,
        # cost is computed automatically from model pricing table
    },
)

The dashboard shows cost breakdowns by model, time period, and user — essential for understanding where your LLM budget is going.

SECTION 06

Self-hosting

# Self-host with Docker Compose
git clone https://github.com/langfuse/langfuse.git
cd langfuse
cp .env.example .env
# Edit .env: set NEXTAUTH_SECRET, SALT, DATABASE_URL (PostgreSQL)

docker compose up -d
# Langfuse UI at http://localhost:3000

Langfuse v2 stores everything (traces, scores, prompts) in PostgreSQL and runs as a single service; v3 additionally uses ClickHouse for analytical queries, Redis for queuing, and S3-compatible blob storage. For production self-hosting: use a managed Postgres (RDS, Cloud SQL), set up proper authentication, and configure SMTP for user invitations.
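A minimal `.env` sketch for a v2-style Compose deployment; the variable names follow the Langfuse self-hosting docs, the values are placeholders:

```shell
# .env — minimal self-hosted configuration (placeholder values)
DATABASE_URL=postgresql://user:password@db:5432/langfuse
NEXTAUTH_URL=http://localhost:3000
NEXTAUTH_SECRET=changeme-long-random-string   # generate e.g. with `openssl rand -base64 32`
SALT=changeme-another-random-string           # used to hash API keys
```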

The cloud version (cloud.langfuse.com) has a generous free tier (50k observations/month). For data privacy requirements, self-hosting is the right choice — all your prompts and completions stay on your infrastructure.

SECTION 07

Gotchas

Async flushing: Langfuse sends events asynchronously. In short-lived scripts (serverless functions, one-off scripts), always call langfuse.flush() at the end, otherwise events may be lost.
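In a Lambda-style handler, a `try/finally` guarantees the flush happens even when the request fails. A sketch; the `process` callable and handler signature are placeholders for your own code:

```python
def handler(event, langfuse_client, process):
    """Serverless-style entry point: always flush before the runtime freezes."""
    try:
        return process(event)  # your traced application logic
    finally:
        # Runs even on exceptions; without it, buffered events can be
        # dropped when the execution environment is suspended or torn down.
        langfuse_client.flush()
```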

Nested trace context: The @observe() decorator automatically threads trace context through nested calls. If you use threading or async tasks, use langfuse.trace() manually and pass trace IDs explicitly.
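A sketch of the manual pattern: create the trace up front, then pass its id into each worker and log stateless spans against it. This assumes the Langfuse v2 low-level API (`client.trace`, `client.span(trace_id=...)`); the worker body is illustrative:

```python
import threading

def run_workers(client, questions):
    """Fan retrieval work out to threads while keeping one shared trace."""
    trace = client.trace(name="parallel-retrieval")

    def worker(q):
        # Stateless call: attach the span to the trace by id, since the
        # decorator's context does not cross thread boundaries.
        client.span(trace_id=trace.id, name="retrieve", input={"query": q})

    threads = [threading.Thread(target=worker, args=(q,)) for q in questions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace.id
```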

Token counting: If your inference server doesn't return token counts (e.g. local models via Ollama), you need to compute them client-side with a tokenizer. Without usage data, cost calculations are unavailable.
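A sketch of the client-side counting; `encode` is any tokenizer callable returning token ids, e.g. tiktoken's `encoding_for_model(...).encode` (which only approximates non-OpenAI models):

```python
def count_usage(prompt, completion, encode):
    """Build a Langfuse-style usage dict from a tokenizer callable."""
    input_tokens = len(encode(prompt))
    output_tokens = len(encode(completion))
    return {
        "input": input_tokens,
        "output": output_tokens,
        "total": input_tokens + output_tokens,
    }
```

Pass the result as the `usage` field on the generation; for models missing from the built-in pricing table, Langfuse also lets you define custom model prices so costs can still be computed.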

Score names: Define a consistent vocabulary for score names across your team (e.g. "correctness", "faithfulness", "relevance") rather than letting names proliferate. Inconsistent names make dashboard analysis painful.
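One lightweight way to enforce the vocabulary in code is a shared wrapper around `langfuse.score()`. This is a team-convention sketch, not a Langfuse feature; the name set is illustrative:

```python
# Single source of truth for score names used across services and evals.
SCORE_NAMES = frozenset({"correctness", "faithfulness", "relevance", "human-rating"})

def submit_score(client, trace_id, name, value, **kwargs):
    """Wrapper around langfuse.score() that rejects off-vocabulary names."""
    if name not in SCORE_NAMES:
        raise ValueError(f"Unknown score name {name!r}; allowed: {sorted(SCORE_NAMES)}")
    client.score(trace_id=trace_id, name=name, value=value, **kwargs)
```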

Langfuse Observability Architecture

Langfuse is an open-source LLM observability platform that provides detailed tracing, evaluation, and dataset management for production AI applications. Unlike hosted-only solutions, Langfuse can be self-hosted, making it suitable for organizations with strict data privacy requirements that prohibit sending production inference data to third-party services.

Feature           | Purpose                    | Self-hosted | Integration
------------------|----------------------------|-------------|-----------------------------
Traces            | Full pipeline visibility   | Yes         | SDK / LangChain / LlamaIndex
Scores            | Quality annotation         | Yes         | API / UI
Datasets          | Evaluation sets            | Yes         | Python SDK
Prompt management | Versioned prompt registry  | Yes         | SDK pull/push
User feedback     | Thumbs up/down collection  | Yes         | Frontend widget

Langfuse's trace data model uses a three-level hierarchy: traces (top-level, one per user request), spans (intermediate steps within a trace), and generations (individual LLM calls). This hierarchy maps naturally to RAG and agent pipelines: the trace captures the full user interaction, retrieval and preprocessing steps are spans, and each LLM call is a generation with automatically captured token counts, latency, and cost. The nested structure enables drill-down analysis from high-level quality trends to individual generation-level debugging.

Prompt management in Langfuse enables version-controlled prompt deployment with rollback capability. Prompts are stored in the Langfuse registry with auto-incrementing version numbers and deployment labels (such as "production"); application code fetches the labeled version at runtime via the SDK rather than hardcoding prompt text. When a prompt change is needed, a new version is created, tested against evaluation datasets, and promoted to production without any code deployment. This decouples prompt iteration cycles from software release cycles, allowing product teams to improve prompt quality without engineering involvement in routine cases.

Langfuse integrates natively with LangChain and LlamaIndex through callback handlers that automatically capture the full execution trace of any chain or agent without manual instrumentation code. Wrapping a chain with the Langfuse callback handler adds a single line to existing code, and Langfuse captures every LLM call, tool invocation, and retrieval operation with their inputs, outputs, and latencies. For teams already using LangChain or LlamaIndex, this near-zero-effort integration path makes Langfuse one of the fastest observability solutions to get running in production.

Langfuse scores can be attached to traces programmatically from any part of the application stack — the backend API that processes requests, a separate evaluation microservice, or a human feedback collection frontend. The score data model supports numeric scores (for automated metrics with continuous ranges), boolean scores (for pass/fail assertions), and categorical scores (for labeled quality dimensions). Mixing automated scores from LLM judges with human annotation scores in the same data store enables analysis of how well automated evaluation correlates with human judgment for calibrating the automated metrics.
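The three score shapes can be sketched as a small payload builder. The `data_type` values (`NUMERIC`, `BOOLEAN`, `CATEGORICAL`) follow the Langfuse scores API, but verify them against your SDK version:

```python
def build_score(trace_id, name, value):
    """Infer the Langfuse score data_type from the Python value type."""
    if isinstance(value, bool):          # check bool before int/float
        data_type = "BOOLEAN"
    elif isinstance(value, (int, float)):
        data_type = "NUMERIC"
    elif isinstance(value, str):
        data_type = "CATEGORICAL"
    else:
        raise TypeError(f"Unsupported score value: {value!r}")
    return {"trace_id": trace_id, "name": name, "value": value, "data_type": data_type}
```

The resulting dict can be passed straight to `langfuse.score(**payload)`, keeping the type inference in one place.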

Langfuse's self-hosted deployment uses Docker Compose for single-server installations or Kubernetes Helm charts for production-scale deployments. The backend stores trace data in PostgreSQL (for structured metadata and scores) and ClickHouse (for high-volume, time-series aggregation queries). The ClickHouse component handles the analytical workloads — aggregating metrics over time, computing percentile latency distributions, and identifying cost trends — that would be prohibitively slow on a transactional PostgreSQL database with millions of trace records.

Team collaboration in Langfuse uses project-based access control where team members are assigned roles (owner, member, viewer) with corresponding permissions. Owners manage project settings and API keys; members add scores and manage datasets; viewers can inspect traces and results without modifying evaluation data. This role hierarchy supports workflows where data scientists configure evaluation pipelines, product managers review quality trends in the dashboard, and developers debug specific trace failures — each role accessing the information they need without requiring full platform access.