LangChain's platform for debugging, testing, and monitoring LLM applications — trace every LLM call, build evaluation datasets from production traffic, run regression tests on prompts, and track quality over time.
LLM applications fail in ways that are invisible without observability tooling. A user reports "the chatbot gave a wrong answer" — but which LLM call failed? Was it the retrieval step that pulled the wrong context, the reranker that ranked them poorly, or the generator that ignored good context? Without traces, debugging requires reproducing the issue manually and adding print statements.
LangSmith traces every LLM call, tool invocation, and chain step — automatically, with no code changes needed if you use LangChain. Each trace shows the full input/output of every step, latency, token counts, and cost. You can filter to failed traces, share traces with teammates, and jump directly to the problematic step.
Beyond debugging, LangSmith enables: dataset curation (collect real user queries from production traces), regression testing (run your eval dataset on every prompt change), and prompt management (version-control prompts and A/B test them).
```bash
pip install langsmith langchain-anthropic
```

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"  # groups traces

# That's it — all LangChain calls are now traced automatically.
# For non-LangChain code, use the @traceable decorator:
from langsmith import traceable
import anthropic

client = anthropic.Anthropic()

@traceable(name="claude_call", run_type="llm")
def call_claude(prompt: str, system: str = "") -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

@traceable(name="rag_pipeline")
def answer_question(question: str, context: str) -> str:
    system = "Answer only using the provided context."
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_claude(prompt, system)

# This creates a trace with nested spans:
# rag_pipeline → claude_call (with full input/output, tokens, latency)
result = answer_question("What is RAGAS?", "RAGAS is a RAG evaluation framework...")
```
In the LangSmith UI, each trace shows a waterfall of spans. For a RAG pipeline you'll see something like:

```text
rag_pipeline (1.2s total)
├── retriever (0.3s) — query: "what is RAGAS?"
│   └── vector_search: returned 3 chunks
├── reranker (0.4s) — reranked to top 2
└── claude_call (0.5s)
    ├── Input: "Context:\n[chunk 1]\n[chunk 2]\n\nQuestion: what is RAGAS?"
    ├── Output: "RAGAS is a reference-free RAG evaluation framework..."
    └── Tokens: 312 prompt + 89 completion = 401 total, $0.0008
```
Filter techniques that save hours: filter by `error=true` to see only failed runs; sort by latency descending to find slow outliers; use metadata filtering to see traces for a specific user or session. Click any span to see the exact prompt, then copy it to test in the playground.
```python
# Add metadata to traces for filtering
from langsmith import traceable

@traceable(metadata={"user_id": "u123", "session_id": "s456"})
def handle_request(query: str) -> str:
    return answer_question(query, retrieve(query))  # retrieve() is your retriever
```
```python
from langsmith import Client

client = Client()

# Create a dataset from scratch
dataset = client.create_dataset(
    "rag-eval-v1",
    description="RAG pipeline evaluation questions",
)

# Add examples manually
client.create_examples(
    inputs=[
        {"question": "What is constitutional AI?"},
        {"question": "How does GGUF quantization work?"},
    ],
    outputs=[
        {"answer": "Constitutional AI is Anthropic's method for training helpful, harmless assistants using principles-guided self-critique."},
        {"answer": "GGUF quantizes model weights to 4-bit or 8-bit integers, reducing memory by 4-8× with minimal quality loss."},
    ],
    dataset_id=dataset.id,
)

# Or populate from production traces — find good/bad examples:
runs = client.list_runs(
    project_name="my-rag-app",
    execution_order=1,  # top-level runs only
    filter='and(gt(total_tokens, 100), lt(latency, 2.0))',
)
for run in list(runs)[:20]:
    # Add to dataset with human-reviewed label
    client.create_examples(
        inputs=[run.inputs],
        outputs=[run.outputs],
        dataset_id=dataset.id,
    )
```
```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# Define what you're evaluating
def rag_pipeline(inputs: dict) -> dict:
    question = inputs["question"]
    context = retrieve(question)  # your retriever
    answer = answer_question(question, context)
    return {"answer": answer}

# Define evaluators
correctness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "correctness": "Is the answer factually correct based on the reference answer?"
        }
    },
)

# Run evaluation — calls your pipeline on every dataset example
results = evaluate(
    rag_pipeline,
    data="rag-eval-v1",
    evaluators=[correctness_evaluator],
    experiment_prefix="prompt-v2",  # tag this run
    max_concurrency=4,
)
print(results.to_pandas()[["inputs.question", "feedback.correctness"]].head())
```
# Compare experiments in the LangSmith UI: prompt-v1 vs prompt-v2
```python
from langsmith import Client

client = Client()

# Query aggregate project metrics (include_stats is SDK-version dependent;
# without it the returned project object may omit the stats fields)
project_stats = client.read_project(project_name="my-rag-app", include_stats=True)
print(f"Total runs: {project_stats.run_count}")
print(f"Error rate: {project_stats.error_rate:.1%}")
print(f"Latency p50: {project_stats.latency_p50}")
print(f"Total tokens: {project_stats.total_tokens}")

# Find high-latency runs to investigate
slow_runs = client.list_runs(
    project_name="my-rag-app",
    filter="gt(latency, 5.0)",  # >5 seconds
    limit=10,
)

# Add human feedback programmatically (e.g. thumbs up/down from UI)
client.create_feedback(
    run_id="run-uuid-here",
    key="user_rating",
    score=1.0,  # or 0.0 for negative
    comment="User clicked thumbs up",
)

# Fetch feedback stats — track quality trends over time
feedback_stats = client.list_feedback(
    run_ids=[r.id for r in client.list_runs(project_name="my-rag-app", limit=100)],
    feedback_key="user_rating",
)
```
Set up LangSmith automations (in the UI) to alert via email or Slack when error rate exceeds a threshold or when latency p95 spikes.
Tracing slows down local development if the network is slow. LangSmith traces are sent asynchronously, so in theory they don't add latency. In practice, if the LangSmith API is slow or you're on a bad connection, you'll notice delays. Set LANGCHAIN_TRACING_V2=false in unit tests and only enable tracing in integration tests and production.
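One way to make that toggle explicit is a small context manager, a local helper rather than anything in the LangSmith SDK, that forces tracing off for a block of test code and restores the previous setting afterwards:

```python
import os
from contextlib import contextmanager

@contextmanager
def tracing_disabled():
    """Temporarily force LangSmith tracing off, restoring the old value after."""
    prev = os.environ.get("LANGCHAIN_TRACING_V2")
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("LANGCHAIN_TRACING_V2", None)
        else:
            os.environ["LANGCHAIN_TRACING_V2"] = prev
```

Wrap unit-test bodies in this (or use an equivalent autouse pytest fixture) so a slow LangSmith endpoint can never stall the test suite.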
PII ends up in traces by default. Every input/output is stored. If your users send personal information, configure redaction before enabling tracing in production. Use the hide_inputs/hide_outputs options on @traceable decorators for sensitive spans, or implement a custom redaction callback.
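As a sketch of the custom-redaction route, here is a callable that masks email-shaped strings before anything leaves the process; the `redact` helper is our own invention, and the wire-up is shown as a comment because the exact hook depends on your SDK version:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(payload: dict) -> dict:
    # Mask anything email-shaped in string fields before upload
    return {k: EMAIL.sub("[EMAIL]", v) if isinstance(v, str) else v
            for k, v in payload.items()}

# Hypothetical wire-up (check your SDK version for the exact hook):
# client = Client(hide_inputs=redact, hide_outputs=redact)
```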
Dataset drift makes old evals misleading. Evaluation datasets built from early production traffic may not represent current user behaviour. Refresh your dataset quarterly by sampling recent production traces, especially after product changes that attract new user segments.
LangSmith provides tracing, evaluation, and dataset management for LLM applications built with LangChain or any Python/TypeScript application using the LangSmith SDK. It captures the full execution trace of an LLM pipeline — including every prompt sent, every model response received, and every tool call made — enabling both real-time debugging and offline evaluation workflows.
| Feature | Purpose | Key Metric | Use Case |
|---|---|---|---|
| Tracing | Full pipeline visibility | Latency per step | Debugging, optimization |
| Datasets | Curated test cases | Coverage, diversity | Regression testing |
| Evaluators | Automated quality scoring | Pass rate, score distribution | CI/CD quality gates |
| Human Annotation | Manual quality labeling | Agreement rate | Ground truth creation |
| Playground | Interactive prompt testing | — | Prompt engineering |
LangSmith's dataset management workflow supports a virtuous cycle of quality improvement. Production traces that expose failures or unexpected behavior can be saved directly to evaluation datasets with one click. These datasets feed into automated evaluators that run on every code change, preventing regressions from reaching production. Over time, the dataset grows to cover an increasingly representative sample of real-world inputs, making the evaluation suite progressively more reliable as a quality signal.
The evaluator framework in LangSmith supports both reference-based evaluation (comparing model output to a gold standard answer) and reference-free evaluation (judging output quality without a ground truth). LLM-as-judge evaluators are particularly useful for assessing dimensions like helpfulness, tone, and factual accuracy that are difficult to measure with string matching. Custom evaluators can be defined as Python functions that receive the input, output, and optional reference, returning a numeric score and optional reasoning string.
LangSmith's online evaluation feature allows evaluators to run automatically against production traffic samples as requests arrive, providing continuous quality monitoring without requiring explicit test dataset maintenance. By sampling a configurable percentage of production traces and running them through an LLM-as-judge evaluator, teams can detect quality regressions within hours of a deployment rather than waiting for user complaints or manual review cycles to surface problems.
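The sample rate itself is configured in the LangSmith UI; if you need the same behaviour inside your own pipeline, a deterministic hash gate (entirely our own sketch, not an SDK API) keeps the sampled set stable across retries of the same trace:

```python
import hashlib

SAMPLE_RATE = 0.05  # judge roughly 5% of production traces

def sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    # Hash the trace id into [0, 1) so a given trace is always in or always out
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0x100000000
    return bucket < rate
```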
Prompt versioning in LangSmith's Hub stores named, versioned prompt templates alongside their evaluation results. When a prompt change improves average evaluation scores, the new version can be tagged as the production version and pulled directly in application code via the Hub API. This creates a complete audit trail of prompt evolution: who changed what, when, and what the measured quality impact was — closing the feedback loop between prompt engineering experimentation and production deployment.