LangChain's platform for debugging, testing, and monitoring LLM applications — trace every LLM call, build evaluation datasets from production traffic, run regression tests on prompts, and track quality over time.
LLM applications fail in ways that are invisible without observability tooling. A user reports "the chatbot gave a wrong answer" — but which LLM call failed? Was it the retrieval step that pulled the wrong context, the reranker that ranked them poorly, or the generator that ignored good context? Without traces, debugging requires reproducing the issue manually and adding print statements.
LangSmith traces every LLM call, tool invocation, and chain step — automatically, with no code changes needed if you use LangChain. Each trace shows the full input/output of every step, latency, token counts, and cost. You can filter to failed traces, share traces with teammates, and jump directly to the problematic step.
Beyond debugging, LangSmith enables: dataset curation (collect real user queries from production traces), regression testing (run your eval dataset on every prompt change), and prompt management (version-control prompts and A/B test them).
```bash
pip install langsmith langchain-anthropic
```

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"  # groups traces

# That's it — all LangChain calls are now traced automatically.
# For non-LangChain code, use the @traceable decorator:
from langsmith import traceable
import anthropic

client = anthropic.Anthropic()

@traceable(name="claude_call", run_type="llm")
def call_claude(prompt: str, system: str = "") -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

@traceable(name="rag_pipeline")
def answer_question(question: str, context: str) -> str:
    system = "Answer only using the provided context."
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_claude(prompt, system)

# This creates a trace with nested spans:
# rag_pipeline → claude_call (with full input/output, tokens, latency)
result = answer_question("What is RAGAS?", "RAGAS is a RAG evaluation framework...")
```
In the LangSmith UI, each trace shows a waterfall of spans. For a RAG pipeline you'll see something like:

```text
rag_pipeline (1.2s total)
├── retriever (0.3s) — query: "what is RAGAS?"
│   └── vector_search: returned 3 chunks
├── reranker (0.4s) — reranked to top 2
└── claude_call (0.5s)
    ├── Input: "Context:\n[chunk 1]\n[chunk 2]\n\nQuestion: what is RAGAS?"
    ├── Output: "RAGAS is a reference-free RAG evaluation framework..."
    └── Tokens: 312 prompt + 89 completion = 401 total, $0.0008
```
Filter techniques that save hours: filter by `error=true` to see only failed runs; sort by latency descending to find slow outliers; use metadata filtering to see traces for a specific user or session. Click any span to see the exact prompt, then copy it to test in the playground.
```python
# Add metadata to traces for filtering
from langsmith import traceable

@traceable(metadata={"user_id": "u123", "session_id": "s456"})
def handle_request(query: str) -> str:
    return answer_question(query, retrieve(query))  # retrieve() is your retriever
```
```python
from langsmith import Client

client = Client()

# Create a dataset from scratch
dataset = client.create_dataset(
    "rag-eval-v1",
    description="RAG pipeline evaluation questions",
)

# Add examples manually
client.create_examples(
    inputs=[
        {"question": "What is constitutional AI?"},
        {"question": "How does GGUF quantization work?"},
    ],
    outputs=[
        {"answer": "Constitutional AI is Anthropic's method for training helpful, harmless assistants using principles-guided self-critique."},
        {"answer": "GGUF quantizes model weights to 4-bit or 8-bit integers, reducing memory by 4-8× with minimal quality loss."},
    ],
    dataset_id=dataset.id,
)

# Or populate from production traces — find good/bad examples:
runs = client.list_runs(
    project_name="my-rag-app",
    execution_order=1,  # top-level runs only
    filter='and(gt(total_tokens, 100), lt(latency, 2.0))',
)
for run in list(runs)[:20]:
    # Add to dataset with human-reviewed label
    client.create_examples(
        inputs=[run.inputs],
        outputs=[run.outputs],
        dataset_id=dataset.id,
    )
```
```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# Define what you're evaluating
def rag_pipeline(inputs: dict) -> dict:
    question = inputs["question"]
    context = retrieve(question)  # your retriever
    answer = answer_question(question, context)
    return {"answer": answer}

# Define evaluators
correctness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "correctness": "Is the answer factually correct based on the reference answer?"
        }
    },
)

# Run evaluation — calls your pipeline on every dataset example
results = evaluate(
    rag_pipeline,
    data="rag-eval-v1",
    evaluators=[correctness_evaluator],
    experiment_prefix="prompt-v2",  # tag this run
    max_concurrency=4,
)
print(results.to_pandas()[["inputs.question", "feedback.correctness"]].head())
```
# Compare experiments in the LangSmith UI: prompt-v1 vs prompt-v2
```python
from langsmith import Client

client = Client()

# Query aggregate project metrics (include_stats is SDK-version dependent;
# without it the returned project object may omit the stats fields)
project_stats = client.read_project(project_name="my-rag-app", include_stats=True)
print(f"Total runs: {project_stats.run_count}")
print(f"Error rate: {project_stats.error_rate:.1%}")
print(f"Latency p50: {project_stats.latency_p50}")
print(f"Total tokens: {project_stats.total_tokens}")

# Find high-latency runs to investigate
slow_runs = client.list_runs(
    project_name="my-rag-app",
    filter="gt(latency, 5.0)",  # >5 seconds
    limit=10,
)

# Add human feedback programmatically (e.g. thumbs up/down from UI)
client.create_feedback(
    run_id="run-uuid-here",
    key="user_rating",
    score=1.0,  # or 0.0 for negative
    comment="User clicked thumbs up",
)

# Fetch feedback stats — track quality trends over time
feedback_stats = client.list_feedback(
    run_ids=[r.id for r in client.list_runs(project_name="my-rag-app", limit=100)],
    feedback_key="user_rating",
)
```
Set up LangSmith automations (in the UI) to alert via email or Slack when error rate exceeds a threshold or when latency p95 spikes.
Tracing slows down local development if the network is slow. LangSmith traces are sent asynchronously, so in theory they don't add latency. In practice, if the LangSmith API is slow or you're on a bad connection, you'll notice delays. Set LANGCHAIN_TRACING_V2=false in unit tests and only enable tracing in integration tests and production.
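One way to make that toggle explicit is a small context manager, a local helper rather than anything in the LangSmith SDK, that forces tracing off for a block of test code and restores the previous setting afterwards:

```python
import os
from contextlib import contextmanager

@contextmanager
def tracing_disabled():
    """Temporarily force LangSmith tracing off, restoring the old value after."""
    prev = os.environ.get("LANGCHAIN_TRACING_V2")
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("LANGCHAIN_TRACING_V2", None)
        else:
            os.environ["LANGCHAIN_TRACING_V2"] = prev
```

Wrap unit-test bodies in this (or use an equivalent autouse pytest fixture) so a slow LangSmith endpoint can never stall the test suite.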
PII ends up in traces by default. Every input/output is stored. If your users send personal information, configure redaction before enabling tracing in production. Use the hide_inputs/hide_outputs options on @traceable decorators for sensitive spans, or implement a custom redaction callback.
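As a sketch of the custom-redaction route, here is a callable that masks email-shaped strings before anything leaves the process; the `redact` helper is our own invention, and the wire-up is shown as a comment because the exact hook depends on your SDK version:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(payload: dict) -> dict:
    # Mask anything email-shaped in string fields before upload
    return {k: EMAIL.sub("[EMAIL]", v) if isinstance(v, str) else v
            for k, v in payload.items()}

# Hypothetical wire-up (check your SDK version for the exact hook):
# client = Client(hide_inputs=redact, hide_outputs=redact)
```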
Dataset drift makes old evals misleading. Evaluation datasets built from early production traffic may not represent current user behaviour. Refresh your dataset quarterly by sampling recent production traces, especially after product changes that attract new user segments.
LangSmith provides tracing, evaluation, and dataset management for LLM applications built with LangChain or any Python/TypeScript application using the LangSmith SDK. It captures the full execution trace of an LLM pipeline — including every prompt sent, every model response received, and every tool call made — enabling both real-time debugging and offline evaluation workflows.
| Feature | Purpose | Key Metric | Use Case |
|---|---|---|---|
| Tracing | Full pipeline visibility | Latency per step | Debugging, optimization |
| Datasets | Curated test cases | Coverage, diversity | Regression testing |
| Evaluators | Automated quality scoring | Pass rate, score distribution | CI/CD quality gates |
| Human Annotation | Manual quality labeling | Agreement rate | Ground truth creation |
| Playground | Interactive prompt testing | — | Prompt engineering |
LangSmith's dataset management workflow supports a virtuous cycle of quality improvement. Production traces that expose failures or unexpected behavior can be saved directly to evaluation datasets with one click. These datasets feed into automated evaluators that run on every code change, preventing regressions from reaching production. Over time, the dataset grows to cover an increasingly representative sample of real-world inputs, making the evaluation suite progressively more reliable as a quality signal.
The evaluator framework in LangSmith supports both reference-based evaluation (comparing model output to a gold standard answer) and reference-free evaluation (judging output quality without a ground truth). LLM-as-judge evaluators are particularly useful for assessing dimensions like helpfulness, tone, and factual accuracy that are difficult to measure with string matching. Custom evaluators can be defined as Python functions that receive the input, output, and optional reference, returning a numeric score and optional reasoning string.
LangSmith's online evaluation feature allows evaluators to run automatically against production traffic samples as requests arrive, providing continuous quality monitoring without requiring explicit test dataset maintenance. By sampling a configurable percentage of production traces and running them through an LLM-as-judge evaluator, teams can detect quality regressions within hours of a deployment rather than waiting for user complaints or manual review cycles to surface problems.
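The sample rate itself is configured in the LangSmith UI; if you need the same behaviour inside your own pipeline, a deterministic hash gate (entirely our own sketch, not an SDK API) keeps the sampled set stable across retries of the same trace:

```python
import hashlib

SAMPLE_RATE = 0.05  # judge roughly 5% of production traces

def sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    # Hash the trace id into [0, 1) so a given trace is always in or always out
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0x100000000
    return bucket < rate
```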
Prompt versioning in LangSmith's Hub stores named, versioned prompt templates alongside their evaluation results. When a prompt change improves average evaluation scores, the new version can be tagged as the production version and pulled directly in application code via the Hub API. This creates a complete audit trail of prompt evolution: who changed what, when, and what the measured quality impact was — closing the feedback loop between prompt engineering experimentation and production deployment.