02 — Experiment
MLOps & Experiment Tracking
Experiment tracking for AI is non-negotiable. You need to log prompts, parameters, results, costs, latencies, and human evaluations. Track runs so you can answer "which prompt worked best?" without guessing.
What to Track
| Dimension | Why it matters | Example |
|---|---|---|
| Prompts | Small changes = big results | "Always reason step-by-step" vs without |
| Parameters | Temperature, max_tokens, and stop sequences change behavior | T=0 vs T=1 for the same task |
| Model version | Latency, cost, and quality trade off against each other | Claude 3.5 Sonnet vs Haiku |
| Evaluation scores | How good is this output? Auto or human | BLEU, exact match, human rating |
| Costs | Track API spend per experiment | $0.05 vs $0.50 per request |
| Latency | User experience depends on it | P50=200ms, P99=800ms |
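Costs and latency percentiles can be computed locally from what you already log. A minimal sketch; the per-million-token prices here are illustrative placeholders, not current list prices:

```python
# Illustrative per-million-token prices -- check your provider's price sheet.
PRICES = {"claude-3-5-sonnet": {"input": 3.00, "output": 15.00}}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request from token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def percentile(latencies_ms: list, pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=0.99 for P99."""
    ranked = sorted(latencies_ms)
    idx = min(len(ranked) - 1, int(pct * len(ranked)))
    return ranked[idx]

cost = request_cost("claude-3-5-sonnet", 1_200, 400)
print(f"${cost:.4f}")  # (1200*3.00 + 400*15.00) / 1e6 = $0.0096
latencies = [180, 210, 195, 220, 790, 205]
print(percentile(latencies, 0.50), percentile(latencies, 0.99))
```

Log these alongside the MLflow metrics below and cost regressions show up in run comparisons, not in the monthly invoice.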
MLflow for LLM Experiments
MLflow is the most popular open-source experiment tracker. Track runs, compare results, and register model artifacts. Example:
```python
import mlflow
import mlflow.anthropic
from anthropic import Anthropic

mlflow.set_experiment("rag-pipeline-v2")
mlflow.anthropic.autolog()  # optional: trace Anthropic calls automatically (recent MLflow versions)
client = Anthropic()

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({
        "model": "claude-3-5-sonnet-20241022",
        "temperature": 0.0,
        "chunk_size": 512,
        "top_k": 5,
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "What is RAG?"}],
    )
    mlflow.log_metrics({
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    mlflow.log_text(response.content[0].text, "response.txt")
```
Alternative Tools
- Weights & Biases (W&B): Polished cloud UI. Great for team collaboration and real-time dashboards. Paid tier.
- Neptune: Similar to W&B. Good metadata organization and query interface.
- Aim: Open source, fast, local-first. Growing ecosystem.
- Guild AI: Experiment tracking + automation. Less mature than MLflow.
03 — Abstraction
Integration Standards & API Abstraction
Different model providers have different APIs, parameters, and capabilities. OpenAI uses function_calling. Anthropic uses tool_use. Open-source models often support neither. Building on top of one provider locks you in.
OpenAI-Compatible APIs
The OpenAI API has become a de facto standard. Many providers now offer OpenAI-compatible endpoints:
- Anthropic (partial): Messages API is close but not identical. Anthropic SDK recommended.
- Open-source models: vLLM, Ollama, and text-generation-webui offer OpenAI-compat endpoints.
- Anyscale, Together AI: Open-source model hosting with OpenAI API.
Benefit: write once, swap providers by changing an endpoint URL. Drawback: lowest-common-denominator features (can't use cutting-edge provider-specific features).
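The "swap providers by changing an endpoint URL" point can be made concrete without any SDK: the chat-completions request body is identical, only the host changes. A stdlib-only sketch; the vLLM and Ollama URLs are common local defaults, not guaranteed for your setup:

```python
import json

# One OpenAI-style chat-completions payload, many hosts.
# Local endpoints below are common defaults for vLLM and Ollama.
ENDPOINTS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    "vllm":   "http://localhost:8000/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
}

def build_request(provider: str, model: str, prompt: str):
    """Same body for every provider; only the URL changes."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return ENDPOINTS[provider], body

url, body = build_request("ollama", "llama3.1", "What is RAG?")
print(url)
```

In practice you would POST `body` to `url` with an HTTP client; the point is that the payload shape never changes.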
LiteLLM for Provider Abstraction
LiteLLM abstracts over 100+ models. Single call signature across all providers. Good for: cost optimization (routing), fallbacks, logging.
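Under the hood, a fallback chain is ordered retries across backends. A framework-free sketch of the idea; the backend callables are hypothetical stand-ins, not the LiteLLM API:

```python
# Provider-agnostic fallback: try each backend in order until one succeeds.
# The two callables are hypothetical stand-ins for real provider calls.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider down")

def stable_fallback(prompt: str) -> str:
    return f"answer to: {prompt}"

def complete_with_fallbacks(prompt: str, backends) -> str:
    errors = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # production code would catch provider-specific errors
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} backends failed")

print(complete_with_fallbacks("hi", [flaky_primary, stable_fallback]))
```

LiteLLM packages this loop (plus routing and logging) behind one `completion()` call, so you do not maintain it yourself.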
Model Context Protocol (MCP)
New standard from Anthropic for secure, standardized tool use. MCP defines how AI agents call external tools. Benefits:
- Vendors can offer standardized tool sets.
- Users can swap tools without rewriting agent code.
- Better security (sandboxed tool execution).
Function Calling Standards
Different models represent function calls differently. Some use JSON Schema, others use different formats. Standardization is emerging but imperfect. When building tools:
- Use JSON Schema for descriptions (most portable).
- Test with multiple models early (catch incompatibilities).
- Consider LangChain/LlamaIndex tool abstractions (they handle this).
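The first two bullets can be combined: keep one JSON Schema definition and render it into each provider's tool format. The field names below follow the public OpenAI and Anthropic tool formats:

```python
# One JSON Schema parameter definition, rendered into two provider formats.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

def to_openai(name: str, description: str, params: dict) -> dict:
    # OpenAI: {"type": "function", "function": {..., "parameters": ...}}
    return {"type": "function",
            "function": {"name": name, "description": description,
                         "parameters": params}}

def to_anthropic(name: str, description: str, params: dict) -> dict:
    # Anthropic: flat object with "input_schema"
    return {"name": name, "description": description, "input_schema": params}

oa = to_openai("get_weather", "Look up current weather", schema)
an = to_anthropic("get_weather", "Look up current weather", schema)
assert oa["function"]["parameters"] == an["input_schema"]
```

The schema itself is the portable part; only the envelope differs per provider.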
04 — Frameworks
Development Frameworks: LangChain, LlamaIndex, DSPy
GenAI development is repetitive: prompt templating, context injection, tool calling, error handling. Frameworks abstract these patterns.
LangChain
Largest ecosystem. High-level abstractions for: chains (sequences of operations), agents (loop with tool calling), RAG pipelines, memory management.
Strengths: Most mature, huge community, tons of integrations (100+ document loaders, vector stores, tools).
Weaknesses: Can feel over-abstracted. Learning curve. Hidden complexity. Heavy dependencies.
LlamaIndex
Specialized for RAG. Simple API for ingestion → indexing → querying. Great for document Q&A applications.
Strengths: Laser-focused on RAG. Simpler than LangChain. Good documentation.
Weaknesses: Less suitable for agent/tool-use workflows. Smaller community.
DSPy
Programmatic, composable. Define modular components, optimize over datasets. Treat prompts as learnable parameters.
Strengths: Elegant design. Programmatic prompt optimization (teleprompter). Great for complex workflows.
Weaknesses: Smaller community. Steeper learning curve. Less production tooling.
Comparison
| Framework | Best for | Abstraction level | Community |
|---|---|---|---|
| LangChain | General GenAI apps, agents | High | Huge |
| LlamaIndex | RAG, document Q&A | Medium | Growing |
| DSPy | Complex, optimized pipelines | Low (programmatic) | Niche |
| No framework | Full control, simple apps | None (raw SDK) | N/A |
```python
# LangChain Expression Language (LCEL) — composable chain
# pip install langchain langchain-openai langchain-community faiss-cpu
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS

# Build a retrieval-augmented chain using LCEL pipe syntax
docs = ["LangChain supports LCEL for composable chains.",
        "FAISS enables fast vector similarity search.",
        "OpenAI's gpt-4o-mini is cost-effective for chains."]
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL chain — each | is a pipe between runnables
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("What is LCEL good for?"))

# Stream the response
for chunk in chain.stream("Explain FAISS in one sentence"):
    print(chunk, end="", flush=True)
```
05 — Strategy
Choosing a Framework (or Not)
Framework choice has long-term consequences. Bet wrong and you're rewriting later.
Decision Tree
- Is this a one-off script or research? Skip frameworks. Use Anthropic SDK directly. Faster to iterate.
- Building RAG (document Q&A)? Start with LlamaIndex. Purpose-built for this. LangChain if you need flexibility.
- Building an agent (tool-using loop)? LangChain or DSPy. DSPy if you want to optimize prompts.
- Building a production system? LangChain (ecosystem, stability). Wrap with custom code for control.
- Team of one, unknown direction? No framework. Use SDK. Refactor later if needed.
Framework Lock-in Risk
Frameworks can become dead weight:
- Heavy abstraction makes it hard to use new model features.
- Upgrading breaks things. No SemVer guarantees.
- Community abandonment. (LangChain v0.1 → 0.2 was painful.)
Mitigation: keep your core logic framework-agnostic. Use frameworks for scaffolding, not core logic.
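One way to keep core logic framework-agnostic is a one-method interface that adapters implement at the edge. A minimal sketch using `typing.Protocol`; the names and the fake backend are illustrative:

```python
from typing import Protocol

# Framework-agnostic core: business logic depends on this one interface,
# not on LangChain, LlamaIndex, or a provider SDK.
class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

def summarize_ticket(llm: LLM, ticket: str) -> str:
    """Core logic: knows nothing about which framework backs `llm`."""
    return llm.complete(f"Summarize this support ticket in one line:\n{ticket}")

# Adapters live at the edge; swap them without touching core logic.
class FakeLLM:
    def complete(self, prompt: str) -> str:
        return "stub summary"

print(summarize_ticket(FakeLLM(), "App crashes on PDF export"))
```

When a framework upgrade breaks, only the adapter changes; the fake backend also makes the core logic unit-testable without API calls.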
06 — Tools
GenAI Developer Tooling Ecosystem
07 — Patterns
Common Patterns & Workflows
Typical Development Loop
1. Prototype: Write a script with the Anthropic SDK directly. No framework overhead.
```python
# DSPy: programming (not prompting) language models
# pip install dspy-ai
import dspy

lm = dspy.LM("openai/gpt-4o-mini", max_tokens=500)
dspy.configure(lm=lm)

class SupportIntent(dspy.Signature):
    """Classify customer support message intent and extract key entities."""
    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="One of: billing, technical, refund, general")
    urgency: str = dspy.OutputField(desc="One of: low, medium, high, critical")
    summary: str = dspy.OutputField(desc="One sentence summary of the issue")

# Chain of Thought module — DSPy handles prompt optimisation internally
cot = dspy.ChainOfThought(SupportIntent)
result = cot(message="App crashes when I export to PDF on mobile — been happening all week")
print(f"Intent: {result.intent}, Urgency: {result.urgency}")
print(f"Summary: {result.summary}")

# Compile (optimise prompts automatically against labelled examples)
# optimizer = dspy.MIPROv2(metric=your_metric_fn)
# optimized = optimizer.compile(cot, trainset=your_examples)
```
2. Instrument: Add MLflow tracking. Log prompts, params, and results for each experiment.
3. Evaluate: Create an eval dataset. Run Promptfoo or Ragas. Measure baseline performance.
4. Optimize: Try different prompts, models, and parameters. Log everything. Compare with MLflow.
5. Refactor to a framework: Once stable, move to LangChain/LlamaIndex for production features (memory, tool use, etc.).
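The Evaluate step above needs no tooling to get started: a labelled dataset plus an exact-match scorer is a usable baseline. A minimal sketch, with a hypothetical keyword classifier standing in for the real model call:

```python
# Minimal eval harness: exact-match scoring over a labelled dataset.
dataset = [
    ("Where is my refund?", "refund"),
    ("App won't start", "technical"),
    ("Change my plan", "billing"),
]

def classify(message: str) -> str:
    # Hypothetical stand-in; in practice this calls your model.
    keywords = {"refund": "refund", "start": "technical", "plan": "billing"}
    return next((v for k, v in keywords.items() if k in message.lower()), "general")

hits = sum(classify(msg) == label for msg, label in dataset)
print(f"exact match: {hits}/{len(dataset)} = {hits / len(dataset):.0%}")
```

Swap `classify` for a real model call and the same loop becomes your regression baseline; Promptfoo and Ragas add richer metrics on top of this pattern.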
Production Deployment Checklist
- Version everything: prompts, model versions, system prompts.
- Structured logging: every call logs model, latency, cost, outcome.
- Error handling: retries, fallbacks, circuit breakers.
- Monitoring: alerts for cost overruns, latency, failure rate.
- Evaluation: continuous eval on production data. Catch drift early.
- Cost optimization: track tokens/request, optimize prompts, consider cheaper models.
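The structured-logging item on the checklist can be as small as one JSON line per call. A stdlib-only sketch; the field names are illustrative, not a standard schema:

```python
import json
import time

def log_call(model: str, latency_ms: float, cost_usd: float, outcome: str) -> str:
    """Emit one JSON line per call -- greppable, and ingestible by any log pipeline."""
    record = {
        "ts": time.time(),
        "model": model,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "outcome": outcome,
    }
    line = json.dumps(record)
    print(line)
    return line

entry = log_call("claude-3-5-sonnet", 212.4, 0.0096, "ok")
```

With every call logged this way, the monitoring and cost-optimization items reduce to queries over these records.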
08 — Related Topics
Deep Dive into Subclusters
Development tools break into specialized domains.
09 — Further Reading
References