
Dev Tools

DevOps tools for LLM projects: MLOps, experiment tracking, and integration standards that accelerate development and ensure reproducibility.

Contents
  1. Why tooling matters
  2. MLOps pillar
  3. Integration standards
  4. Dev frameworks
  5. Choosing frameworks
  6. Tool ecosystem
  7. Patterns & examples
  8. References
01 — Foundation

Why Tooling Matters for GenAI

GenAI development is chaotic without good tooling. You're juggling: different models (OpenAI, Anthropic, open source), different model versions, prompts, parameters, evaluation results, costs. Without tracking, you lose context after switching tabs.

The Core Problems Tooling Solves

Three Pillars of GenAI DevOps

1. MLOps & experiment tracking: Version data, track experiments, log metrics, manage models. Core: MLflow, Weights & Biases, Neptune.
2. Integration standards: Abstract away provider differences. OpenAI-compatible APIs, MCP (Model Context Protocol), function calling schemas. Core: LiteLLM, Anthropic SDK, open standards.
3. Dev frameworks: Higher-level abstractions for RAG, agents, chains. Handle prompt templating, context injection, tool use. Core: LangChain, LlamaIndex, DSPy.
02 — Experiment

MLOps & Experiment Tracking

Experiment tracking for AI is non-negotiable. You need to log: prompts, parameters, results, costs, latencies, human evaluations. Track runs so you can compare "which prompt worked best?" without guessing.

What to Track

| Dimension | Why it matters | Example |
|---|---|---|
| Prompts | Small changes = big results | "Always reason step-by-step" vs without |
| Parameters | Temperature, max_tokens, stop sequences | T=0 vs T=1 for same task |
| Model version | Claude 3.5 Sonnet vs Haiku | Latency, cost, quality tradeoff |
| Evaluation scores | How good is this output? Auto or human | BLEU, exact match, human rating |
| Costs | Track API spend per experiment | $0.05 per request vs $0.50 |
| Latency | User experience depends on it | P50=200ms, P99=800ms |

MLflow for LLM Experiments

MLflow is the most popular open-source experiment tracker. Track runs, compare results, and register model artifacts. Example:

```python
import mlflow
import mlflow.anthropic
from anthropic import Anthropic

mlflow.set_experiment("rag-pipeline-v2")
mlflow.anthropic.autolog()  # auto-trace Anthropic API calls
client = Anthropic()

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({
        "model": "claude-3-5-sonnet-20241022",
        "temperature": 0.0,
        "chunk_size": 512,
        "top_k": 5,
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "What is RAG?"}],
    )
    mlflow.log_metrics({
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    mlflow.log_text(response.content[0].text, "response.txt")
```

Alternative Tools

03 — Abstraction

Integration Standards & API Abstraction

Different model providers have different APIs, parameters, and capabilities. OpenAI uses function_calling. Anthropic uses tool_use. Open-source models often support neither. Building on top of one provider locks you in.

OpenAI-Compatible APIs

The OpenAI API has become a de facto standard, and many providers now offer OpenAI-compatible endpoints.

Benefit: write once, swap providers by changing an endpoint URL. Drawback: lowest-common-denominator features (can't use cutting-edge provider-specific features).
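To make "swap providers by changing a URL" concrete, here is a minimal sketch of the request shape an OpenAI-compatible chat endpoint accepts. The `chat_payload` helper and the endpoint URLs are illustrative, not part of any SDK:

```python
import json

# The OpenAI-compatible request shape: any server that accepts this payload
# at /v1/chat/completions can stand in for OpenAI.
def chat_payload(model: str, user_msg: str, temperature: float = 0.0) -> dict:
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": user_msg}],
    }

# Same payload, different endpoints -- only the URL and model name change.
ENDPOINTS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    "local-vllm": "http://localhost:8000/v1/chat/completions",
}

payload = chat_payload("gpt-4o-mini", "What is RAG?")
print(json.dumps(payload, indent=2))
```

Because only the base URL differs, the same client code can target a hosted provider in production and a local vLLM server in development.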

LiteLLM for Provider Abstraction

LiteLLM abstracts over 100+ models. Single call signature across all providers. Good for: cost optimization (routing), fallbacks, logging.
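The fallback part of that value proposition is itself a simple pattern, sketched here in pure Python. The provider functions are stand-ins for illustration, not LiteLLM APIs; LiteLLM implements the real version with retries and cost tracking built in:

```python
# Fallback routing sketch: try providers in priority order and return the
# first successful result. The callables below are fakes for illustration.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")

def stable_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"

def complete_with_fallback(prompt, providers):
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # collect the failure and try the next one
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

name, answer = complete_with_fallback(
    "What is RAG?", [("primary", flaky_primary), ("fallback", stable_fallback)]
)
print(name, "->", answer)  # fallback -> fallback answer to: What is RAG?
```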

Model Context Protocol (MCP)

New standard from Anthropic for secure, standardized tool use. MCP defines how AI agents discover and call external tools through a uniform protocol.
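A rough sketch of how a tool looks to an MCP client: each tool is advertised with a name, a description, and a JSON Schema for its inputs. Field names follow the MCP spec as published; the `search_docs` tool itself is invented for illustration:

```python
# A minimal MCP-style tool descriptor. Because every server describes its
# tools this way, a client can list and present them to the model without
# per-server glue code. (search_docs is a made-up example tool.)
search_docs_tool = {
    "name": "search_docs",
    "description": "Search internal documentation and return matching passages.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

for field in ("name", "description", "inputSchema"):
    assert field in search_docs_tool
print(search_docs_tool["name"])
```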

Function Calling Standards

Different models represent function calls differently. Most use JSON Schema for parameter definitions, but the surrounding request format varies by provider. Standardization is emerging but imperfect. When building tools, define them once and adapt to each provider's format at the boundary.
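A common mitigation is to keep one canonical tool definition and adapt it per provider at the boundary. The shapes below follow the public OpenAI and Anthropic tool-use formats as of this writing; verify against current API references before relying on them:

```python
# One canonical tool definition (JSON Schema for parameters), adapted to two
# provider-specific formats at the integration boundary.
TOOL = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def to_openai(tool: dict) -> dict:
    # OpenAI wraps the schema under a "function" key
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }

def to_anthropic(tool: dict) -> dict:
    # Anthropic uses a flat object with "input_schema"
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],  # same JSON Schema, different key
    }

print(to_openai(TOOL)["function"]["name"])
print(to_anthropic(TOOL)["input_schema"]["required"])
```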

04 — Frameworks

Development Frameworks: LangChain, LlamaIndex, DSPy

GenAI development is repetitive: prompt templating, context injection, tool calling, error handling. Frameworks abstract these patterns.

LangChain

Largest ecosystem. High-level abstractions for: chains (sequences of operations), agents (loop with tool calling), RAG pipelines, memory management.

Strengths: Most mature, huge community, tons of integrations (100+ document loaders, vector stores, tools).

Weaknesses: Can feel over-abstracted. Learning curve. Hidden complexity. Heavy dependencies.

LlamaIndex

Specialized for RAG. Simple API for ingestion → indexing → querying. Great for document Q&A applications.

Strengths: Laser-focused on RAG. Simpler than LangChain. Good documentation.

Weaknesses: Less suitable for agent/tool-use workflows. Smaller community.
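The ingestion → indexing → querying flow that LlamaIndex wraps can be sketched framework-free. This toy version scores documents by keyword overlap instead of embeddings, purely to show the shape of the pipeline:

```python
# Toy RAG pipeline: ingest -> index -> query. Real systems replace the
# keyword-overlap scoring below with embeddings and a vector store.
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

# Ingestion: load documents
docs = [
    "LlamaIndex is specialized for RAG pipelines.",
    "FAISS enables fast vector similarity search.",
    "Experiment tracking logs prompts, params, and costs.",
]

# Indexing: precompute a per-document representation
index = [(doc, tokenize(doc)) for doc in docs]

# Querying: rank documents against the question, return the best matches
def query(question: str, top_k: int = 1) -> list[str]:
    q = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(query("What is specialized for RAG?"))
```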

DSPy

Programmatic, composable. Define modular components, optimize over datasets. Treat prompts as learnable parameters.

Strengths: Elegant design. Programmatic prompt optimization (teleprompter). Great for complex workflows.

Weaknesses: Smaller community. Steeper learning curve. Less production tooling.

Comparison

| Framework | Best for | Abstraction level | Community |
|---|---|---|---|
| LangChain | General GenAI apps, agents | High | Huge |
| LlamaIndex | RAG, document Q&A | Medium | Growing |
| DSPy | Complex, optimized pipelines | Low (programmatic) | Niche |
| No framework | Full control, simple apps | None (raw SDK) | N/A |

```python
# LangChain Expression Language (LCEL) — composable chain
# pip install langchain langchain-openai langchain-community
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS

# Build a retrieval-augmented chain using LCEL pipe syntax
docs = [
    "LangChain supports LCEL for composable chains.",
    "FAISS enables fast vector similarity search.",
    "OpenAI's gpt-4o-mini is cost-effective for chains.",
]
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL chain — each | is a pipe between runnables
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is LCEL good for?"))

# Stream the response
for chunk in chain.stream("Explain FAISS in one sentence"):
    print(chunk, end="", flush=True)
```
05 — Strategy

Choosing a Framework (or Not)

Framework choice has long-term consequences. Bet wrong and you're rewriting later.

Decision Tree

Framework Lock-in Risk

Frameworks can become dead weight as your needs outgrow their abstractions.

Mitigation: keep your core logic framework-agnostic. Use frameworks for scaffolding, not core logic.
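A minimal sketch of that mitigation, assuming nothing beyond the standard library: core logic depends on a tiny interface, and any framework (or raw SDK) plugs in at the edge:

```python
# Framework-agnostic core: business logic depends only on a small Protocol,
# so frameworks are swappable adapters rather than load-bearing dependencies.
from typing import Protocol

class Completer(Protocol):
    def complete(self, prompt: str) -> str: ...

def summarize_ticket(ticket: str, llm: Completer) -> str:
    """Core logic: no framework imports, trivially testable with a fake."""
    return llm.complete(f"Summarize this support ticket in one line: {ticket}")

# Adapter layer (one per framework/SDK); a fake is enough for unit tests.
class FakeLLM:
    def complete(self, prompt: str) -> str:
        return f"[fake summary of {len(prompt)} chars]"

print(summarize_ticket("App crashes on PDF export", FakeLLM()))
```

Swapping LangChain for the raw Anthropic SDK then means rewriting an adapter class, not the core.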

06 — Tools

GenAI Developer Tooling Ecosystem

- MLflow (Experiment Tracking): Open-source experiment tracking. Track runs, compare results, register models. Industry standard.
- Weights & Biases (Experiment Tracking): Cloud experiment tracker. Real-time dashboards, team collaboration, beautiful UI. Paid.
- LangChain (Framework): Largest LLM framework. Chains, agents, RAG, memory. Huge ecosystem of integrations.
- LlamaIndex (RAG Framework): Specialized for RAG. Document ingestion, indexing, querying. Simple and focused.
- DSPy (Optimization Framework): Programmatic prompt optimization. Modular pipelines, learnable prompts. Novel approach.
- LiteLLM (Provider Abstraction): Unified interface for 100+ models. Cost tracking, fallbacks, logging.
- Anthropic SDK (Official SDK): Official Python/JS SDK for Claude. Simple, well-documented, recommended.
- Promptfoo (Prompt Evaluation): Test and evaluate prompts. Compare versions, run evals, track results.
- Ragas (RAG Evaluation): Benchmark RAG systems. Metrics for retrieval quality, answer relevance, factuality.
- OpenAI Evals (Evaluation Framework): Framework for LLM evals. Define metrics, run experiments, track progress.
- Hugging Face Hub (Model Registry): Host models, datasets, spaces. Central hub for the HF ecosystem. Free and paid tiers.
- vLLM (Inference Engine): High-performance LLM serving. Paged attention, KV cache optimization. Best for open-source models.
07 — Patterns

Common Patterns & Workflows

Typical Development Loop

1. Prototype: Write a script with the Anthropic SDK directly. No framework overhead.

```python
# DSPy: programming (not prompting) language models
# pip install dspy-ai
import dspy

lm = dspy.LM("openai/gpt-4o-mini", max_tokens=500)
dspy.configure(lm=lm)

class SupportIntent(dspy.Signature):
    """Classify customer support message intent and extract key entities."""
    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="One of: billing, technical, refund, general")
    urgency: str = dspy.OutputField(desc="One of: low, medium, high, critical")
    summary: str = dspy.OutputField(desc="One sentence summary of the issue")

# Chain of Thought module — DSPy handles prompt optimisation internally
cot = dspy.ChainOfThought(SupportIntent)
result = cot(message="App crashes when I export to PDF on mobile — been happening all week")
print(f"Intent: {result.intent}, Urgency: {result.urgency}")
print(f"Summary: {result.summary}")

# Compile (optimise prompts automatically against labelled examples)
# optimizer = dspy.MIPROv2(metric=your_metric_fn)
# optimized = optimizer.compile(cot, trainset=your_examples)
```
2. Instrument: Add MLflow tracking. Log prompts, params, results for each experiment.
3. Evaluate: Create an eval dataset. Run Promptfoo or Ragas. Measure baseline performance.
4. Optimize: Try different prompts, models, parameters. Log everything. Compare with MLflow.
5. Refactor to framework: Once stable, move to LangChain/LlamaIndex for production features (memory, tool use, etc.).

Production Deployment Checklist

08 — Related Topics

Deep Dive into Subclusters

Development tools break into specialized domains.

09 — Further Reading

References

Documentation & Guides
Practitioner Writing