
Dev Tools

DevOps tools for LLM projects: MLOps, experiment tracking, and integration standards that accelerate development and ensure reproducibility.

Contents
  1. Why tooling matters
  2. MLOps pillar
  3. Integration standards
  4. Dev frameworks
  5. Choosing frameworks
  6. Tool ecosystem
  7. Patterns & examples
  8. References
01 — Foundation

Why Tooling Matters for GenAI

GenAI development is chaotic without good tooling. You're juggling: different models (OpenAI, Anthropic, open source), different model versions, prompts, parameters, evaluation results, costs. Without tracking, you lose context after switching tabs.

The Core Problems Tooling Solves

Three Pillars of GenAI DevOps

1. MLOps & experiment tracking: Version data, track experiments, log metrics, manage models. Core: MLflow, Weights & Biases, Neptune.
2. Integration standards: Abstract away provider differences. OpenAI-compatible APIs, MCP (Model Context Protocol), function calling schemas. Core: LiteLLM, Anthropic SDK, open standards.
3. Dev frameworks: Higher-level abstractions for RAG, agents, chains. Handle prompt templating, context injection, tool use. Core: LangChain, LlamaIndex, DSPy.
02 — Experiment

MLOps & Experiment Tracking

Experiment tracking for AI is non-negotiable. You need to log: prompts, parameters, results, costs, latencies, human evaluations. Track runs so you can compare "which prompt worked best?" without guessing.

What to Track

| Dimension | Why it matters | Example |
|---|---|---|
| Prompts | Small changes = big results | "Always reason step-by-step" vs without |
| Parameters | Temperature, max_tokens, stop sequences | T=0 vs T=1 for same task |
| Model version | Claude 3.5 Sonnet vs Haiku | Latency, cost, quality tradeoff |
| Evaluation scores | How good is this output? Auto or human | BLEU, exact match, human rating |
| Costs | Track API spend per experiment | $0.05 per request vs $0.50 |
| Latency | User experience depends on it | P50=200ms, P99=800ms |

MLflow for LLM Experiments

MLflow is the most popular open-source experiment tracker. Track runs, compare results, and register model artifacts. Example:

```python
import mlflow
import mlflow.anthropic
from anthropic import Anthropic

mlflow.set_experiment("rag-pipeline-v2")
mlflow.anthropic.autolog()  # auto-trace Anthropic API calls
client = Anthropic()

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({
        "model": "claude-3-5-sonnet-20241022",
        "temperature": 0.0,
        "chunk_size": 512,
        "top_k": 5,
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "What is RAG?"}],
    )
    mlflow.log_metrics({
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    mlflow.log_text(response.content[0].text, "response.txt")
```

Alternative Tools

03 — Abstraction

Integration Standards & API Abstraction

Different model providers have different APIs, parameters, and capabilities. OpenAI uses function_calling. Anthropic uses tool_use. Open-source models often support neither. Building on top of one provider locks you in.

OpenAI-Compatible APIs

The OpenAI API has become a de facto standard, and many providers now offer OpenAI-compatible endpoints.

Benefit: write once, swap providers by changing an endpoint URL. Drawback: lowest-common-denominator features (can't use cutting-edge provider-specific features).
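To make "swap providers by changing a URL" concrete, here is a minimal sketch of the request shape an OpenAI-compatible chat endpoint accepts. The `chat_payload` helper and the endpoint URLs are illustrative, not part of any SDK:

```python
import json

# The OpenAI-compatible request shape: any server that accepts this payload
# at /v1/chat/completions can stand in for OpenAI.
def chat_payload(model: str, user_msg: str, temperature: float = 0.0) -> dict:
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": user_msg}],
    }

# Same payload, different endpoints -- only the URL and model name change.
ENDPOINTS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    "local-vllm": "http://localhost:8000/v1/chat/completions",
}

payload = chat_payload("gpt-4o-mini", "What is RAG?")
print(json.dumps(payload, indent=2))
```

Because only the base URL differs, the same client code can target a hosted provider in production and a local vLLM server in development.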

LiteLLM for Provider Abstraction

LiteLLM abstracts over 100+ models. Single call signature across all providers. Good for: cost optimization (routing), fallbacks, logging.
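The fallback part of that value proposition is itself a simple pattern, sketched here in pure Python. The provider functions are stand-ins for illustration, not LiteLLM APIs; LiteLLM implements the real version with retries and cost tracking built in:

```python
# Fallback routing sketch: try providers in priority order and return the
# first successful result. The callables below are fakes for illustration.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")

def stable_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"

def complete_with_fallback(prompt, providers):
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # collect the failure and try the next one
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

name, answer = complete_with_fallback(
    "What is RAG?", [("primary", flaky_primary), ("fallback", stable_fallback)]
)
print(name, "->", answer)  # fallback -> fallback answer to: What is RAG?
```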

Model Context Protocol (MCP)

New standard from Anthropic for secure, standardized tool use. MCP defines how AI agents discover and call external tools through a uniform protocol.
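A rough sketch of how a tool looks to an MCP client: each tool is advertised with a name, a description, and a JSON Schema for its inputs. Field names follow the MCP spec as published; the `search_docs` tool itself is invented for illustration:

```python
# A minimal MCP-style tool descriptor. Because every server describes its
# tools this way, a client can list and present them to the model without
# per-server glue code. (search_docs is a made-up example tool.)
search_docs_tool = {
    "name": "search_docs",
    "description": "Search internal documentation and return matching passages.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

for field in ("name", "description", "inputSchema"):
    assert field in search_docs_tool
print(search_docs_tool["name"])
```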

Function Calling Standards

Different models represent function calls differently. Most use JSON Schema for parameter definitions, but the surrounding request format varies by provider. Standardization is emerging but imperfect. When building tools, define them once and adapt to each provider's format at the boundary.
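A common mitigation is to keep one canonical tool definition and adapt it per provider at the boundary. The shapes below follow the public OpenAI and Anthropic tool-use formats as of this writing; verify against current API references before relying on them:

```python
# One canonical tool definition (JSON Schema for parameters), adapted to two
# provider-specific formats at the integration boundary.
TOOL = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def to_openai(tool: dict) -> dict:
    # OpenAI wraps the schema under a "function" key
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }

def to_anthropic(tool: dict) -> dict:
    # Anthropic uses a flat object with "input_schema"
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],  # same JSON Schema, different key
    }

print(to_openai(TOOL)["function"]["name"])
print(to_anthropic(TOOL)["input_schema"]["required"])
```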

04 — Frameworks

Development Frameworks: LangChain, LlamaIndex, DSPy

GenAI development is repetitive: prompt templating, context injection, tool calling, error handling. Frameworks abstract these patterns.

LangChain

Largest ecosystem. High-level abstractions for: chains (sequences of operations), agents (loop with tool calling), RAG pipelines, memory management.

Strengths: Most mature, huge community, tons of integrations (100+ document loaders, vector stores, tools).

Weaknesses: Can feel over-abstracted. Learning curve. Hidden complexity. Heavy dependencies.

LlamaIndex

Specialized for RAG. Simple API for ingestion → indexing → querying. Great for document Q&A applications.

Strengths: Laser-focused on RAG. Simpler than LangChain. Good documentation.

Weaknesses: Less suitable for agent/tool-use workflows. Smaller community.
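The ingestion → indexing → querying flow that LlamaIndex wraps can be sketched framework-free. This toy version scores documents by keyword overlap instead of embeddings, purely to show the shape of the pipeline:

```python
# Toy RAG pipeline: ingest -> index -> query. Real systems replace the
# keyword-overlap scoring below with embeddings and a vector store.
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

# Ingestion: load documents
docs = [
    "LlamaIndex is specialized for RAG pipelines.",
    "FAISS enables fast vector similarity search.",
    "Experiment tracking logs prompts, params, and costs.",
]

# Indexing: precompute a per-document representation
index = [(doc, tokenize(doc)) for doc in docs]

# Querying: rank documents against the question, return the best matches
def query(question: str, top_k: int = 1) -> list[str]:
    q = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(query("What is specialized for RAG?"))
```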

DSPy

Programmatic, composable. Define modular components, optimize over datasets. Treat prompts as learnable parameters.

Strengths: Elegant design. Programmatic prompt optimization (teleprompter). Great for complex workflows.

Weaknesses: Smaller community. Steeper learning curve. Less production tooling.

Comparison

| Framework | Best for | Abstraction level | Community |
|---|---|---|---|
| LangChain | General GenAI apps, agents | High | Huge |
| LlamaIndex | RAG, document Q&A | Medium | Growing |
| DSPy | Complex, optimized pipelines | Low (programmatic) | Niche |
| No framework | Full control, simple apps | None (raw SDK) | N/A |

```python
# LangChain Expression Language (LCEL) — composable chain
# pip install langchain langchain-openai langchain-community
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS

# Build a retrieval-augmented chain using LCEL pipe syntax
docs = [
    "LangChain supports LCEL for composable chains.",
    "FAISS enables fast vector similarity search.",
    "OpenAI's gpt-4o-mini is cost-effective for chains.",
]
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL chain — each | is a pipe between runnables
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is LCEL good for?"))

# Stream the response
for chunk in chain.stream("Explain FAISS in one sentence"):
    print(chunk, end="", flush=True)
```
05 — Strategy

Choosing a Framework (or Not)

Framework choice has long-term consequences. Bet wrong and you're rewriting later.

Decision Tree

Framework Lock-in Risk

Frameworks can become dead weight as your needs outgrow their abstractions.

Mitigation: keep your core logic framework-agnostic. Use frameworks for scaffolding, not core logic.
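A minimal sketch of that mitigation, assuming nothing beyond the standard library: core logic depends on a tiny interface, and any framework (or raw SDK) plugs in at the edge:

```python
# Framework-agnostic core: business logic depends only on a small Protocol,
# so frameworks are swappable adapters rather than load-bearing dependencies.
from typing import Protocol

class Completer(Protocol):
    def complete(self, prompt: str) -> str: ...

def summarize_ticket(ticket: str, llm: Completer) -> str:
    """Core logic: no framework imports, trivially testable with a fake."""
    return llm.complete(f"Summarize this support ticket in one line: {ticket}")

# Adapter layer (one per framework/SDK); a fake is enough for unit tests.
class FakeLLM:
    def complete(self, prompt: str) -> str:
        return f"[fake summary of {len(prompt)} chars]"

print(summarize_ticket("App crashes on PDF export", FakeLLM()))
```

Swapping LangChain for the raw Anthropic SDK then means rewriting an adapter class, not the core.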

06 — Tools

GenAI Developer Tooling Ecosystem

- MLflow (Experiment Tracking): Open-source experiment tracking. Track runs, compare results, register models. Industry standard.
- Weights & Biases (Experiment Tracking): Cloud experiment tracker. Real-time dashboards, team collaboration, beautiful UI. Paid.
- LangChain (Framework): Largest LLM framework. Chains, agents, RAG, memory. Huge ecosystem of integrations.
- LlamaIndex (RAG Framework): Specialized for RAG. Document ingestion, indexing, querying. Simple and focused.
- DSPy (Optimization Framework): Programmatic prompt optimization. Modular pipelines, learnable prompts. Novel approach.
- LiteLLM (Provider Abstraction): Unified interface for 100+ models. Cost tracking, fallbacks, logging.
- Anthropic SDK (Official SDK): Official Python/JS SDK for Claude. Simple, well-documented, recommended.
- Promptfoo (Prompt Evaluation): Test and evaluate prompts. Compare versions, run evals, track results.
- Ragas (RAG Evaluation): Benchmark RAG systems. Metrics for retrieval quality, answer relevance, factuality.
- OpenAI Evals (Evaluation Framework): Framework for LLM evals. Define metrics, run experiments, track progress.
- Hugging Face Hub (Model Registry): Host models, datasets, spaces. Central hub for the HF ecosystem. Free and paid tiers.
- vLLM (Inference Engine): High-performance LLM serving. Paged attention, KV cache optimization. Best for open-source models.
07 — Patterns

Common Patterns & Workflows

Typical Development Loop

1. Prototype: Write a script with the Anthropic SDK directly. No framework overhead.

```python
# DSPy: programming (not prompting) language models
# pip install dspy-ai
import dspy

lm = dspy.LM("openai/gpt-4o-mini", max_tokens=500)
dspy.configure(lm=lm)

class SupportIntent(dspy.Signature):
    """Classify customer support message intent and extract key entities."""
    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="One of: billing, technical, refund, general")
    urgency: str = dspy.OutputField(desc="One of: low, medium, high, critical")
    summary: str = dspy.OutputField(desc="One sentence summary of the issue")

# Chain of Thought module — DSPy handles prompt optimisation internally
cot = dspy.ChainOfThought(SupportIntent)
result = cot(message="App crashes when I export to PDF on mobile — been happening all week")
print(f"Intent: {result.intent}, Urgency: {result.urgency}")
print(f"Summary: {result.summary}")

# Compile (optimise prompts automatically against labelled examples)
# optimizer = dspy.MIPROv2(metric=your_metric_fn)
# optimized = optimizer.compile(cot, trainset=your_examples)
```
2. Instrument: Add MLflow tracking. Log prompts, params, results for each experiment.
3. Evaluate: Create an eval dataset. Run Promptfoo or Ragas. Measure baseline performance.
4. Optimize: Try different prompts, models, parameters. Log everything. Compare with MLflow.
5. Refactor to framework: Once stable, move to LangChain/LlamaIndex for production features (memory, tool use, etc.).

Production Deployment Checklist

08 — Related Topics

Deep Dive into Subclusters

Development tools break into specialized domains.

09 — Further Reading

References

Documentation & Guides
Practitioner Writing