Combining LLMs with retrieval, tools, and verifiers — LMQL, DSPy, and production system patterns
A single LLM call is rarely enough for production systems. Compound AI combines multiple LLMs, retrieval engines, tool calls, and verification loops into unified workflows.
Single-LLM limitations:
- No access to real-time data.
- No ability to use external tools.
- No correction loop if the output is wrong.
- Context-window constraints for long documents.
- High latency if reasoning is needed before responding.

Compound patterns address these:
- Chain retrieval before generation (RAG).
- Route different queries to different models.
- Call tools and integrate the results.
- Verify outputs and retry on failure.
- Decompose problems into multi-step LLM calls.
| Pattern | Use Case | Complexity | Latency | Cost |
|---|---|---|---|---|
| RAG (Retrieve then Generate) | QA over docs, facts from external source | Low | Medium | Low |
| Tool-augmented (Agent + tools) | Calculator, search, API calls, reasoning | Medium | High | Medium |
| Verifier chains (Generate + verify) | Code generation, math, factual outputs | Medium | High | Medium |
| Ensemble (Multiple models) | High-stakes decisions, diversity of views | Medium | High | High |
| Multi-step (Decompose problem) | Complex reasoning, hierarchical planning | High | High | Medium |
Choosing a pattern:
- RAG: static knowledge, document search, Q&A. Simplest and lowest cost.
- Tool-augmented: need real-time info, calculations, or external APIs.
- Verifiers: correctness matters and the output is easy to validate.
- Ensemble: high stakes; consensus needed.
- Multi-step: complex decomposition with clear subtasks.
Tools like DSPy and LMQL provide higher-level abstractions for building compound systems. Instead of managing prompts and parsing, you define the computation graph.
DSPy treats LLM calls as modules with typed signatures. Chain modules together. Optimize prompts automatically via demonstrations. Composable and testable.
Key benefits: Type-safe. Modular. Composable. Automatic optimization via in-context learning. Great for research and structured problems.
Not all queries need the same model or workflow. Smart routing inspects each query and dispatches it to the cheapest model or pipeline that can handle it.
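A router does not require any framework. The sketch below uses a keyword matcher as a stand-in classifier and hypothetical model names (`small-reasoning-model`, etc.); production routers typically replace the matcher with a small LLM or an embedding classifier.

```python
# Minimal routing sketch. The keyword matcher and model names are
# illustrative stand-ins, not a real provider's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Route:
    name: str
    model: str                          # hypothetical model identifier
    matches: Callable[[str], bool]      # predicate over the raw query

ROUTES: List[Route] = [
    Route("math", "small-reasoning-model",
          lambda q: any(t in q.lower() for t in ("calculate", "sum", "%"))),
    Route("docs", "rag-pipeline",
          lambda q: "according to" in q.lower() or "document" in q.lower()),
]
DEFAULT = Route("general", "general-chat-model", lambda q: True)

def route(query: str) -> Route:
    """Dispatch to the first matching route, falling back to the default."""
    for r in ROUTES:
        if r.matches(query):
            return r
    return DEFAULT
```

The key design choice is ordering: routes are checked cheapest-to-match first, and the default catches everything else so no query is dropped.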
```python
import dspy

# Configure the LLM backend
lm = dspy.LM("openai/gpt-4o-mini", max_tokens=1024)
dspy.configure(lm=lm)

# Define signatures (input/output contracts)
class Summarize(dspy.Signature):
    """Summarize a long document into key points."""
    document: str = dspy.InputField()
    summary: str = dspy.OutputField(desc="3-5 bullet points")

class AnswerFromContext(dspy.Signature):
    """Answer a question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0-1.0")

# Build a compound pipeline from modules
class RAGPipeline(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.summarize = dspy.ChainOfThought(Summarize)
        self.answer = dspy.ChainOfThought(AnswerFromContext)

    def forward(self, question: str) -> dspy.Prediction:
        # Retrieve relevant docs
        docs = self.retriever(question)
        context = "\n".join(docs)
        # Summarize the context first (reduces noise)
        summary = self.summarize(document=context).summary
        # Answer from the summarized context
        return self.answer(context=summary, question=question)

# DSPy optimizes prompts automatically given a training set:
# optimizer = dspy.BootstrapFewShot(metric=answer_exact_match)
# optimized_rag = optimizer.compile(RAGPipeline(retriever), trainset=train_data)

# Run the pipeline:
# prediction = optimized_rag(question="What is the capital of France?")
# print(f"Answer: {prediction.answer} ({prediction.confidence:.0%})")
```
Verifier chains generate an output, then check it; if the check fails, they retry or refine. Useful for code, math, and factual claims.
Verification strategies:
- Self-check: the same model judges its own output. Cheaper but less reliable.
- Critic model: a separate model critiques the output. More reliable but higher cost.
- External validators: run the code, check facts against a DB, validate structure.
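The external-validator variant can be sketched as a plain retry loop. Here `generate` is a placeholder for an LLM call (any callable works), and the validator simply checks that a candidate parses as valid Python; real pipelines would also run tests or check facts.

```python
# Generate-then-verify loop with an external validator.
# `generate` stands in for an LLM call; swap in a real client in production.
from typing import Callable, Optional

def compiles(code: str) -> bool:
    """External validator: does the candidate parse as valid Python?"""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def generate_with_verification(
    generate: Callable[[str, int], str],   # (prompt, attempt) -> candidate
    prompt: str,
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry generation until the verifier accepts or attempts run out."""
    for attempt in range(max_attempts):
        candidate = generate(prompt, attempt)
        if verify(candidate):
            return candidate
    return None  # caller decides: escalate, fall back, or surface an error
```

Returning `None` rather than the last failing candidate forces the caller to handle the failure explicitly, which matters when downstream code would otherwise execute unverified output.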
Compound systems add latency. Each LLM call: 500ms–2s. Each retrieval: 50–500ms. Each tool call: 100ms–10s. Design within your budget. For interactive systems, target <1s for standard requests.
Compound systems multiply costs. RAG: retrieval + generation. Routing: classification + handling. Verification: multiple LLM calls. Track cost per query. Optimize hot paths. Use cheaper models where acceptable.
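Cost tracking is mostly arithmetic over token counts. The sketch below uses illustrative per-million-token prices (not any vendor's real pricing) to total the LLM calls in one compound query.

```python
# Back-of-envelope cost accounting per query.
# Prices are hypothetical placeholders in USD per million tokens.
from typing import List, Tuple

PRICE_PER_M_TOKENS = {
    "small": {"in": 0.15, "out": 0.60},
    "large": {"in": 2.50, "out": 10.00},
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of a single LLM call: input and output tokens priced separately."""
    p = PRICE_PER_M_TOKENS[model]
    return tokens_in / 1e6 * p["in"] + tokens_out / 1e6 * p["out"]

def query_cost(calls: List[Tuple[str, int, int]]) -> float:
    """Sum the cost of every LLM call in one compound query."""
    return sum(call_cost(m, t_in, t_out) for m, t_in, t_out in calls)

# A RAG query: one summarization call plus one answer call
rag_calls = [("small", 4000, 300), ("small", 1200, 150)]
```

Logging `(model, tokens_in, tokens_out)` per call is enough to reconstruct per-query cost later, which makes it easy to spot hot paths worth moving to a cheaper model.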
Log every LLM call, tool invocation, and decision point. Track latency and cost breakdown. Identify bottlenecks. Monitor error rates by component. Alert on high-latency or high-cost queries.
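Per-component instrumentation can be as simple as a decorator that times each stage and logs failures. This is a minimal sketch with the standard library; the `retrieve` function is a hypothetical stand-in for a real retrieval step.

```python
# Per-component latency logging via a decorator, stdlib only.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def traced(component: str):
    """Log wall-clock latency and errors for one pipeline component."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                log.exception("%s failed", component)
                raise
            finally:
                ms = (time.perf_counter() - start) * 1000
                log.info("%s took %.1f ms", component, ms)
        return inner
    return wrap

@traced("retrieval")
def retrieve(query: str) -> list:
    time.sleep(0.01)            # stand-in for a vector search
    return ["doc1", "doc2"]
```

Because every stage is wrapped the same way, latency breakdowns fall out of the logs directly; shipping the same timings to a metrics backend enables the alerting described above.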