Architecture · Systems

Compound AI Systems

Combining LLMs with retrieval, tools, and verifiers — LMQL, DSPy, and production system patterns

Contents
  1. What is compound AI
  2. Core patterns
  3. LM programming
  4. Router patterns
  5. Verifier loops
  6. Production considerations
  7. Tools & frameworks
  8. References
01 — Definition

What Is Compound AI

A single LLM call is rarely enough for production systems. Compound AI combines multiple LLMs, retrieval engines, tool calls, and verification loops into unified workflows.

Single LLM limitations: No access to real-time data. No ability to use external tools. No correction loop if output is wrong. Context window constraints for long documents. High latency if you need reasoning before responding.

Compound patterns solve this: Chain retrieval before generation (RAG). Route different queries to different models. Call tools and integrate results. Verify outputs and retry if wrong. Decompose problems into multi-step LLM calls.

💡 Compound AI is the norm in production. Most shipped LLM systems aren't a single model. They're orchestrations of retrieval, routing, verification, and tool use.
02 — Architectures

Core Patterns

Pattern | Use Case | Complexity | Latency | Cost
RAG (retrieve then generate) | QA over docs, facts from an external source | Low | Medium | Low
Tool-augmented (agent + tools) | Calculator, search, API calls, reasoning | Medium | High | Medium
Verifier chains (generate + verify) | Code generation, math, factual outputs | Medium | High | Medium
Ensemble (multiple models) | High-stakes decisions, diversity of views | Medium | High | High
Multi-step (decompose problem) | Complex reasoning, hierarchical planning | High | High | Medium

When to Use Each

  • RAG: Static knowledge, document search, Q&A. Simplest and lowest cost.
  • Tool-augmented: Need real-time info, calculations, or external APIs.
  • Verifiers: Correctness matters and the output is easy to validate.
  • Ensemble: High stakes; need consensus across models.
  • Multi-step: Complex decomposition with clear subtasks.

⚠️ Composition adds latency. Each LLM call adds 500ms–2s. RAG adds retrieval time. Tool calls wait for external APIs. Design for your latency budget.
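To make the budget concrete, here is a back-of-envelope estimate for a pipeline with one retrieval step and two sequential LLM calls. All figures are rough assumptions (retrieval 50–500 ms, each LLM call 500 ms–2 s), not measurements:

```python
# Rough sequential latency estimate for a retrieve + two-LLM-call pipeline.
# Per-step (min, max) latencies in seconds — illustrative assumptions only.
STEP_LATENCY_S = {
    "retrieval": (0.05, 0.5),
    "llm_call_1": (0.5, 2.0),
    "llm_call_2": (0.5, 2.0),
}

lo = sum(a for a, _ in STEP_LATENCY_S.values())
hi = sum(b for _, b in STEP_LATENCY_S.values())
print(f"expected end-to-end latency: {lo:.2f}-{hi:.2f} s")
```

Even in the best case this pipeline cannot meet a sub-second budget sequentially, which is why steps are often parallelized or cached.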
03 — Abstraction

LM Programming Frameworks

Tools like DSPy and LMQL provide higher-level abstractions for building compound systems. Instead of managing prompts and parsing, you define the computation graph.

DSPy: Modular LLM Programs

DSPy treats LLM calls as modules with typed signatures. Chain modules together. Optimize prompts automatically via demonstrations. Composable and testable.

import dspy

class GenerateAnswer(dspy.Signature):
    """Answer a question about the document."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short, factual answer")

class RAGSystem(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(
            context="\n".join(context),
            question=question,
        )
        return prediction

# Usage
rag = RAGSystem()
result = rag.forward("What is RAG?")
print(result.answer)

Key benefits: Type-safe. Modular. Composable. Automatic optimization via in-context learning. Great for research and structured problems.

04 — Smart Dispatch

Router Patterns

Not all queries need the same model or workflow. Smart routing dispatches based on query characteristics.

Routing Strategies

Semantic Routing

  • Embed query, compare to examples
  • Route to specialized handler
  • Use for domain-specific tasks

LLM Routing

  • LLM classifies query type
  • Route to best handler
  • More flexible, slower

Capability Routing

  • Route to model that excels at task
  • Fast model for simple, slow for complex
  • Minimize cost

Load Routing

  • Route to least busy endpoint
  • Balance across multiple replicas
  • Maximize throughput
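Capability routing can be as simple as a heuristic that sends short, simple queries to a cheap model and long or reasoning-heavy queries to a stronger one. A minimal sketch — the model names and thresholds are illustrative, not real endpoints:

```python
def pick_model(query: str) -> str:
    """Capability routing: cheap model for simple queries, strong model otherwise."""
    # Markers that suggest the query needs heavier reasoning (illustrative list)
    hard_markers = ("prove", "derive", "step by step", "analyze")
    q = query.lower()
    if len(query.split()) > 50 or any(marker in q for marker in hard_markers):
        return "large-reasoning-model"  # hypothetical model name
    return "small-fast-model"           # hypothetical model name
```

In production the heuristic is usually replaced or backed by a trained classifier, but the cost-saving structure is the same.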
Python · DSPy: declarative compound AI system with self-optimization
import dspy

# Configure LLM backend
lm = dspy.LM("openai/gpt-4o-mini", max_tokens=1024)
dspy.configure(lm=lm)

# Define signatures (input/output contracts)
class Summarize(dspy.Signature):
    """Summarize a long document into key points."""
    document: str = dspy.InputField()
    summary: str = dspy.OutputField(desc="3-5 bullet points")

class AnswerFromContext(dspy.Signature):
    """Answer a question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0-1.0")

# Build a compound pipeline using modules
class RAGPipeline(dspy.Module):
    def __init__(self, retriever):
        super().__init__()  # required so DSPy can register submodules
        self.retriever = retriever
        self.summarize = dspy.ChainOfThought(Summarize)
        self.answer = dspy.ChainOfThought(AnswerFromContext)

    def forward(self, question: str) -> dspy.Prediction:
        # Retrieve relevant docs
        docs = self.retriever(question)
        context = "\n\n".join(docs)

        # Summarize context first (reduces noise)
        summary = self.summarize(document=context).summary

        # Answer from summarized context
        return self.answer(context=summary, question=question)

# DSPy optimizes prompts automatically given a training set
# optimizer = dspy.BootstrapFewShot(metric=answer_exact_match)
# optimized_rag = optimizer.compile(RAGPipeline(retriever), trainset=train_data)

# Run pipeline
# prediction = optimized_rag("What is the capital of France?")
# print(f"Answer: {prediction.answer} ({prediction.confidence:.0%})")

Router Example

def route_query(query: str) -> str:
    """Route query to appropriate handler."""
    # Semantic routing: embed and classify
    embedding = embed_model.encode(query)
    if similarity(embedding, math_examples) > 0.7:
        return "math_handler"
    elif similarity(embedding, qa_examples) > 0.7:
        return "rag_handler"
    else:
        return "general_chat"

def process(query: str):
    handler = route_query(query)
    if handler == "math_handler":
        return math_solver(query)
    elif handler == "rag_handler":
        docs = retriever(query)
        return llm_generate(query, context=docs)
    else:
        return llm_generate(query)
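LLM routing swaps the embedding similarity for a classification call. A minimal sketch, where `llm_classify` is a stand-in for a real model call that returns a label:

```python
# Label -> handler mapping (handler names are illustrative)
ROUTES = {"math": "math_handler", "docs": "rag_handler", "other": "general_chat"}

def llm_route(query: str, llm_classify) -> str:
    """LLM routing: ask a model to label the query, then map label -> handler."""
    prompt = (
        f"Classify this query as one of {sorted(ROUTES)}: {query!r}. "
        "Reply with the label only."
    )
    label = llm_classify(prompt).strip().lower()
    # Fall back to the general handler on unexpected labels
    return ROUTES.get(label, "general_chat")
```

This is more flexible than semantic routing (the model can reason about edge cases) at the cost of an extra LLM call per query.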
05 — Validation

Verifier & Self-Check Loops

Generate output, then verify. If wrong, retry or refine. Useful for code, math, and factual claims.

Patterns

Self-check: Same model judges its own output. Cheaper but less reliable. Critic model: Separate model critiques. More reliable but higher cost. External validators: Run code, check facts against DB, validate structure.

Workflow

1. Generate. The LLM produces output (code, answer, plan).
2. Validate. External check or LLM critique produces a pass/fail signal.
3. Decide. If pass, return the output. If fail, retry or collect feedback.
4. Refine. Feed the error back to the LLM and retry with better context.
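The generate–validate–refine workflow fits in a small retry loop. A minimal sketch, assuming `generate` is an LLM call and `validate` returns a `(passed, error)` pair from an external check or critic:

```python
def generate_with_verifier(prompt, generate, validate, max_retries=3):
    """Generate -> validate -> refine loop with a bounded number of retries."""
    feedback = ""
    for _ in range(max_retries):
        output = generate(prompt + feedback)
        passed, error = validate(output)
        if passed:
            return output
        # Feed the failure reason back so the next attempt can correct it
        feedback = f"\n\nPrevious attempt failed: {error}. Please fix and retry."
    raise RuntimeError(f"validation failed after {max_retries} attempts")
```

Bounding the retries keeps the worst-case latency and cost predictable, which matters once this loop sits inside a larger pipeline.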
💡 Verifiers multiply latency but improve quality. For latency-critical systems, balance with budgets. For correctness-critical (code, finance), verifiers are essential.
06 — Operations

Production Considerations

Latency Budget

Compound systems add latency. Each LLM call: 500ms–2s. Each retrieval: 50–500ms. Each tool call: 100ms–10s. Design within your budget. For interactive systems, target <1s for standard requests.

Cost Tracking

Compound systems multiply costs. RAG: retrieval + generation. Routing: classification + handling. Verification: multiple LLM calls. Track cost per query. Optimize hot paths. Use cheaper models where acceptable.

Observability

Log every LLM call, tool invocation, and decision point. Track latency and cost breakdown. Identify bottlenecks. Monitor error rates by component. Alert on high-latency or high-cost queries.
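A per-query trace covering the latency and cost breakdown described above can be very small. A sketch with illustrative component names and figures:

```python
from dataclasses import dataclass, field

@dataclass
class QueryTrace:
    """Records (component, latency, cost) for every step of one query."""
    steps: list = field(default_factory=list)

    def record(self, component: str, latency_s: float, cost_usd: float):
        self.steps.append((component, latency_s, cost_usd))

    @property
    def total_latency_s(self) -> float:
        return sum(step[1] for step in self.steps)

    @property
    def total_cost_usd(self) -> float:
        return sum(step[2] for step in self.steps)

# Usage: one trace per query, one record per component (figures are made up)
trace = QueryTrace()
trace.record("retrieval", 0.12, 0.0)
trace.record("router_llm", 0.40, 0.0002)
trace.record("answer_llm", 1.30, 0.0041)
```

Emitting the per-component breakdown (not just the totals) is what makes bottleneck hunting and per-query cost alerts possible.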

Failure Modes

Hallucination

  • LLM invents facts
  • Mitigation: verify, cite sources

Bad Routing

  • Wrong handler called
  • Mitigation: test routing

Tool Failures

  • API down, network error
  • Mitigation: graceful degradation

Context Collapse

  • Too much data, model loses signal
  • Mitigation: compression, top-k
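Several of these mitigations fit in a few lines. A sketch of graceful degradation for tool or retrieval failures, with top-k truncation as a guard against context collapse (the helper names are hypothetical):

```python
def answer_with_fallback(query, retriever, llm_generate):
    """Tool-failure mitigation: degrade to a no-context answer if retrieval fails."""
    try:
        docs = retriever(query)        # external dependency; may raise on outage
        context = "\n".join(docs[:5])  # top-k truncation guards against context collapse
    except Exception:
        context = ""                   # graceful degradation: answer without retrieval
    return llm_generate(query, context=context)
```

Degrading to a weaker answer is usually better than failing the request outright, but the degraded path should be logged so outages are visible in monitoring.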
07 — Ecosystem

Tools & Frameworks

  • DSPy (framework) — Modular LLM programs. Type-safe signatures. Automatic prompt optimization. Research-friendly.
  • LMQL (language) — Query language for LM interaction. Constrained decoding. Control-flow integration.
  • LangChain (framework) — Agent framework. Tool use, retrieval, chains. Large ecosystem.
  • Marvin (framework) — Pythonic AI toolkit. Function calls and structured outputs. Built on Pydantic.
  • Outlines (library) — Constrained decoding and guided generation via JSON, regex, and grammars.
  • Semantic Router (framework) — Lightweight semantic routing layer for dispatching queries to handlers.
  • Guidance (library) — LLM control via templating. Constrained generation. Verbose output.
  • LiteLLM (router) — Unified API across LLM providers. Routing, fallbacks, caching.
08 — Further Reading

References

Research & Papers
  • Paper Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 ↗
  • Paper Wallace, E., et al. (2021). Eliciting Task-Relevant Properties from Large Language Models with Self-Supervision. arXiv:2011.05316 ↗