A framework for algorithmically optimising LLM prompts and weights instead of hand-crafting them.
Imagine you're tuning a radio. You can turn the dial by hand, inching toward the clearest signal — but every time the station drifts (new model version, different task, new data) you have to start over. That's hand-crafted prompting.
DSPy flips the workflow: you describe what you want (a signature), wire together how data flows (a program), and let an optimiser search for the best prompt wording and few-shot examples automatically — like an auto-tuner for your radio.
The payoff: prompts that generalise better, improve as your dataset grows, and don't silently break when you swap model providers.
DSPy (Declarative Self-improving Python) treats an LLM pipeline as a parameterised program — prompts and few-shot examples are learnable weights, not hardcoded strings. You write Python; DSPy compiles it into optimised instructions for whatever backend you choose.
```python
import dspy

# Configure the LM backend (Anthropic, OpenAI, local — swappable)
lm = dspy.LM("anthropic/claude-3-5-haiku-20241022", max_tokens=1024)
dspy.configure(lm=lm)
```
Three-line setup. Everything else flows from that.
Signature — declares inputs and outputs with optional docstring describing the task:
```python
class Sentiment(dspy.Signature):
    '''Classify the sentiment of a product review.'''
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
```
Module — a composable unit wrapping a signature with a strategy (Predict, ChainOfThought, ReAct…):
```python
classify = dspy.Predict(Sentiment)
result = classify(review="The keyboard feels mushy and the trackpad lags.")
print(result.sentiment)  # → negative
```
Program — a Python class subclassing dspy.Module that chains modules:
```python
class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```
Optimiser (Teleprompter) — tunes the program on a labelled dataset:
```python
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4)
optimised = optimizer.compile(RAGPipeline(), trainset=train_data)
```
```python
import dspy

lm = dspy.LM("anthropic/claude-3-5-haiku-20241022")
dspy.configure(lm=lm)

# 1. Define a signature
class QA(dspy.Signature):
    '''Answer a question with a single concise sentence.'''
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 2. Build a program
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.cot = dspy.ChainOfThought(QA)

    def forward(self, question):
        return self.cot(question=question)

# 3. Prepare labelled data (just a few examples to start)
trainset = [
    dspy.Example(question="What year was Python created?", answer="1991").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the boiling point of water in Celsius?", answer="100°C").with_inputs("question"),
]

# 4. Define a simple metric
def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# 5. Optimise
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=2)
optimised_qa = optimizer.compile(SimpleQA(), trainset=trainset)

# 6. Use it
print(optimised_qa(question="What language is TensorFlow written in?").answer)
```
After compile(), optimised_qa's internal prompt now includes automatically chosen few-shot examples that maximise your metric.
DSPy ships several teleprompters (optimisers), each with a different tradeoff:
| Optimiser | Strategy | Cost | Best for |
|---|---|---|---|
| LabeledFewShot | Samples demos from your labels directly | Very low | Quick baseline |
| BootstrapFewShot | Bootstraps demos by running the pipeline on train data | Low | When labels are sparse |
| BootstrapFewShotWithRandomSearch | BootstrapFewShot + random search over demo configs | Medium | Better accuracy on modest budgets |
| MIPRO | Bayesian optimisation over prompt instructions + demos | High | Production pipelines |
| BayesianSignatureOptimizer | Optimises instructions only (no demo search) | Medium | When demos are already good |
Rule of thumb: start with BootstrapFewShot; graduate to MIPRO when accuracy plateaus.
Use DSPy when:
- you have (or can build) a labelled dev set and a metric worth optimising against;
- your pipeline chains several LLM calls whose prompts interact;
- you expect to swap models or providers and want quality to survive the move.

Skip DSPy when:
- it's a one-off prompt with no dataset or metric to optimise against;
- a single hand-written prompt already hits the quality bar;
- your LLM-call budget can't absorb the cost of compilation.
Compilation cost: BootstrapFewShot calls the LLM many times (train set × candidate programs). Budget accordingly — run it overnight if your train set is large.
Metric design is hard: A weak metric (e.g., substring match for open-ended answers) produces badly optimised prompts. Invest time in your metric before investing in optimisation.
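For illustration, a modestly stronger metric than bare substring match: token-level F1 overlap, a common QA metric, sketched in plain Python. The function name `token_f1` is ours; the `(example, pred, trace=None)` signature follows DSPy's metric convention.

```python
def token_f1(example, pred, trace=None):
    """Token-level F1 between gold and predicted answers.

    More forgiving than exact match, stricter than substring containment:
    partial credit for overlapping words, no credit for disjoint answers.
    """
    gold = example.answer.lower().split()
    guess = pred.answer.lower().split()
    if not gold or not guess:
        return float(gold == guess)
    common = sum(min(gold.count(t), guess.count(t)) for t in set(guess))
    if common == 0:
        return 0.0
    precision = common / len(guess)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because it returns a float in [0, 1] rather than a boolean, the optimiser can distinguish "close" candidates from outright misses.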
Module stacking amplifies failures: In a 4-module pipeline, if module 1 gets the context slightly wrong, modules 2–4 compound the error. Add assertions (dspy.Assert) to catch bad intermediate outputs early.
```python
class SafeRAG(dspy.Module):
    # __init__ sets up self.retrieve and self.generate as in RAGPipeline above
    def forward(self, question):
        ctx = self.retrieve(question).passages
        dspy.Assert(len(ctx) > 0, "Retrieval returned nothing — check your corpus.")
        return self.generate(context=ctx, question=question)
```
Version your compiled programs: Save optimised program state with program.save("my_program.json") and reload with program.load("my_program.json"). Treat these like model checkpoints.
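One lightweight way to keep those checkpoints organised is self-describing filenames. This helper is illustrative, not part of DSPy: it embeds the program name, date, dev-set score, and a hash of the saved state.

```python
import datetime
import hashlib
import json

def checkpoint_name(program_name: str, state: dict, score: float) -> str:
    """Build a self-describing checkpoint filename: name, date, score, state hash."""
    digest = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()[:8]
    date = datetime.date.today().isoformat()
    return f"{program_name}_{date}_score{score:.3f}_{digest}.json"

# e.g. program.save(checkpoint_name("simple_qa", saved_state, dev_score))
```

The hash makes it obvious when two checkpoints are byte-identical despite different dates.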
DSPy replaces manual prompt engineering with a programmatic framework where LLM pipelines are defined as Python programs and automatically optimized by compilers. Rather than hand-crafting prompts, developers specify the input/output signature and success metric, then let DSPy optimizers find the prompt wording and few-shot examples that maximize the metric on a development set.
| Optimizer | Strategy | Calls to LLM | Best For |
|---|---|---|---|
| LabeledFewShot | Uses labeled examples directly | Low | When you have labeled data |
| BootstrapFewShot | Generates examples via teacher | Medium | Limited labeled data |
| MIPRO | Bayesian optimization of instructions | High | Max accuracy, big budget |
| BetterTogether | Alternates prompt optimization with weight finetuning | Very high | Production pipelines |
DSPy's BootstrapFewShot optimizer generates synthetic training examples by running the pipeline with a teacher model (often a stronger, more expensive model) and keeping examples where the pipeline produces correct outputs. These teacher-generated examples are then used as few-shot demonstrations for the student pipeline (typically a smaller, cheaper model). This knowledge distillation approach enables smaller models to achieve quality closer to large teacher models on specific tasks, without requiring human annotation of training examples.
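The core of that bootstrap loop, stripped of the framework: run the teacher-backed pipeline over the train set and keep only the traces the metric accepts. Everything here (the lookup-table "teacher", the helper names) is a stub for illustration.

```python
def bootstrap_demos(pipeline, trainset, metric, max_demos=4):
    """Collect few-shot demos: keep only examples where the teacher-backed
    pipeline's output passes the metric."""
    demos = []
    for example in trainset:
        pred = pipeline(example["question"])
        if metric(example, pred):
            demos.append({"question": example["question"], "answer": pred})
        if len(demos) >= max_demos:
            break
    return demos

# Stub "teacher" pipeline: a lookup table standing in for a strong model.
teacher_answers = {"capital of France?": "Paris", "2 + 2?": "5"}
pipeline = lambda q: teacher_answers.get(q, "")
metric = lambda ex, pred: ex["answer"].lower() == pred.lower()

train = [{"question": "capital of France?", "answer": "Paris"},
         {"question": "2 + 2?", "answer": "4"}]
print(bootstrap_demos(pipeline, train, metric))
# keeps only the first example: the teacher got "2 + 2?" wrong
```

The filtering step is what makes bootstrapping safe: a wrong teacher answer never becomes a demonstration.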
DSPy's typed predictor system uses Python type annotations to define the expected schema of LLM outputs, automatically adding JSON extraction and validation to the prompt. Typed outputs reduce parsing failures and downstream errors that occur when LLMs produce subtly malformed structured outputs. The type system also makes pipeline interfaces explicit and documentation-friendly — the input/output contract is visible in the Python function signature rather than buried in prompt text, improving maintainability of complex pipelines as they evolve.
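The idea behind typed outputs, sketched framework-free: declare the schema, extract the JSON object from raw model text, and validate it before anything downstream sees it. The names here (`SentimentOut`, `parse_typed`) are illustrative, not DSPy APIs.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class SentimentOut:
    sentiment: str    # expected: positive / negative / neutral
    confidence: float

def parse_typed(raw: str) -> SentimentOut:
    """Pull the first JSON object out of model text and validate its schema."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group())
    out = SentimentOut(sentiment=str(data["sentiment"]),
                       confidence=float(data["confidence"]))
    if out.sentiment not in {"positive", "negative", "neutral"}:
        raise ValueError(f"invalid sentiment: {out.sentiment}")
    return out

print(parse_typed('Sure! {"sentiment": "negative", "confidence": 0.91}'))
```

Failures surface as exceptions at the boundary, where a retry is cheap, rather than as corrupted state three modules later.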
DSPy assertions allow defining declarative constraints on intermediate outputs. A hard assertion like dspy.Assert(len(response) < 500, "Response too long") triggers backtracking when violated: the module retries with the failure message injected into the prompt, and raises an error if retries are exhausted. The soft counterpart, dspy.Suggest, provides the same corrective feedback but never terminates the program. During optimization, candidate programs that violate assertions are penalized, steering the search toward compliant prompts. These constraint mechanisms bridge the gap between the neural (LLM output) and symbolic (business logic) layers of an application, enforcing invariants without relying on post-hoc filtering.
DSPy's compilation metaphor — treating prompt optimization as compilation — changes the developer workflow compared to manual prompt engineering. Developers write the program logic in Python using DSPy modules, define a metric function that scores outputs, provide a small set of training examples (often 10–50), and call compile(). The compiler handles the tedious work of discovering effective instructions and demonstrations. This workflow is fundamentally more systematic than trial-and-error prompt engineering and produces prompts with measured quality against the metric rather than informal impressions from manual testing.
Multi-hop reasoning pipelines in DSPy decompose complex questions into a sequence of retrieval and reasoning steps, with each step retrieving relevant documents and synthesizing partial answers that feed the next step. A typical three-hop pipeline retrieves context for the initial question, identifies what follow-up information is needed, retrieves that follow-up context, and synthesizes a final answer from all retrieved context. DSPy optimizes each step's prompt jointly through a single compile call, rather than requiring the developer to tune each step independently.
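The control flow of such a pipeline, with toy stand-ins for the retrieval and reasoning modules (a word-overlap retriever, and string concatenation where an LM would write the follow-up query):

```python
import re

def tokens(text):
    """Lowercased word set, used for toy overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, k=2):
    """Toy retriever: rank passages by word overlap with the query."""
    return sorted(corpus, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)[:k]

def multi_hop(question, corpus, hops=2):
    """Each hop retrieves context, then forms the next query from everything
    gathered so far; a real pipeline would let an LM write the follow-up
    query and synthesise a final answer from the accumulated context."""
    context, query = [], question
    for _ in range(hops):
        for passage in retrieve(query, corpus):
            if passage not in context:
                context.append(passage)
        query = question + " " + " ".join(context)  # stub follow-up query step
    return context

corpus = [
    "DSPy compiles pipelines.",
    "Paris is in France.",
    "France is in Europe.",
]
print(multi_hop("Where is Paris?", corpus))
```

In DSPy both the follow-up-query step and the synthesis step would be modules with their own signatures, so one compile call tunes their prompts jointly.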
Evaluating DSPy programs requires choosing a metric function that accurately captures the quality of the task. For classification tasks, accuracy suffices. For generation tasks — summarization, question answering, code generation — more sophisticated metrics like F1 overlap, semantic similarity, or LLM-as-judge scoring provide better signal. The metric function is the most influential design decision in a DSPy program: a poorly calibrated metric produces optimized prompts that score well but perform poorly on real-world examples, while a well-calibrated metric directs optimization toward genuine quality improvements.