A framework for algorithmically optimising LLM prompts and weights instead of hand-crafting them.
Imagine you're tuning a radio. You can turn the dial by hand, inching toward the clearest signal — but every time the station drifts (new model version, different task, new data) you have to start over. That's hand-crafted prompting.
DSPy flips the workflow: you describe what you want (a signature), wire together how data flows (a program), and let an optimiser search for the best prompt wording and few-shot examples automatically — like an auto-tuner for your radio.
The payoff: prompts that generalise better, improve as your dataset grows, and don't silently break when you swap model providers.
DSPy (Declarative Self-improving Python) treats an LLM pipeline as a parameterised program — prompts and few-shot examples are learnable weights, not hardcoded strings. You write Python; DSPy compiles it into optimised instructions for whatever backend you choose.
```python
import dspy

# Configure the LM backend (Anthropic, OpenAI, local — swappable)
lm = dspy.LM("anthropic/claude-3-5-haiku-20241022", max_tokens=1024)
dspy.configure(lm=lm)
```
Three-line setup. Everything else flows from that.
Signature — declares inputs and outputs with optional docstring describing the task:
```python
class Sentiment(dspy.Signature):
    '''Classify the sentiment of a product review.'''
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
```
Module — a composable unit wrapping a signature with a strategy (Predict, ChainOfThought, ReAct…):
```python
classify = dspy.Predict(Sentiment)
result = classify(review="The keyboard feels mushy and the trackpad lags.")
print(result.sentiment)  # → negative
```
Program — a Python class subclassing dspy.Module that chains modules:
```python
class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```
Optimiser (Teleprompter) — tunes the program on a labelled dataset:
```python
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4)
optimised = optimizer.compile(RAGPipeline(), trainset=train_data)
```
```python
import dspy

lm = dspy.LM("anthropic/claude-3-5-haiku-20241022")
dspy.configure(lm=lm)

# 1. Define a signature
class QA(dspy.Signature):
    '''Answer a question with a single concise sentence.'''
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# 2. Build a program
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.cot = dspy.ChainOfThought(QA)

    def forward(self, question):
        return self.cot(question=question)

# 3. Prepare labelled data (just a few examples to start)
trainset = [
    dspy.Example(question="What year was Python created?", answer="1991").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the boiling point of water in Celsius?", answer="100°C").with_inputs("question"),
]

# 4. Define a simple metric
def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# 5. Optimise
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=2)
optimised_qa = optimizer.compile(SimpleQA(), trainset=trainset)

# 6. Use it
print(optimised_qa(question="What language is TensorFlow written in?").answer)
```
After compile(), optimised_qa's internal prompt now includes automatically chosen few-shot examples that maximise your metric.
DSPy ships several teleprompters (optimisers), each with a different tradeoff:
| Optimiser | Strategy | Cost | Best for |
|---|---|---|---|
| LabeledFewShot | Samples demos from your labels directly | Very low | Quick baseline |
| BootstrapFewShot | Bootstraps demos by running the pipeline on train data | Low | When labels are sparse |
| BootstrapFewShotWithRandomSearch | BootstrapFewShot + random search over demo configs | Medium | Better accuracy on modest budgets |
| MIPRO | Bayesian optimisation over prompt instructions + demos | High | Production pipelines |
| BayesianSignatureOptimizer | Optimises instructions only (no demo search) | Medium | When demos are already good |
Rule of thumb: start with BootstrapFewShot; graduate to MIPRO when accuracy plateaus.
Use DSPy when:
- you have (or can build) a labelled dev set and a metric worth optimising against;
- your pipeline chains several LLM calls whose prompts interact;
- you expect to swap models or providers and want quality to survive the move.

Skip DSPy when:
- it's a one-off prompt with no dataset or metric to optimise against;
- a single hand-written prompt already hits the quality bar;
- your LLM-call budget can't absorb the cost of compilation.
Compilation cost: BootstrapFewShot calls the LLM many times (train set × candidate programs). Budget accordingly — run it overnight if your train set is large.
Metric design is hard: A weak metric (e.g., substring match for open-ended answers) produces badly optimised prompts. Invest time in your metric before investing in optimisation.
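For illustration, a modestly stronger metric than bare substring match: token-level F1 overlap, a common QA metric, sketched in plain Python. The function name `token_f1` is ours; the `(example, pred, trace=None)` signature follows DSPy's metric convention.

```python
def token_f1(example, pred, trace=None):
    """Token-level F1 between gold and predicted answers.

    More forgiving than exact match, stricter than substring containment:
    partial credit for overlapping words, no credit for disjoint answers.
    """
    gold = example.answer.lower().split()
    guess = pred.answer.lower().split()
    if not gold or not guess:
        return float(gold == guess)
    common = sum(min(gold.count(t), guess.count(t)) for t in set(guess))
    if common == 0:
        return 0.0
    precision = common / len(guess)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because it returns a float in [0, 1] rather than a boolean, the optimiser can distinguish "close" candidates from outright misses.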
Module stacking amplifies failures: In a 4-module pipeline, if module 1 gets the context slightly wrong, modules 2–4 compound the error. Add assertions (dspy.Assert) to catch bad intermediate outputs early.
```python
class SafeRAG(dspy.Module):
    # __init__ sets up self.retrieve and self.generate as in RAGPipeline above
    def forward(self, question):
        ctx = self.retrieve(question).passages
        dspy.Assert(len(ctx) > 0, "Retrieval returned nothing — check your corpus.")
        return self.generate(context=ctx, question=question)
```
Version your compiled programs: Save optimised program state with program.save("my_program.json") and reload with program.load("my_program.json"). Treat these like model checkpoints.
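One lightweight way to keep those checkpoints organised is self-describing filenames. This helper is illustrative, not part of DSPy: it embeds the program name, date, dev-set score, and a hash of the saved state.

```python
import datetime
import hashlib
import json

def checkpoint_name(program_name: str, state: dict, score: float) -> str:
    """Build a self-describing checkpoint filename: name, date, score, state hash."""
    digest = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()[:8]
    date = datetime.date.today().isoformat()
    return f"{program_name}_{date}_score{score:.3f}_{digest}.json"

# e.g. program.save(checkpoint_name("simple_qa", saved_state, dev_score))
```

The hash makes it obvious when two checkpoints are byte-identical despite different dates.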
DSPy replaces manual prompt engineering with a programmatic framework where LLM pipelines are defined as Python programs and automatically optimized by compilers. Rather than hand-crafting prompts, developers specify the input/output signature and success metric, then let DSPy optimizers find the prompt wording and few-shot examples that maximize the metric on a development set.
| Optimizer | Strategy | Calls to LLM | Best For |
|---|---|---|---|
| LabeledFewShot | Uses labeled examples directly | Low | When you have labeled data |
| BootstrapFewShot | Generates examples via teacher | Medium | Limited labeled data |
| MIPRO | Bayesian optimization of instructions | High | Max accuracy, big budget |
| BetterTogether | Alternates prompt optimization with weight finetuning | Very high | Production pipelines |
DSPy's BootstrapFewShot optimizer generates synthetic training examples by running the pipeline with a teacher model (often a stronger, more expensive model) and keeping examples where the pipeline produces correct outputs. These teacher-generated examples are then used as few-shot demonstrations for the student pipeline (typically a smaller, cheaper model). This knowledge distillation approach enables smaller models to achieve quality closer to large teacher models on specific tasks, without requiring human annotation of training examples.
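The core of that bootstrap loop, stripped of the framework: run the teacher-backed pipeline over the train set and keep only the traces the metric accepts. Everything here (the lookup-table "teacher", the helper names) is a stub for illustration.

```python
def bootstrap_demos(pipeline, trainset, metric, max_demos=4):
    """Collect few-shot demos: keep only examples where the teacher-backed
    pipeline's output passes the metric."""
    demos = []
    for example in trainset:
        pred = pipeline(example["question"])
        if metric(example, pred):
            demos.append({"question": example["question"], "answer": pred})
        if len(demos) >= max_demos:
            break
    return demos

# Stub "teacher" pipeline: a lookup table standing in for a strong model.
teacher_answers = {"capital of France?": "Paris", "2 + 2?": "5"}
pipeline = lambda q: teacher_answers.get(q, "")
metric = lambda ex, pred: ex["answer"].lower() == pred.lower()

train = [{"question": "capital of France?", "answer": "Paris"},
         {"question": "2 + 2?", "answer": "4"}]
print(bootstrap_demos(pipeline, train, metric))
# keeps only the first example: the teacher got "2 + 2?" wrong
```

The filtering step is what makes bootstrapping safe: a wrong teacher answer never becomes a demonstration.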
DSPy's typed predictor system uses Python type annotations to define the expected schema of LLM outputs, automatically adding JSON extraction and validation to the prompt. Typed outputs reduce parsing failures and downstream errors that occur when LLMs produce subtly malformed structured outputs. The type system also makes pipeline interfaces explicit and documentation-friendly — the input/output contract is visible in the Python function signature rather than buried in prompt text, improving maintainability of complex pipelines as they evolve.
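The idea behind typed outputs, sketched framework-free: declare the schema, extract the JSON object from raw model text, and validate it before anything downstream sees it. The names here (`SentimentOut`, `parse_typed`) are illustrative, not DSPy APIs.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class SentimentOut:
    sentiment: str    # expected: positive / negative / neutral
    confidence: float

def parse_typed(raw: str) -> SentimentOut:
    """Pull the first JSON object out of model text and validate its schema."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group())
    out = SentimentOut(sentiment=str(data["sentiment"]),
                       confidence=float(data["confidence"]))
    if out.sentiment not in {"positive", "negative", "neutral"}:
        raise ValueError(f"invalid sentiment: {out.sentiment}")
    return out

print(parse_typed('Sure! {"sentiment": "negative", "confidence": 0.91}'))
```

Failures surface as exceptions at the boundary, where a retry is cheap, rather than as corrupted state three modules later.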
DSPy assertions allow defining declarative constraints on intermediate outputs. A hard assertion like dspy.Assert(len(response) < 500, "Response too long") triggers backtracking when violated: the module retries with the failure message injected into the prompt, and raises an error if retries are exhausted. The soft counterpart, dspy.Suggest, provides the same corrective feedback but never terminates the program. During optimization, candidate programs that violate assertions are penalized, steering the search toward compliant prompts. These constraint mechanisms bridge the gap between the neural (LLM output) and symbolic (business logic) layers of an application, enforcing invariants without relying on post-hoc filtering.
DSPy's compilation metaphor — treating prompt optimization as compilation — changes the developer workflow compared to manual prompt engineering. Developers write the program logic in Python using DSPy modules, define a metric function that scores outputs, provide a small set of training examples (often 10–50), and call compile(). The compiler handles the tedious work of discovering effective instructions and demonstrations. This workflow is fundamentally more systematic than trial-and-error prompt engineering and produces prompts with measured quality against the metric rather than informal impressions from manual testing.
Multi-hop reasoning pipelines in DSPy decompose complex questions into a sequence of retrieval and reasoning steps, with each step retrieving relevant documents and synthesizing partial answers that feed the next step. A typical three-hop pipeline retrieves context for the initial question, identifies what follow-up information is needed, retrieves that follow-up context, and synthesizes a final answer from all retrieved context. DSPy optimizes each step's prompt jointly through a single compile call, rather than requiring the developer to tune each step independently.
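The control flow of such a pipeline, with toy stand-ins for the retrieval and reasoning modules (a word-overlap retriever, and string concatenation where an LM would write the follow-up query):

```python
import re

def tokens(text):
    """Lowercased word set, used for toy overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, k=2):
    """Toy retriever: rank passages by word overlap with the query."""
    return sorted(corpus, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)[:k]

def multi_hop(question, corpus, hops=2):
    """Each hop retrieves context, then forms the next query from everything
    gathered so far; a real pipeline would let an LM write the follow-up
    query and synthesise a final answer from the accumulated context."""
    context, query = [], question
    for _ in range(hops):
        for passage in retrieve(query, corpus):
            if passage not in context:
                context.append(passage)
        query = question + " " + " ".join(context)  # stub follow-up query step
    return context

corpus = [
    "DSPy compiles pipelines.",
    "Paris is in France.",
    "France is in Europe.",
]
print(multi_hop("Where is Paris?", corpus))
```

In DSPy both the follow-up-query step and the synthesis step would be modules with their own signatures, so one compile call tunes their prompts jointly.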
Evaluating DSPy programs requires choosing a metric function that accurately captures the quality of the task. For classification tasks, accuracy suffices. For generation tasks — summarization, question answering, code generation — more sophisticated metrics like F1 overlap, semantic similarity, or LLM-as-judge scoring provide better signal. The metric function is the most influential design decision in a DSPy program: a poorly calibrated metric produces optimized prompts that score well but perform poorly on real-world examples, while a well-calibrated metric directs optimization toward genuine quality improvements.