01 — The Problem
Beyond Manual Prompt Engineering
Manual prompting: write a prompt, test on examples, tweak wording, repeat. Brittle — small wording changes cause large quality swings. Doesn't scale.
Programmatic prompting: define what you want (task signature, metric), let a framework optimize how to get it (prompt wording, few-shot examples, chain structure)
Key insight: prompts are hyperparameters. They should be optimized on training data, not tuned by intuition.
⚠️
Manual prompt engineering reaches a ceiling quickly. For tasks with >100 labeled examples, programmatic optimization (DSPy, OPRO, APE) consistently outperforms hand-tuned prompts.
02 — Foundation
Prompt Templates and Jinja2
Template engines: parameterize prompts so dynamic content is cleanly separated from instructions.
Jinja2: Python's standard template engine. Supports conditionals, loops, filters — useful for complex prompt construction.
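Before any LangChain wrapping, a minimal standalone Jinja2 render shows the conditional-section pattern used below:

```python
from jinja2 import Template

# Conditional block: the context section only appears when context is non-empty
template = Template(
    "You are a {{ role }} assistant."
    "{% if context %}\nUse this context:\n{{ context }}{% endif %}"
)

print(template.render(role="support", context=""))
# context is falsy, so the conditional section is omitted entirely
```

Passing a non-empty `context` re-renders the full section; the instructions stay fixed while the dynamic content varies.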
LangChain PromptTemplate: supports f-string (default) or Jinja2 template formats and adds LLM-specific tooling (message formatting, partial templates, composition)
Example: Jinja2 + LangChain Prompt Template
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# System template with conditional sections (Jinja2 syntax)
system_template = """You are a {{ role }} assistant.
{% if context %}
Use this context to answer questions:
{{ context }}
{% endif %}
{% if output_format == "json" %}
Always respond in valid JSON.
{% endif %}
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_template),
        MessagesPlaceholder("history"),  # dynamic conversation history
        ("human", "{{ question }}"),
    ],
    template_format="jinja2",  # default is f-string; {% if %} blocks need Jinja2
)
# Compose with LLM
chain = prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()
result = chain.invoke({
    "role": "financial analyst",
    "context": retrieved_docs,          # e.g. concatenated retrieval results
    "output_format": "json",
    "history": conversation_history,    # list of prior chat messages
    "question": "What was Q3 revenue growth?"
})
03 — Framework
DSPy: Compile Your Prompts
DSPy (Declarative Self-improving Python): define your task as a typed Signature, compose Modules (Predict, ChainOfThought, ReAct), then compile with an Optimizer that finds the best prompts + few-shot examples automatically.
No manual prompt strings in your code. The optimizer writes the prompts.
Signatures: declare inputs, outputs, and docstring description of the task. DSPy infers the prompt.
Example: DSPy Classification Pipeline
from typing import Literal

import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# 1. Define task signature
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of customer feedback."""
    feedback: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0 to 1.0")

# 2. Build module
classifier = dspy.Predict(SentimentClassifier)

# 3. Compile with optimizer (finds best few-shot examples)
# accuracy_metric: callable(example, prediction, trace=None) -> bool/float
# train_examples: list of dspy.Example with feedback/sentiment fields
optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4)
compiled = optimizer.compile(classifier, trainset=train_examples)

# 4. Use
result = compiled(feedback="The delivery was fast but packaging was damaged")
print(result.sentiment, result.confidence)
DSPy Modules
| Module | What it does | When to use |
| --- | --- | --- |
| Predict | Single LLM call with signature | Classification, extraction |
| ChainOfThought | Adds reasoning field | Math, logic, analysis |
| ReAct | Tool-use + reasoning loop | Agents, multi-step tasks |
| MultiChainComparison | Multiple chains, pick best | High-stakes decisions |
| Retrieve | RAG retrieval step | Any RAG pipeline |
04 — Optimization
DSPy Optimizers
BootstrapFewShot: runs your program on training examples, identifies successful traces, uses them as few-shot examples — automatic few-shot selection.
MIPRO (v2): optimizes both instructions AND few-shot examples simultaneously using Bayesian search over prompt candidates.
BootstrapFinetune: instead of in-context few-shot, fine-tunes the model weights on bootstrapped traces.
Example: MIPRO Optimization
from dspy.teleprompt import MIPROv2
optimizer = MIPROv2(
metric=my_metric,
auto="medium", # "light" / "medium" / "heavy" — controls search budget
num_threads=8
)
compiled_program = optimizer.compile(
my_program,
trainset=train_data, # labeled examples for optimization
valset=val_data, # held-out for optimizer eval
requires_permission_to_run=False
)
# compiled_program has optimized prompts + few-shot examples
# Typically 10-30% better than manually written prompts
✓
MIPROv2 with auto="medium" is the current recommended default for most DSPy programs. It takes 30–60 minutes but finds prompts that consistently outperform manual engineering.
05 — Composition
LangChain Expression Language (LCEL)
LCEL: pipe-based composition of LangChain components. Chain = prompt | model | parser.
Supports: streaming, async, parallel branches, fallbacks, retries — all composable
Example: Parallel Chain with LCEL
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# summary_prompt, risks_prompt, synthesis_prompt: ChatPromptTemplates defined elsewhere
# retriever: any LangChain retriever (e.g. vector_store.as_retriever())

# Two parallel analysis paths
parallel_chain = RunnableParallel(
    summary=summary_prompt | llm | StrOutputParser(),
    risks=risks_prompt | llm | StrOutputParser(),
    original=RunnablePassthrough()
)

# Full pipeline: retrieve → analyze in parallel → synthesize
full_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | parallel_chain
    | synthesis_prompt
    | llm
    | StrOutputParser()
)

result = full_chain.invoke("What are the key risks in Q3 earnings?")
# result is the final synthesized string; the parallel summary and risks
# outputs feed synthesis_prompt as intermediate variables
06 — Advanced
Automatic Prompt Optimization (APE, OPRO)
APE (Automatic Prompt Engineer): generate candidate instruction paraphrases using LLM, evaluate each on dev set, select best-performing instruction.
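The APE recipe reduces to generate-score-select. A minimal sketch with the LLM paraphrase step stubbed out: `propose` and `score` are hypothetical placeholders (in practice `propose` calls an LLM and `score` runs the dev set):

```python
def ape_select(seed_instruction, dev_set, score, propose, n_candidates=8):
    """Generate candidate instructions, score each on a dev set, keep the best."""
    candidates = [seed_instruction] + [
        propose(seed_instruction) for _ in range(n_candidates)
    ]
    scored = [(score(c, dev_set), c) for c in candidates]
    best_score, best = max(scored, key=lambda t: t[0])
    return best, best_score

# Toy stand-ins: propose appends a clarifying suffix; score prefers longer prompts
best, s = ape_select(
    "Classify the sentiment.",
    dev_set=[],
    score=lambda c, _: len(c),
    propose=lambda c: c + " Answer with one word: positive, negative, or neutral.",
)
print(best)
```

Replacing the toy `score` with a real task metric over labeled examples is what makes the selection meaningful.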
OPRO (Optimization by Prompting): frame prompt optimization as a meta-prompt problem — feed current prompt + performance scores to LLM, ask it to suggest improvements, iterate.
Example: Simple OPRO Loop
def opro_optimize(task_desc: str, examples: list, metric, iterations=10):
    # llm: any chat model, e.g. ChatOpenAI(...)
    # evaluate: runs the prompt on examples and returns the metric score
    current_prompt = task_desc
    history = []
    for i in range(iterations):
        score = evaluate(current_prompt, examples, metric)
        history.append({"prompt": current_prompt, "score": score})
        # Ask the LLM to improve the prompt, showing the last 5 attempts
        meta_prompt = f"""
You are optimizing an LLM prompt. Here are previous attempts and their scores:
{history[-5:]}
The task: {task_desc}
Suggest a better prompt that might score higher. Output ONLY the new prompt."""
        current_prompt = llm.invoke(meta_prompt).content  # chat models return a message
    # Note: the candidate generated on the final pass is never scored
    return max(history, key=lambda x: x["score"])["prompt"]
⚠️
OPRO and APE require labeled evaluation data. The quality of your metric function directly caps the quality of the optimized prompt. Garbage metric → garbage prompt.
07 — Decision Guide
When to Use Each Approach
1
Manual Prompting First — the baseline
Always start here. If you can solve the task with a well-written prompt and <50 examples, you don't need programmatic optimization. Spend time on your evaluation metric instead.
2
DSPy When You Have Labeled Data — the standard
If you have 100+ labeled (input, output) examples and a clear metric, DSPy optimization will outperform manual prompting. Start with BootstrapFewShot.
3
LCEL for Complex Chains — the pattern
When your pipeline has multiple LLM calls, parallel branches, retrieval, or conditional routing, LCEL's composability and streaming support pay off.
4
Fine-tuning as the Final Step — the ultimate
When programmatic prompting plateaus, use DSPy's BootstrapFinetune or standard SFT to bake the optimized behavior into model weights.
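Step 1 above stresses the evaluation metric. A minimal exact-match metric in the (example, prediction, trace) shape DSPy optimizers expect; the `sentiment` field name is illustrative:

```python
from types import SimpleNamespace

def exact_match(example, prediction, trace=None) -> bool:
    """True iff the predicted label matches the gold label (case-insensitive)."""
    return example.sentiment.strip().lower() == prediction.sentiment.strip().lower()

# Works on any objects exposing the field, e.g. dspy.Example / dspy.Prediction
gold = SimpleNamespace(sentiment="positive")
pred = SimpleNamespace(sentiment=" Positive ")
print(exact_match(gold, pred))  # → True
```

The same callable can be passed as `metric=` to BootstrapFewShot or MIPROv2; a sloppy metric here caps everything downstream.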
Further Reading
References
Academic Papers
- Paper Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 ↗
- Paper Yang, C. et al. (2023). Large Language Models as Optimizers (OPRO). arXiv:2309.03409 ↗
- Paper Zhou, Y. et al. (2022). Large Language Models Are Human-Level Prompt Engineers (APE). arXiv:2211.01910 ↗
Practitioner Resources
- Blog Khattab, O. (2023). DSPy: Compiling Language Models into Self-Improving Pipelines. Intro and walkthrough — dspy.ai/blog ↗