Prompt Engineering

Programmatic Prompting

DSPy, LangChain chains, prompt templates, and optimizing prompts with labeled data instead of manual iteration

declare → compile → optimize: the DSPy workflow
few-shot auto-selection: the key innovation
metric-driven refinement: the shift from manual prompting
Contents
  1. Beyond manual prompting
  2. Prompt templates
  3. DSPy: compile prompts
  4. DSPy optimizers
  5. LangChain LCEL
  6. Automatic optimization
  7. When to use each
01 — The Problem

Beyond Manual Prompt Engineering

Manual prompting: write a prompt, test on examples, tweak wording, repeat. Brittle — small wording changes cause large quality swings. Doesn't scale.

Programmatic prompting: define what you want (task signature, metric) and let a framework optimize how to get it (prompt wording, few-shot examples, chain structure).

Key insight: prompts are hyperparameters. They should be optimized on training data, not tuned by intuition.

⚠️ Manual prompt engineering reaches a ceiling quickly. For tasks with >100 labeled examples, programmatic optimization (DSPy, OPRO, APE) consistently outperforms hand-tuned prompts.
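The "prompts are hyperparameters" framing can be made concrete with a toy search loop in plain Python. The model and dev set below are stubs invented for illustration, but the select-by-metric structure is what DSPy, APE, and OPRO automate at scale:

```python
# Minimal sketch: treating prompt wording as a searchable hyperparameter.
# mock_model is a stand-in for an LLM call; in practice each candidate
# prompt would be scored by running a real model over a labeled dev set.

def mock_model(prompt: str, text: str) -> str:
    # Stub: the more specific prompt "works better" on this toy data.
    return "positive" if ("sentiment" in prompt and "great" in text) else "negative"

dev_set = [("great product", "positive"), ("terrible service", "negative")]

candidates = [
    "Classify this.",
    "Classify the sentiment of the text as positive or negative.",
]

def score(prompt: str) -> float:
    # Accuracy of a candidate prompt on the labeled dev set
    hits = sum(mock_model(prompt, x) == y for x, y in dev_set)
    return hits / len(dev_set)

best = max(candidates, key=score)
print(best, score(best))
```

The loop is deliberately trivial; real optimizers search a far larger space (instructions, demos, chain structure), but the selection criterion is the same: measured performance on labeled data, not intuition.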
02 — Foundation

Prompt Templates and Jinja2

Template engines: parameterize prompts so dynamic content is cleanly separated from instructions.

Jinja2: the de facto standard template engine in the Python ecosystem. Supports conditionals, loops, filters — useful for complex prompt construction.

LangChain PromptTemplate: template classes with LLM-specific tooling (message formatting, partial templates, composition); supports both f-string and Jinja2 template formats.

Example: Jinja2 + LangChain Prompt Template

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# System template with conditional sections (Jinja2 syntax)
system_template = """You are a {{ role }} assistant.
{% if context %}
Use this context to answer questions:
{{ context }}
{% endif %}
{% if output_format == "json" %}
Always respond in valid JSON.
{% endif %}
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_template),
        MessagesPlaceholder("history"),  # dynamic conversation history
        ("human", "{{ question }}"),
    ],
    template_format="jinja2",  # default is f-string; Jinja2 must be opted into
)

# Compose with LLM
chain = prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()

result = chain.invoke({
    "role": "financial analyst",
    "context": retrieved_docs,            # assumed defined elsewhere
    "output_format": "json",
    "history": conversation_history,      # assumed defined elsewhere
    "question": "What was Q3 revenue growth?",
})
03 — Framework

DSPy: Compile Your Prompts

DSPy (Declarative Self-improving Python): define your task as a typed Signature, compose Modules (Predict, ChainOfThought, ReAct), then compile with an Optimizer that finds the best prompts + few-shot examples automatically.

No manual prompt strings in your code. The optimizer writes the prompts.

Signatures: declare inputs, outputs, and docstring description of the task. DSPy infers the prompt.

Example: DSPy Classification Pipeline

from typing import Literal

import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# 1. Define task signature
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of customer feedback."""
    feedback: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0 to 1.0")

# 2. Build module
classifier = dspy.Predict(SentimentClassifier)

# 3. Compile with optimizer (finds best few-shot examples)
# accuracy_metric and train_examples assumed defined elsewhere
optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4)
compiled = optimizer.compile(classifier, trainset=train_examples)

# 4. Use
result = compiled(feedback="The delivery was fast but packaging was damaged")
print(result.sentiment, result.confidence)

DSPy Modules

Module               | What it does                   | When to use
Predict              | Single LLM call with signature | Classification, extraction
ChainOfThought       | Adds reasoning field           | Math, logic, analysis
ReAct                | Tool-use + reasoning loop      | Agents, multi-step tasks
MultiChainComparison | Multiple chains, pick best     | High-stakes decisions
Retrieve             | RAG retrieval step             | Any RAG pipeline
04 — Optimization

DSPy Optimizers

BootstrapFewShot: runs your program on training examples, identifies successful traces, uses them as few-shot examples — automatic few-shot selection.
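The bootstrapping loop can be sketched in a few lines of plain Python. The mock model, metric, and training data below are stand-ins for illustration; real DSPy records full program traces rather than single calls:

```python
# Sketch of bootstrapped few-shot selection: run the program on training
# examples, keep the traces the metric accepts, reuse them as demos.

def mock_llm(prompt: str) -> str:
    # Stub: "answers correctly" only when the final line mentions a refund.
    last_line = prompt.strip().splitlines()[-1]
    return "billing" if "refund" in last_line else "other"

def metric(pred: str, gold: str) -> bool:
    return pred == gold

trainset = [("I want a refund", "billing"), ("App crashes on start", "bug")]

demos = []
for x, y in trainset:
    pred = mock_llm(f"Classify the ticket:\n{x}")
    if metric(pred, y):                      # keep only successful traces
        demos.append((x, pred))

# "Compiled" prompt = instructions + bootstrapped demos + new input
def compiled_prompt(x: str) -> str:
    shots = "\n".join(f"Ticket: {d}\nLabel: {l}" for d, l in demos)
    return f"Classify the ticket:\n{shots}\nTicket: {x}\nLabel:"

print(demos)  # only the trace that passed the metric survives as a demo
```

Note the failed example is silently dropped: the quality of the metric decides which demonstrations make it into the compiled prompt.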

MIPRO (v2): optimizes both instructions AND few-shot examples simultaneously using Bayesian search over prompt candidates.

BootstrapFinetune: instead of in-context few-shot, fine-tunes the model weights on bootstrapped traces.

Example: MIPRO Optimization

from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(
    metric=my_metric,
    auto="medium",   # "light" / "medium" / "heavy" — controls search budget
    num_threads=8,
)

compiled_program = optimizer.compile(
    my_program,
    trainset=train_data,   # labeled examples for optimization
    valset=val_data,       # held-out for optimizer eval
    requires_permission_to_run=False,
)

# compiled_program has optimized prompts + few-shot examples
# Typically 10-30% better than manually written prompts
MIPROv2 with auto="medium" is the current recommended default for most DSPy programs. It takes 30–60 minutes but finds prompts that consistently outperform manual engineering.
05 — Composition

LangChain Expression Language (LCEL)

LCEL: pipe-based composition of LangChain components. Chain = prompt | model | parser.

Supports: streaming, async, parallel branches, fallbacks, retries — all composable
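Under the hood, LCEL components overload the `|` operator so that `a | b` yields a new runnable feeding a's output into b. A simplified sketch (the class here is illustrative, not LangChain's; the real RunnableSequence also handles streaming, async, and batching):

```python
class Runnable:
    """Simplified stand-in for LangChain's Runnable protocol."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, x):
        return self.fn(x)

    def __or__(self, other):
        # a | b  ->  a runnable that pipes a's output into b
        return Runnable(lambda x: other.invoke(self.invoke(x)))

prompt = Runnable(lambda q: f"Answer briefly: {q}")
model = Runnable(lambda p: {"content": p.upper()})  # fake LLM response
parser = Runnable(lambda r: r["content"])

chain = prompt | model | parser
print(chain.invoke("what is LCEL?"))  # → "ANSWER BRIEFLY: WHAT IS LCEL?"
```

Because every component exposes the same `invoke` interface, any stage can be swapped, wrapped with retries, or run in parallel without changing the rest of the chain.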

Example: Parallel Chain with LCEL

from langchain_core.runnables import RunnableParallel, RunnablePassthrough from langchain_openai import ChatOpenAI llm = ChatOpenAI(model="gpt-4o-mini") # Two parallel analysis paths parallel_chain = RunnableParallel( summary=summary_prompt | llm | StrOutputParser(), risks=risks_prompt | llm | StrOutputParser(), original=RunnablePassthrough() ) # Full pipeline: retrieve → analyze in parallel → synthesize full_chain = ( {"context": retriever, "question": RunnablePassthrough()} | parallel_chain | synthesis_prompt | llm | StrOutputParser() ) result = full_chain.invoke("What are the key risks in Q3 earnings?") # result contains summary, risks, and synthesis in one call
06 — Advanced

Automatic Prompt Optimization (APE, OPRO)

APE (Automatic Prompt Engineer): generate candidate instruction paraphrases with an LLM, evaluate each on a dev set, and select the best-performing instruction.

OPRO (Optimization by Prompting): frame prompt optimization as a meta-prompt problem — feed the current prompt and its performance scores to an LLM, ask it to suggest improvements, and iterate.

Example: Simple OPRO Loop

def opro_optimize(task_desc: str, examples: list, metric, iterations=10):
    current_prompt = task_desc
    history = []
    for i in range(iterations):
        score = evaluate(current_prompt, examples, metric)  # evaluate() assumed defined
        history.append({"prompt": current_prompt, "score": score})
        # Ask the LLM to improve the prompt, showing the last 5 attempts
        meta_prompt = f"""
You are optimizing an LLM prompt.
Here are previous attempts and their scores:
{history[-5:]}

The task: {task_desc}

Suggest a better prompt that might score higher.
Output ONLY the new prompt."""
        current_prompt = llm.invoke(meta_prompt).content  # chat models return a message object
    return max(history, key=lambda x: x["score"])["prompt"]
⚠️ OPRO and APE require labeled evaluation data. The quality of your metric function directly caps the quality of the optimized prompt. Garbage metric → garbage prompt.
07 — Decision Guide

When to Use Each Approach

1. Manual Prompting First — the baseline

Always start here. If you can solve the task with a well-written prompt and <50 examples, you don't need programmatic optimization. Spend time on your evaluation metric instead.
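Since the metric is where the leverage is, here is a minimal sketch of the kind of metric function the optimizers above consume. Exact-match is used for simplicity; the (example, prediction) signature mirrors what DSPy-style optimizers expect, but nothing here is framework-specific:

```python
def exact_match_metric(example: dict, prediction: dict) -> bool:
    """Strict exact-match: normalize whitespace and case, then compare.
    Real tasks often need fuzzier scoring (F1, semantic similarity,
    LLM-as-judge), but the contract is the same: gold + prediction in,
    score out."""
    gold = example["answer"].strip().lower()
    pred = prediction["answer"].strip().lower()
    return gold == pred

def accuracy(dataset, predictions) -> float:
    # Aggregate metric over a labeled dev set
    hits = sum(exact_match_metric(ex, pr) for ex, pr in zip(dataset, predictions))
    return hits / len(dataset)

data = [{"answer": "Paris"}, {"answer": "42"}]
preds = [{"answer": " paris "}, {"answer": "41"}]
print(accuracy(data, preds))  # → 0.5
```

A metric like this is reusable across manual baselines, DSPy compilation, and OPRO loops, which is exactly why it deserves the effort before any optimizer does.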

2. DSPy When You Have Labeled Data — the standard

If you have 100+ labeled (input, output) examples and a clear metric, DSPy optimization will outperform manual prompting. Start with BootstrapFewShot.

3. LCEL for Complex Chains — the pattern

When your pipeline has multiple LLM calls, parallel branches, retrieval, or conditional routing, LCEL's composability and streaming support pay off.

4. Fine-tuning as the Final Step — the ultimate

When programmatic prompting plateaus, use DSPy's BootstrapFinetune or standard SFT to bake the optimized behavior into model weights.

Tools Grid

Category         | Tool              | What it offers
Framework        | DSPy              | Declarative signatures, automatic compilation, optimization
Composition      | LangChain (LCEL)  | Pipe-based composition, streaming, async
Orchestration    | LangGraph         | Stateful chains, conditional routing, loops
Output Structure | instructor        | Pydantic validation, structured outputs
Testing          | PromptFoo         | Prompt testing, comparison, CI/CD
Monitoring       | Braintrust        | Evaluation, monitoring, dataset management
Optimization     | W&B Prompts       | Prompt versioning, experimentation
IDE              | OpenAI Playground | Quick prototyping and testing