Combining prompting, retrieval, agents, and fine-tuning into a working system. Learn how to choose the right tool for each problem.
Most tutorials teach each GenAI technique in isolation: here's how to prompt, here's how to build RAG, here's how to build agents. But in practice, you rarely use just one technique. Real applications combine multiple blocks depending on the problem constraints. This page teaches you the decision framework for choosing between them and how they fit together.
Prompting is the foundation. Feed an LLM a task and get an answer. No external data, no tools, no loops. It's fast, cheap, and simple, but it only works for tasks the model can solve from its training data.
Retrieval-Augmented Generation (RAG) solves the knowledge problem. When your task requires specific facts, documents, or domain data the model doesn't know, retrieve relevant context from a vector database or search index and inject it into the prompt. Now the model answers from facts, not from faulty memory.
Agents solve the action problem. When the task requires multiple steps, interaction with external systems, or decisions based on intermediate observations, give the LLM tools and let it call them in a loop. It reasons, acts, observes, and repeats until the goal is met.
Fine-Tuning optimizes for task-specific performance. When prompting doesn't give enough accuracy, or when you need task-specific reasoning patterns or output formats, train the model on examples of correct behavior. This improves quality but costs time and data.
Data Engineering is the glue. Clean, structured, labeled data feeds every technique above. Without good data, prompting fails, RAG retrieves garbage, agents make wrong decisions, and fine-tuning learns from noise.
Use this decision tree to choose the right combination of techniques for your problem. Answer each question in order and follow the path.
Start at the top with "Can prompting alone solve this?" and follow branches down. Each decision eliminates options and narrows the solution. The tree encodes the principle: use the simplest approach that solves the problem. Never use agents when RAG works. Never fine-tune when prompting is sufficient.
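That "simplest approach first" flow can be sketched as a plain function. The question names and return labels below are illustrative, not taken from the tree itself:

```python
def choose_technique(needs_external_facts: bool,
                     needs_multi_step_actions: bool,
                     prompt_accuracy_sufficient: bool) -> list[str]:
    """Pick the simplest combination of blocks that fits the problem.

    Encodes the principle: never use agents when RAG works,
    never fine-tune when prompting is sufficient.
    """
    blocks = ["prompting"]  # prompting is always the foundation
    if needs_external_facts:
        blocks.append("rag")
    if needs_multi_step_actions:
        blocks.append("agents")
    if not prompt_accuracy_sufficient:
        blocks.append("fine-tuning")
    return blocks

# Trivia Q&A the base model already handles:
print(choose_technique(False, False, True))   # ['prompting']
# Customer support over internal policy docs:
print(choose_technique(True, False, True))    # ['prompting', 'rag']
```

Each added block increases latency, cost, and failure surface, which is why the tree only adds one when the answer to its question forces it.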
Each block solves a specific class of problems. Know what each is for, when it works, and when it fails.
What: Feed a task to an LLM and get an answer. When to use: Task knowledge is in the model's training data, and the answer depends only on reasoning or common knowledge. Examples: Summarizing text, answering trivia, drafting emails, classifying sentiment. Latency: One LLM call, ~100ms-1s. Cost: Low. Failure mode: Model doesn't know the facts and hallucinates.
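For the sentiment-classification example, the whole technique is prompt construction: assemble a few labeled examples, then the new input, and make one LLM call. A minimal sketch of the few-shot template (examples and wording invented; the LLM call itself is left out):

```python
FEW_SHOT = [
    ("The checkout flow is flawless.", "positive"),
    ("App crashes every time I open it.", "negative"),
]

def build_prompt(text: str) -> str:
    """Assemble a few-shot classification prompt: the model sees
    labeled examples first, then the new input to classify."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in FEW_SHOT:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {text}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("Support replied within minutes. Impressed!")
print(prompt.count("Review:"))  # 3: two few-shot examples plus the new input
```

The resulting string is sent as-is in a single call, which is where the ~100ms-1s latency figure comes from.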
What: Retrieve relevant documents from a database, inject them into the prompt, then let the LLM answer from that context. When to use: Task requires specific facts, policies, or domain data that aren't in the model's training data. Examples: Customer support, legal document Q&A, internal knowledge search, product documentation. Latency: Retrieval (~50-200ms) + LLM call (~100ms-1s). Cost: Moderate (vector DB lookups + LLM). Failure mode: Retriever doesn't find relevant documents, or documents don't answer the question.
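The retrieve-then-generate shape can be shown end to end with a toy bag-of-words retriever. Real systems use embedding models and a vector database; the documents, scoring, and prompt here are all made up for illustration:

```python
import math
import re
from collections import Counter

DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "The payment API uses OAuth 2.0 bearer tokens.",
    "Support hours are 9am to 5pm, Monday to Friday.",
]

def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by similarity to the query; return the top k."""
    q = vectorize(query)
    return sorted(DOCS, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]

context = retrieve("How do I authenticate with the payment API?")[0]
# The retrieved passage is injected into the prompt before the LLM call:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(context)
```

The failure mode listed above maps directly onto this sketch: if `retrieve` ranks the wrong document first, the LLM answers from irrelevant context no matter how good the prompt is.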
What: Give the LLM tools and let it decide when to use them. The agent loops: reason about task, call a tool, observe the result, repeat until done. When to use: Task requires multiple steps, external data that changes dynamically, or decisions based on intermediate results. Examples: Multi-step research, debugging code, booking travel, exploring a database. Latency: Multiple LLM calls (3-10x slower than single call). Cost: High (per-step LLM calls). Failure mode: Agent picks wrong tools, loops infinitely, or forgets context.
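The reason-act-observe loop can be sketched with a scripted policy standing in for the LLM and one fake tool. The tool name, the policy logic, and the step cap are all invented for this sketch:

```python
def search_docs(query: str) -> str:
    """Fake tool: pretend documentation search."""
    return "OAuth 2.0 with bearer tokens" if "auth" in query.lower() else "no results"

def finish(answer: str) -> str:
    """Terminal 'tool': the agent declares it is done."""
    return answer

TOOLS = {"search_docs": search_docs, "finish": finish}

def scripted_policy(history: list[str]) -> tuple[str, str]:
    """Stands in for the LLM's reasoning step: pick a tool and an
    argument based on what has been observed so far."""
    if not any("OAuth" in h for h in history):
        return "search_docs", "payment API authentication"
    return "finish", "Authenticate via OAuth 2.0 bearer tokens."

def run_agent(max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):               # hard cap guards against infinite loops
        tool, arg = scripted_policy(history)  # reason
        observation = TOOLS[tool](arg)        # act
        if tool == "finish":
            return observation
        history.append(observation)           # observe, then repeat
    return "gave up"

print(run_agent())  # Authenticate via OAuth 2.0 bearer tokens.
```

The `max_steps` cap and the `history` list correspond to two of the failure modes above: without the cap the loop can run forever, and without the history the agent forgets what it has already observed.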
What: Train an LLM on examples of correct behavior for your specific task, then use the tuned model for inference. When to use: Prompting gives 60-80% accuracy and you need 90%+. Task has specific output format, tone, or reasoning pattern. Examples: Specialized classification, domain-specific content generation, output parsing. Latency: Same as prompting (~100ms-1s). Cost: High upfront (training), then low per-call. Failure mode: Not enough training data, model overfits to examples, training teaches wrong patterns.
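Managed fine-tuning services typically accept training examples as JSONL in a chat format. A sketch of preparing such a file, following the shape of OpenAI's chat fine-tuning records (the ticket texts and labels are invented):

```python
import json

LABELED = [
    ("Invoice #4411 is 30 days overdue.", "billing"),
    ("The app logs me out randomly.", "bug"),
]

SYSTEM = "Classify the support ticket as billing, bug, or other."

def to_chat_record(text: str, label: str) -> dict:
    """One training example: system instruction, user input, desired output."""
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": text},
        {"role": "assistant", "content": label},
    ]}

with open("train.jsonl", "w") as f:
    for text, label in LABELED:
        f.write(json.dumps(to_chat_record(text, label)) + "\n")

print(sum(1 for _ in open("train.jsonl")))  # 2 training records
```

Two records is of course far below a useful training set; the point of the sketch is the record shape, and that the assistant turn holds exactly the output you want the tuned model to produce.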
What: Clean, validate, label, and structure your data so that prompting, RAG, and fine-tuning work better. When to use: Raw data is messy, incomplete, or inconsistent. You need reliable ground truth for evaluation or training. Examples: Cleaning customer records for fine-tuning, labeling examples for evaluation, extracting structured fields from unstructured text. Latency: Offline work, not part of inference. Cost: High (manual labeling or automation). Failure mode: Labels are wrong or biased, affecting all downstream tasks.
Every GenAI technique can be built from scratch or bought as a managed service. Know when each makes sense.
- Prompting: Always build. Prompting is free and requires no infrastructure.
- RAG: Build if you have full control of documents and queries. Buy if you need commercial-grade retrieval, enterprise scaling, or managed updates.
- Agents: Build simple agents in-house using frameworks like LangGraph. Buy if you need multi-agent orchestration, human-in-the-loop, or production monitoring.
- Fine-tuning: Build if you have compute budget and 100+ labeled examples. Buy managed tuning (OpenAI, Anthropic) if you want hands-off training.
- Data engineering: Always build this in-house; it's domain-specific and iterative.
Managed services (LlamaIndex, Pinecone, LangChain+Platform, Anthropic Workbench) reduce operational overhead. They handle infrastructure, scaling, monitoring, and updates. Cost per request is higher, but total cost is often lower because you don't pay for engineering time and ops burden.
| Approach | Best For | Pros | Cons |
|---|---|---|---|
| Managed API (OpenAI, Anthropic) | Prototyping, variable load | Zero infra, latest models, fast to ship | Per-token cost at scale, data leaves your boundary |
| Open model + cloud GPU | Cost-sensitive, medium volume | Predictable cost, data stays in VPC | Ops overhead, GPU reservation needed |
| Fine-tuned model | Domain-specific tasks | Best task performance, smaller model possible | Training cost, eval overhead, drift risk |
| Framework (LangChain) | Complex pipelines, RAG, agents | Batteries included, large community | Abstraction leaks, harder to debug |
| Custom pipeline | High performance, unusual needs | Full control, minimal overhead | Highest build cost, must build retries, observability |
Frameworks wrap the five blocks into higher-level abstractions. They handle common patterns, reduce boilerplate, and add features like memory, monitoring, and multi-step pipelines.
LangGraph: State machines for agents. Explicit control flow, human-in-the-loop, streaming. Better for production agents than basic LangChain.
LlamaIndex: Purpose-built for RAG. Data loaders, indexing, retrieval, evaluation. Stronger than LangChain for production RAG systems.
DSPy: Composable LLM pipelines with automatic optimization. Define what you want, and DSPy optimizes prompts and fine-tunes models.
Here's how a real application combines multiple blocks. This is a code documentation Q&A system: the user asks questions about code, the system retrieves relevant files (RAG), then uses an agent to search docs and run code analysis, and finally answers with citations.

1. User asks: "How do I authenticate with the payment API?"
2. RAG kicks in: retrieves payment module files.
3. Agent reasons: "I need more details about authentication patterns."
4. Agent acts: calls search_docs for "OAuth".
5. Prompting concludes: the LLM synthesizes an answer from context and tool results.
6. User gets: an answer with file citations.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
# Global settings
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.0)
Settings.chunk_size = 512
Settings.chunk_overlap = 64
# 1. Load documents
documents = SimpleDirectoryReader("./docs").load_data()
# 2. Parse into chunks. Pass sizes explicitly: a freshly constructed
#    SentenceSplitter uses its own defaults and does not pick up
#    Settings.chunk_size (which only configures the global node parser).
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)
print(f"Parsed {len(documents)} docs into {len(nodes)} chunks")
# 3. Build vector index (embeddings computed here)
index = VectorStoreIndex(nodes, show_progress=True)
# 4. Query
engine = index.as_query_engine(similarity_top_k=5)
response = engine.query(
"What are the key trade-offs between RAG and fine-tuning?"
)
print(response)
# 5. Persist index for reuse (avoids re-embedding on restart)
index.storage_context.persist("./index_store")
# Reload later:
# from llama_index.core import StorageContext, load_index_from_storage
# storage_ctx = StorageContext.from_defaults(persist_dir="./index_store")
# index = load_index_from_storage(storage_ctx)
Each technique deserves its own deep dive. Start with whichever block you're weakest on, then explore them in order of your application's complexity.
Prompt engineering, few-shot learning, chain-of-thought reasoning, and how to get the best results from an LLM without external data.
Agent architecture: tool calling, ReAct loops, planning, memory, and frameworks for multi-step autonomous behavior.
Model adaptation: training procedures, data labeling, hyperparameter tuning, and when fine-tuning beats prompting.
Data infrastructure: cleaning, labeling, versioning, and pipelines that make every other block work reliably.
Before shipping a GenAI application, ensure you've covered these across all five blocks.