The most important architectural decision in GenAI: when to use retrieval-augmented generation vs fine-tuning vs prompting alone. A decision framework covering knowledge freshness, data availability, latency, cost, and task type.
Choosing between RAG, fine-tuning, and prompting is the single most impactful architectural decision in a GenAI system. Many teams default to fine-tuning because it feels more "serious" or technical, but RAG is often faster, cheaper, and more maintainable. The right choice depends on the problem dimensions:
| Dimension | RAG wins | Fine-tuning wins | Prompting wins |
|---|---|---|---|
| Knowledge freshness | ✅ Real-time updates | ❌ Requires retraining | ❌ Static |
| Knowledge volume | ✅ Unlimited (external store) | ⚠️ Limited by weights | ❌ Limited by context |
| Style/format adaptation | ❌ Hard | ✅ Strong | ⚠️ Moderate (via examples) |
| Task specialisation | ❌ Weak | ✅ Strong | ⚠️ Moderate |
| Data privacy | ✅ Data stays in retrieval store | ❌ Data baked into weights | ⚠️ Data in every prompt |
| Time to deploy | ⚠️ Days | ❌ Weeks–months | ✅ Hours |
RAG is the right choice when the core requirement is knowledge: accessing facts that the base model doesn't know or that change over time.

Best use cases for RAG: internal knowledge base assistants, document Q&A over frequently updated corpora, and any system whose answers must track a changing source of truth.

The killer use case for RAG is any system that must cite its sources. RAG can show users exactly which document chunk informed the answer; fine-tuning cannot.
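As a sketch of what that looks like in practice (the chunk structure and field names here are illustrative, not any specific library's API), the retrieval step hands you provenance for free:

```python
# Sketch: a RAG answer carries the provenance of the chunks that informed it.
def cite_answer(answer: str, retrieved: list[dict]) -> str:
    """Append the distinct sources of the retrieved chunks to the answer."""
    sources = sorted({chunk["source"] for chunk in retrieved})
    return answer + "\n\nSources: " + ", ".join(sources)

retrieved = [
    {"text": "Refunds are processed within 5 business days.", "source": "billing_faq.md"},
    {"text": "Refunds require an order number.", "source": "refund_policy.md"},
]
print(cite_answer("Refunds take up to 5 business days.", retrieved))
# Ends with "Sources: billing_faq.md, refund_policy.md"
```

A fine-tuned model can only assert where a fact came from; a retrieval pipeline can prove it, because the chunk and its metadata pass through the request.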
Fine-tuning is the right choice when the requirement is behaviour, i.e. how the model responds, rather than what it knows.

Best use cases for fine-tuning: enforcing a consistent tone, adhering to a strict output schema, and narrow tasks (e.g. SQL generation) that need higher reliability than prompting achieves.
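For chat models fine-tuned through the OpenAI API, training data is a JSONL file with one conversation per line. The sketch below (example content is illustrative) highlights the key point: the assistant turns teach response behaviour, not new facts.

```python
import json

# Each training example is one conversation; the target is the assistant's
# behaviour (tone, structure), not new factual knowledge.
examples = [
    {"messages": [
        {"role": "system", "content": "You are Acme Support. Be concise."},
        {"role": "user", "content": "Why was I charged twice?"},
        {"role": "assistant", "content": (
            "1. Check whether one charge is still pending.\n"
            "2. Reply with your account number and I'll investigate."
        )},
    ]},
]

# OpenAI chat fine-tuning expects JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Hundreds of such examples showing the desired numbered-steps style will shift the model's default behaviour far more reliably than the same examples crammed into a prompt.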
Prompting alone covers more ground than most teams expect; start here before considering RAG or fine-tuning:

```python
# Start here before considering RAG or fine-tuning:
# a well-crafted system prompt solves many problems that seem to require more.
system_prompt = (
    "You are a customer support agent for Acme Inc. "
    "You help users with billing, account management, and product questions.\n\n"
    "Communication style:\n"
    "- Be concise and direct. No filler phrases.\n"
    "- Always verify you understand the issue before suggesting solutions.\n"
    "- For billing questions, always ask for the account number first.\n"
    "- Escalate to a human agent for: fraud claims, legal threats, medical emergencies.\n\n"
    "Format: use numbered steps for instructions, plain prose for explanations."
)
# This handles tone, format, escalation rules, and communication style
# without training data, fine-tuning time, or model hosting costs.

# Add few-shot examples to handle edge cases:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I cancel my subscription?"},
    {"role": "assistant", "content": "I can help with that. To cancel...[example]"},
    {"role": "user", "content": actual_user_question},
]
```
RAG and fine-tuning are complementary, not mutually exclusive. The pattern that works best for high-stakes production systems is the fine-tuned RAG model, the gold standard for production: it combines learned behaviour such as citation habits (via fine-tuning) with domain knowledge and up-to-date facts (via fresh retrieval at query time).
```python
# Hybrid: fine-tune for style/format, RAG for current facts.
# Best of both worlds; use when style matters AND knowledge changes frequently.
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def hybrid_query(user_question: str, model: str = "ft:gpt-4o-mini:acme::abc123") -> str:
    """
    model: your fine-tuned model (knows your format/tone/domain vocabulary).
    RAG retrieves the fresh facts the fine-tune was never trained on.
    """
    # Step 1: retrieve relevant context
    results = collection.query(query_texts=[user_question], n_results=3)
    context_chunks = results["documents"][0]
    context = "\n---\n".join(context_chunks)

    # Step 2: call the fine-tuned model with the retrieved context
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are Acme Support. Use the provided context. "
                "If the context doesn't answer the question, say so concisely."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Rough cost guide (per 1,000 queries, ~500 tokens avg):
#   Pure prompting (GPT-4o):   ~$2.50   fast to iterate, no infra
#   RAG + GPT-4o:              ~$2.60   adds retrieval latency + vector DB cost
#   Fine-tuned GPT-4o-mini:    ~$0.15   10-20x cheaper, needs 500+ examples
#   Hybrid (FT-mini + RAG):    ~$0.18   best quality/cost for production
```
Rough cost estimates for building a company knowledge base assistant (10,000 documents):

RAG (medium scale): document ingestion ~$50 (OpenAI embeddings for 50M tokens); infrastructure $200–500/month (vector DB + API calls); time to deploy 1–2 weeks; no training data needed.

Fine-tuning (7B model): training data collection 4–8 weeks of annotation; training $500–5,000 per run on managed services; hosting $500–2,000/month (dedicated GPU); time to deploy 2–3 months; re-tuning needed for each model update.

Verdict: for most knowledge base use cases, RAG has 10–50× lower initial cost and deploys roughly 10× faster. Fine-tuning becomes cost-effective when you have 100K+ annotated examples, you serve millions of requests/day (amortising the hosting cost), or the quality improvement from fine-tuning is critical to your product.
Fine-tuning doesn't add new knowledge reliably: a common misconception is that fine-tuning on your documents "teaches" the model your domain knowledge. In reality, fine-tuned models recall specific facts from their weights unreliably and hallucinate details. Use RAG for facts; use fine-tuning for behaviour.

RAG doesn't fix model capability limits: if the base model can't reason well about a topic, adding retrieval won't fix it; the model will misinterpret the retrieved context. If reasoning quality is the issue, consider a more capable base model or fine-tuning on reasoning examples.
Hybrid is often the answer: Production teams that start with "RAG or fine-tuning?" often end up with both. Budget for this possibility when planning your roadmap.
The build-vs-train decision comes down to three questions: Does the information change frequently? Is the information too large to fit in context? Does the task require a different behaviour pattern, or just different knowledge?
If information changes weekly or more often, RAG is almost always right: re-indexing a vector store costs cents, while retraining costs thousands of dollars. If the information is static and fits in a long context window (under 200K tokens), try prompting with the full document first; surprisingly often, this outperforms both RAG and fine-tuning for knowledge-intensive tasks. Fine-tuning earns its place when you need the model to adopt a consistent tone, follow a strict output schema, or perform a narrow task (e.g. SQL generation) more reliably than prompting achieves.
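A quick way to apply the "fits in context" test is a rough token estimate; the 4-characters-per-token ratio below is a common rule of thumb for English prose, not an exact count:

```python
def fits_in_context(document: str, limit_tokens: int = 200_000) -> bool:
    """Rough check: ~4 characters per token for English text."""
    estimated_tokens = len(document) / 4
    return estimated_tokens <= limit_tokens

# A 400K-character handbook is roughly 100K tokens: try long-context prompting
print(fits_in_context("x" * 400_000))  # True
```

For a real deployment, count tokens with the model's actual tokenizer before committing to this path.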
```python
# Quick decision heuristic: score your use case
def choose_approach(
    data_updates_weekly: bool,
    data_size_tokens: int,
    needs_style_change: bool,
    budget_usd: float,
) -> str:
    if data_updates_weekly:
        return "RAG: data changes too fast for fine-tuning"
    if data_size_tokens < 150_000 and not needs_style_change:
        return "Long-context prompting: cheapest and fastest to iterate"
    if needs_style_change and budget_usd > 500:
        return "Fine-tuning: consistent behaviour change justified"
    if data_size_tokens > 150_000:
        return "RAG: data too large for context window"
    return "RAG + prompting: default safe choice"

# Example
print(choose_approach(
    data_updates_weekly=True,
    data_size_tokens=5_000_000,
    needs_style_change=False,
    budget_usd=200,
))  # -> RAG: data changes too fast for fine-tuning
```
Track the combined cost over a 12-month horizon: fine-tuning has high upfront cost but near-zero per-query cost, while RAG has low upfront cost but non-zero retrieval and synthesis cost per query. For high-volume workloads (1M+ queries/month), fine-tuning a smaller model often beats RAG on total cost of ownership within 6 months.
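That trade-off is easy to model directly. The sketch below compares 12-month totals; every figure is an illustrative assumption, not a measurement, so substitute your own numbers:

```python
# Sketch: 12-month total cost of ownership (all figures are assumptions).
def tco(upfront: float, monthly_fixed: float, per_query: float,
        queries_per_month: int, months: int = 12) -> float:
    """Upfront cost plus recurring fixed and per-query costs over the horizon."""
    return upfront + months * (monthly_fixed + per_query * queries_per_month)

# RAG: low upfront, vector DB hosting, per-query retrieval + synthesis cost
rag = tco(upfront=500, monthly_fixed=300, per_query=0.0026,
          queries_per_month=1_000_000)
# Fine-tuned small model: high upfront (annotation + training), dedicated
# hosting, near-zero per-query cost
ft = tco(upfront=20_000, monthly_fixed=1_000, per_query=0.0002,
         queries_per_month=1_000_000)
print(f"RAG: ${rag:,.0f}  Fine-tune: ${ft:,.0f}")
```

At this hypothetical volume the fine-tuned model edges out RAG over the year; at a tenth of the traffic, RAG's low upfront cost wins comfortably, which is why the crossover point matters more than either headline number.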