Design Decisions

RAG vs Fine-tuning

The most important architectural decision in GenAI: when to use retrieval-augmented generation vs fine-tuning vs prompting alone. A decision framework covering knowledge freshness, data availability, latency, cost, and task type.

At a glance: 3 options (RAG / fine-tuning / prompting). Knowledge freshness is the key RAG advantage; task adaptation is the key fine-tuning advantage.

SECTION 01

The decision framework

Choosing between RAG, fine-tuning, and prompting is the single most impactful architectural decision in a GenAI system. Many teams default to fine-tuning because it feels more "serious" or technical, but RAG is often faster, cheaper, and more maintainable. The right choice depends on the problem dimensions:

| Dimension               | RAG wins                        | Fine-tuning wins        | Prompting wins            |
|-------------------------|---------------------------------|-------------------------|---------------------------|
| Knowledge freshness     | ✓ Real-time updates             | ✗ Requires retraining   | ✗ Static                  |
| Knowledge volume        | ✓ Unlimited (external store)    | ✗ Limited by weights    | ✗ Limited by context      |
| Style/format adaptation | ✗ Hard                          | ✓ Strong                | ✓ Moderate (via examples) |
| Task specialisation     | ✗ Weak                          | ✓ Strong                | ✓ Moderate                |
| Data privacy            | ✓ Data stays in retrieval store | ✓ Data baked in weights | ✗ Data in every prompt    |
| Time to deploy          | ✓ Days                          | ✗ Weeks–months          | ✓ Hours                   |

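The table above can be encoded as data and queried programmatically. A minimal sketch, where the 1/0 scores simply mirror the ✓/✗ marks in the table (ties resolve to the earlier column):

```python
# Decision table: 1 = approach wins on this dimension, 0 = it loses
DECISION_TABLE = {
    #  dimension:               (RAG, fine-tuning, prompting)
    "knowledge_freshness":      (1, 0, 0),
    "knowledge_volume":         (1, 0, 0),
    "style_format_adaptation":  (0, 1, 1),
    "task_specialisation":      (0, 1, 1),
    "data_privacy":             (1, 1, 0),
    "time_to_deploy":           (1, 0, 1),
}

def recommend(requirements: list[str]) -> str:
    """Sum wins across the dimensions that matter for this use case."""
    approaches = ("RAG", "fine-tuning", "prompting")
    totals = [0, 0, 0]
    for dim in requirements:
        for i, score in enumerate(DECISION_TABLE[dim]):
            totals[i] += score
    return approaches[totals.index(max(totals))]

print(recommend(["knowledge_freshness", "knowledge_volume"]))  # RAG
```

This is a toy scoring model, not a substitute for the judgement in the sections below, but it makes the table's trade-offs explicit.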
SECTION 02

When to choose RAG

RAG is the right choice when the core requirement is knowledge β€” accessing facts that the base model doesn't know or that change over time.

The killer use case for RAG is any system that must cite its sources: RAG can show users exactly which document chunk informed the answer, something a fine-tuned model cannot do. Other strong fits are knowledge that changes over time and corpora far too large to fit in a context window.
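The citation mechanism is simple to sketch: keep each retrieved chunk's source metadata alongside its text and number the chunks in the prompt, so the model can reference them and the UI can resolve the references. A minimal illustration (the chunk contents and file names here are hypothetical):

```python
def build_cited_context(chunks: list[dict]) -> tuple[str, dict[int, str]]:
    """Number each retrieved chunk and keep a citation map back to its source."""
    lines, citations = [], {}
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk['text']}")
        citations[i] = chunk["source"]
    return "\n".join(lines), citations

# Hypothetical retrieval results
chunks = [
    {"text": "Refunds are processed within 5 business days.", "source": "billing_faq.md"},
    {"text": "Annual plans can be cancelled at any time.", "source": "terms.md"},
]
context, citations = build_cited_context(chunks)

# The system prompt can then instruct: "Cite sources as [n]", and the UI
# resolves [1] -> billing_faq.md via the citations map.
print(citations[1])  # billing_faq.md
```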

SECTION 03

When to choose fine-tuning

Fine-tuning is the right choice when the requirement is behaviour β€” how the model responds β€” rather than what it knows.

Best use cases for fine-tuning: enforcing a consistent tone or persona, producing a strict output schema, and raising reliability on a narrow, well-defined task (e.g. SQL generation) beyond what prompting achieves.
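Fine-tuning for behaviour starts with examples of the desired behaviour. A sketch of preparing training data in the JSONL chat format that OpenAI-style fine-tuning APIs accept (the example content is made up; a real dataset needs hundreds of examples):

```python
import json

# Each example demonstrates the target behaviour — tone, format,
# escalation rules — not facts the model should memorise.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme Support. Be concise."},
            {"role": "user", "content": "I was double-charged this month."},
            {"role": "assistant", "content": "Sorry about that. 1. Share your account number. 2. I'll check the charge."},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line must be valid JSON with a "messages" list
with open("train.jsonl") as f:
    for line in f:
        assert isinstance(json.loads(line)["messages"], list)
```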

SECTION 04

When prompting alone is sufficient

# Start here before considering RAG or fine-tuning
# A well-crafted system prompt solves many problems that seem to require more

system_prompt = (
    "You are a customer support agent for Acme Inc. "
    "You help users with billing, account management, and product questions.\n\n"
    "Communication style:\n"
    "- Be concise and direct. No filler phrases.\n"
    "- Always verify you understand the issue before suggesting solutions.\n"
    "- For billing questions, always ask for account number first.\n"
    "- Escalate to human agent for: fraud claims, legal threats, medical emergencies.\n\n"
    "Format: Use numbered steps for instructions. Plain prose for explanations."
)

# This handles: tone, format, escalation rules, communication style
# Without: training data, fine-tuning time, model hosting costs

# Add few-shot examples to handle edge cases:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I cancel my subscription?"},
    {"role": "assistant", "content": "I can help with that. To cancel...[example]"},
    {"role": "user", "content": actual_user_question},  # the incoming user query
]
SECTION 05

Combining RAG and fine-tuning

RAG and fine-tuning are complementary, not mutually exclusive. The pattern that works best for high-stakes production systems:

  1. Start with prompting: Define the task, format, and constraints in a well-crafted system prompt. Measure quality.
  2. Add RAG: Retrieve relevant knowledge from your domain corpus. Measure quality improvement.
  3. Fine-tune the reader: If the model still produces wrong formats or tone, fine-tune it on examples of correct RAG-augmented outputs. This teaches it to use retrieved context correctly.

The fine-tuned RAG model is the gold standard for production: it has domain knowledge (via retrieval), learned citation behaviour (via fine-tuning), and up-to-date facts (via fresh retrieval at query time).

# Hybrid: fine-tune for style/format, RAG for current facts
# Best of both worlds β€” use when style matters AND knowledge changes frequently

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def hybrid_query(user_question: str, model: str = "ft:gpt-4o-mini:acme::abc123") -> str:
    """
    model: your fine-tuned model (knows your format/tone/domain vocabulary)
    RAG:   retrieves fresh facts the fine-tune was never trained on
    """
    # Step 1: retrieve relevant context
    results = collection.query(query_texts=[user_question], n_results=3)
    context_chunks = results["documents"][0]
    context = "\n---\n".join(context_chunks)

    # Step 2: call fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are Acme Support. Use the provided context. "
                "If context doesn't answer, say so concisely."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"}
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Cost comparison rough guide (per 1000 queries, 500 tokens avg):
# Pure prompting (GPT-4o):    ~$2.50   fast to iterate, no infra
# RAG + GPT-4o:               ~$2.60   adds retrieval latency + vector DB cost
# Fine-tuned GPT-4o-mini:     ~$0.15   10-20x cheaper, needs 500+ examples
# Hybrid (FT-mini + RAG):     ~$0.18   best quality/cost for production
SECTION 06

Cost comparison

Rough cost estimates for building a company knowledge base assistant (10,000 documents):

RAG (medium scale): Document ingestion: ~$50 (OpenAI embeddings for 50M tokens). Infrastructure: $200–500/month (vector DB + API calls). Time to deploy: 1–2 weeks. No training data needed.

Fine-tuning (7B model): Training data collection: 4–8 weeks of annotation. Training: $500–5,000 per run on managed services. Hosting: $500–2,000/month (dedicated GPU). Time to deploy: 2–3 months. Fine-tuning needed for each model update.

Verdict: For most knowledge base use cases, RAG has 10–50Γ— lower initial cost and is deployable 10Γ— faster. Fine-tuning becomes cost-effective when: you have 100K+ annotated examples, you need to serve millions of requests/day (amortising hosting cost), or the quality improvement from fine-tuning is critical for your product.
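Using the midpoints of the ranges quoted above, a back-of-the-envelope 12-month comparison (all figures are this section's rough estimates, not vendor pricing, and the fine-tuning line excludes the 4–8 weeks of annotation labour):

```python
def twelve_month_cost(upfront: float, monthly: float) -> float:
    """Total cost of ownership over a 12-month horizon."""
    return upfront + 12 * monthly

# Midpoints of the ranges quoted above
rag = twelve_month_cost(upfront=50, monthly=350)      # ingestion + vector DB / API
ft = twelve_month_cost(upfront=2750, monthly=1250)    # one training run + GPU hosting

print(f"RAG:         ${rag:,.0f}")   # $4,250
print(f"Fine-tuning: ${ft:,.0f}")    # $17,750
print(f"Ratio:       {ft / rag:.1f}x")
```

Note that the gap in *initial* cost ($50 vs ~$2,750, before annotation labour) is much larger than the 12-month gap, which is why RAG is so much faster to get into production.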

SECTION 07

Gotchas

Fine-tuning doesn't add new knowledge reliably: A common misconception is that fine-tuning on your documents "teaches" the model your domain knowledge. In reality, fine-tuned models have limited ability to recall specific facts trained into their weights, and they hallucinate about details. Use RAG for facts; use fine-tuning for behaviour.

RAG doesn't fix model capability limits: If the base model can't reason well about a topic, adding retrieval won't fix it β€” the model will misinterpret the retrieved context. If reasoning quality is the issue, consider a more capable base model or fine-tuning on reasoning examples.

Hybrid is often the answer: Production teams that start with "RAG or fine-tuning?" often end up with both. Budget for this possibility when planning your roadmap.

SECTION 08

Decision Flowchart and Cost Model

The build-vs-train decision comes down to three questions: Does the information change frequently? Is the information too large to fit in context? Does the task require a different behaviour pattern, or just different knowledge?

If information changes weekly or more often, RAG is almost always right β€” re-indexing a vector store costs cents; re-training costs thousands of dollars. If the information is static and fits in a long context window (under 200K tokens), try prompting with the full document first β€” surprisingly often, this outperforms both RAG and fine-tuning for knowledge-intensive tasks. Fine-tuning earns its place when you need the model to adopt a consistent tone, follow a strict output schema, or perform a narrow task (e.g. SQL generation) at higher reliability than prompting achieves.

# Quick decision heuristic: score your use case
def choose_approach(
    data_updates_weekly: bool,
    data_size_tokens: int,
    needs_style_change: bool,
    budget_usd: float
) -> str:
    if data_updates_weekly:
        return "RAG β€” data changes too fast for fine-tuning"
    if data_size_tokens < 150_000 and not needs_style_change:
        return "Long-context prompting β€” cheapest and fastest to iterate"
    if needs_style_change and budget_usd > 500:
        return "Fine-tuning β€” consistent behaviour change justified"
    if data_size_tokens > 150_000:
        return "RAG β€” data too large for context window"
    return "RAG + prompting β€” default safe choice"

# Example
print(choose_approach(
    data_updates_weekly=True,
    data_size_tokens=5_000_000,
    needs_style_change=False,
    budget_usd=200
))  # β†’ RAG

Track the combined cost over a 12-month horizon: fine-tuning has high upfront cost but near-zero per-query cost; RAG has low upfront cost but non-zero retrieval + synthesis cost per query. For high-volume (1M+ queries/month), fine-tuning on a smaller model often beats RAG on total cost of ownership within 6 months.