The most important architectural decision in GenAI: when to use retrieval-augmented generation vs fine-tuning vs prompting alone. A decision framework covering knowledge freshness, data availability, latency, cost, and task type.
Choosing between RAG, fine-tuning, and prompting is the single most impactful architectural decision in a GenAI system. Many teams default to fine-tuning because it feels more "serious" or technical, but RAG is often faster, cheaper, and more maintainable. The right choice depends on the problem dimensions:
| Dimension | RAG wins | Fine-tuning wins | Prompting wins |
|---|---|---|---|
| Knowledge freshness | ✅ Real-time updates | ❌ Requires retraining | ❌ Static |
| Knowledge volume | ✅ Unlimited (external store) | ⚠️ Limited by weights | ❌ Limited by context |
| Style/format adaptation | ❌ Hard | ✅ Strong | ⚠️ Moderate (via examples) |
| Task specialisation | ❌ Weak | ✅ Strong | ⚠️ Moderate |
| Data privacy | ✅ Data stays in retrieval store | ❌ Data baked into weights | ⚠️ Data in every prompt |
| Time to deploy | ⚠️ Days | ❌ Weeks–months | ✅ Hours |
RAG is the right choice when the core requirement is knowledge: accessing facts that the base model doesn't know or that change over time.

Best use cases for RAG: internal knowledge base assistants, document Q&A over frequently updated corpora, and any system whose answers must track a changing source of truth.

The killer use case for RAG is any system that must cite its sources. RAG can show users exactly which document chunk informed the answer; fine-tuning cannot.
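As a sketch of what that looks like in practice (the chunk structure and field names here are illustrative, not any specific library's API), the retrieval step hands you provenance for free:

```python
# Sketch: a RAG answer carries the provenance of the chunks that informed it.
def cite_answer(answer: str, retrieved: list[dict]) -> str:
    """Append the distinct sources of the retrieved chunks to the answer."""
    sources = sorted({chunk["source"] for chunk in retrieved})
    return answer + "\n\nSources: " + ", ".join(sources)

retrieved = [
    {"text": "Refunds are processed within 5 business days.", "source": "billing_faq.md"},
    {"text": "Refunds require an order number.", "source": "refund_policy.md"},
]
print(cite_answer("Refunds take up to 5 business days.", retrieved))
# Ends with "Sources: billing_faq.md, refund_policy.md"
```

A fine-tuned model can only assert where a fact came from; a retrieval pipeline can prove it, because the chunk and its metadata pass through the request.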
Fine-tuning is the right choice when the requirement is behaviour, i.e. how the model responds, rather than what it knows.

Best use cases for fine-tuning: enforcing a consistent tone, adhering to a strict output schema, and narrow tasks (e.g. SQL generation) that need higher reliability than prompting achieves.
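For chat models fine-tuned through the OpenAI API, training data is a JSONL file with one conversation per line. The sketch below (example content is illustrative) highlights the key point: the assistant turns teach response behaviour, not new facts.

```python
import json

# Each training example is one conversation; the target is the assistant's
# behaviour (tone, structure), not new factual knowledge.
examples = [
    {"messages": [
        {"role": "system", "content": "You are Acme Support. Be concise."},
        {"role": "user", "content": "Why was I charged twice?"},
        {"role": "assistant", "content": (
            "1. Check whether one charge is still pending.\n"
            "2. Reply with your account number and I'll investigate."
        )},
    ]},
]

# OpenAI chat fine-tuning expects JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Hundreds of such examples showing the desired numbered-steps style will shift the model's default behaviour far more reliably than the same examples crammed into a prompt.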
Prompting alone covers more ground than most teams expect; start here before considering RAG or fine-tuning:

```python
# Start here before considering RAG or fine-tuning:
# a well-crafted system prompt solves many problems that seem to require more.
system_prompt = (
    "You are a customer support agent for Acme Inc. "
    "You help users with billing, account management, and product questions.\n\n"
    "Communication style:\n"
    "- Be concise and direct. No filler phrases.\n"
    "- Always verify you understand the issue before suggesting solutions.\n"
    "- For billing questions, always ask for the account number first.\n"
    "- Escalate to a human agent for: fraud claims, legal threats, medical emergencies.\n\n"
    "Format: use numbered steps for instructions, plain prose for explanations."
)
# This handles tone, format, escalation rules, and communication style
# without training data, fine-tuning time, or model hosting costs.

# Add few-shot examples to handle edge cases:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I cancel my subscription?"},
    {"role": "assistant", "content": "I can help with that. To cancel...[example]"},
    {"role": "user", "content": actual_user_question},
]
```
RAG and fine-tuning are complementary, not mutually exclusive. The pattern that works best for high-stakes production systems is the fine-tuned RAG model, the gold standard for production: it combines learned behaviour such as citation habits (via fine-tuning) with domain knowledge and up-to-date facts (via fresh retrieval at query time).
```python
# Hybrid: fine-tune for style/format, RAG for current facts.
# Best of both worlds; use when style matters AND knowledge changes frequently.
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_or_create_collection("knowledge_base")

def hybrid_query(user_question: str, model: str = "ft:gpt-4o-mini:acme::abc123") -> str:
    """
    model: your fine-tuned model (knows your format/tone/domain vocabulary).
    RAG retrieves the fresh facts the fine-tune was never trained on.
    """
    # Step 1: retrieve relevant context
    results = collection.query(query_texts=[user_question], n_results=3)
    context_chunks = results["documents"][0]
    context = "\n---\n".join(context_chunks)

    # Step 2: call the fine-tuned model with the retrieved context
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are Acme Support. Use the provided context. "
                "If the context doesn't answer the question, say so concisely."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Rough cost guide (per 1,000 queries, ~500 tokens avg):
#   Pure prompting (GPT-4o):   ~$2.50   fast to iterate, no infra
#   RAG + GPT-4o:              ~$2.60   adds retrieval latency + vector DB cost
#   Fine-tuned GPT-4o-mini:    ~$0.15   10-20x cheaper, needs 500+ examples
#   Hybrid (FT-mini + RAG):    ~$0.18   best quality/cost for production
```
Rough cost estimates for building a company knowledge base assistant (10,000 documents):

RAG (medium scale): document ingestion ~$50 (OpenAI embeddings for 50M tokens); infrastructure $200–500/month (vector DB + API calls); time to deploy 1–2 weeks; no training data needed.

Fine-tuning (7B model): training data collection 4–8 weeks of annotation; training $500–5,000 per run on managed services; hosting $500–2,000/month (dedicated GPU); time to deploy 2–3 months; re-tuning needed for each model update.

Verdict: for most knowledge base use cases, RAG has 10–50× lower initial cost and deploys roughly 10× faster. Fine-tuning becomes cost-effective when you have 100K+ annotated examples, you serve millions of requests/day (amortising the hosting cost), or the quality improvement from fine-tuning is critical to your product.
Fine-tuning doesn't add new knowledge reliably: a common misconception is that fine-tuning on your documents "teaches" the model your domain knowledge. In reality, fine-tuned models recall specific facts from their weights unreliably and hallucinate details. Use RAG for facts; use fine-tuning for behaviour.

RAG doesn't fix model capability limits: if the base model can't reason well about a topic, adding retrieval won't fix it; the model will misinterpret the retrieved context. If reasoning quality is the issue, consider a more capable base model or fine-tuning on reasoning examples.
Hybrid is often the answer: Production teams that start with "RAG or fine-tuning?" often end up with both. Budget for this possibility when planning your roadmap.
The build-vs-train decision comes down to three questions: Does the information change frequently? Is the information too large to fit in context? Does the task require a different behaviour pattern, or just different knowledge?
If information changes weekly or more often, RAG is almost always right: re-indexing a vector store costs cents, while retraining costs thousands of dollars. If the information is static and fits in a long context window (under 200K tokens), try prompting with the full document first; surprisingly often, this outperforms both RAG and fine-tuning for knowledge-intensive tasks. Fine-tuning earns its place when you need the model to adopt a consistent tone, follow a strict output schema, or perform a narrow task (e.g. SQL generation) more reliably than prompting achieves.
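A quick way to apply the "fits in context" test is a rough token estimate; the 4-characters-per-token ratio below is a common rule of thumb for English prose, not an exact count:

```python
def fits_in_context(document: str, limit_tokens: int = 200_000) -> bool:
    """Rough check: ~4 characters per token for English text."""
    estimated_tokens = len(document) / 4
    return estimated_tokens <= limit_tokens

# A 400K-character handbook is roughly 100K tokens: try long-context prompting
print(fits_in_context("x" * 400_000))  # True
```

For a real deployment, count tokens with the model's actual tokenizer before committing to this path.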
```python
# Quick decision heuristic: score your use case
def choose_approach(
    data_updates_weekly: bool,
    data_size_tokens: int,
    needs_style_change: bool,
    budget_usd: float,
) -> str:
    if data_updates_weekly:
        return "RAG: data changes too fast for fine-tuning"
    if data_size_tokens < 150_000 and not needs_style_change:
        return "Long-context prompting: cheapest and fastest to iterate"
    if needs_style_change and budget_usd > 500:
        return "Fine-tuning: consistent behaviour change justified"
    if data_size_tokens > 150_000:
        return "RAG: data too large for context window"
    return "RAG + prompting: default safe choice"

# Example
print(choose_approach(
    data_updates_weekly=True,
    data_size_tokens=5_000_000,
    needs_style_change=False,
    budget_usd=200,
))  # -> RAG: data changes too fast for fine-tuning
```
Track the combined cost over a 12-month horizon: fine-tuning has high upfront cost but near-zero per-query cost, while RAG has low upfront cost but non-zero retrieval and synthesis cost per query. For high-volume workloads (1M+ queries/month), fine-tuning a smaller model often beats RAG on total cost of ownership within 6 months.
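That trade-off is easy to model directly. The sketch below compares 12-month totals; every figure is an illustrative assumption, not a measurement, so substitute your own numbers:

```python
# Sketch: 12-month total cost of ownership (all figures are assumptions).
def tco(upfront: float, monthly_fixed: float, per_query: float,
        queries_per_month: int, months: int = 12) -> float:
    """Upfront cost plus recurring fixed and per-query costs over the horizon."""
    return upfront + months * (monthly_fixed + per_query * queries_per_month)

# RAG: low upfront, vector DB hosting, per-query retrieval + synthesis cost
rag = tco(upfront=500, monthly_fixed=300, per_query=0.0026,
          queries_per_month=1_000_000)
# Fine-tuned small model: high upfront (annotation + training), dedicated
# hosting, near-zero per-query cost
ft = tco(upfront=20_000, monthly_fixed=1_000, per_query=0.0002,
         queries_per_month=1_000_000)
print(f"RAG: ${rag:,.0f}  Fine-tune: ${ft:,.0f}")
```

At this hypothetical volume the fine-tuned model edges out RAG over the year; at a tenth of the traffic, RAG's low upfront cost wins comfortably, which is why the crossover point matters more than either headline number.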