01 — Foundation
How LLMs Are Built
Large Language Models are not trained end-to-end for your specific task. Instead, they follow a three-stage pipeline: pretraining (unsupervised learning on vast text), supervised fine-tuning (SFT, instruction-following), and reinforcement learning from human feedback (RLHF, preference alignment).
Stage 1: Pretraining
Models are trained on hundreds of billions of tokens of diverse internet text using next-token prediction. This is self-supervised learning: no human labels are needed, because each next token in the text serves as its own training target. The goal is to absorb world knowledge, reasoning patterns, and language structure. Pretraining is expensive (millions of dollars per run) and done by only a few organizations.
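A toy illustration of the objective, using a bigram count model in place of a transformer (the corpus and model here are illustrative, but the loss is the same cross-entropy that pretraining minimizes at scale):

```python
# Next-token prediction in miniature: a bigram "model" built from
# counts, scored with the cross-entropy objective used in pretraining.
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each context token
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev: str, nxt: str) -> float:
    """Probability of `nxt` given the previous token."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

# Average negative log-likelihood of the corpus under the model:
# this is the number pretraining drives down, just at trillion-token scale
nll = -sum(math.log(p_next(p, n)) for p, n in zip(corpus, corpus[1:]))
print(f"avg next-token loss: {nll / (len(corpus) - 1):.3f} nats")
```

A real LLM replaces the count table with a transformer conditioned on the full context, but the training signal is identical: predict the next token, pay cross-entropy for being wrong.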
Stage 2: Supervised Fine-Tuning (SFT)
After pretraining, a model is a text completer, not an assistant: it tends to continue a prompt rather than follow the instruction it contains. SFT teaches instruction-following by fine-tuning on thousands of (prompt, response) pairs in which each response is an ideal answer: helpful, harmless, honest. SFT anchors the model to human intent.
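A minimal sketch of how one (prompt, response) pair becomes a training example (token IDs here are hypothetical; real pipelines use the model's chat template). The key detail is that loss is usually computed only on the response tokens:

```python
# SFT example construction: concatenate prompt and response, then mask
# the prompt positions so the model is only graded on its response.
IGNORE = -100  # conventional "ignore this position" label in cross-entropy

def build_sft_example(prompt_ids: list[int], response_ids: list[int]):
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE; response positions keep their token IDs
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids, labels

inp, labels = build_sft_example([101, 7592, 102], [2023, 2003, 103])
print(inp)     # [101, 7592, 102, 2023, 2003, 103] - full sequence
print(labels)  # [-100, -100, -100, 2023, 2003, 103] - loss on response only
```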
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
SFT still leaves a gap: the model learns to imitate good answers, but not which of two plausible answers humans prefer. RLHF fixes this. Humans rank pairs of model outputs, a reward model learns to predict those rankings, and the language model is then fine-tuned with RL to maximize the learned reward, optimizing for human preference rather than bare instruction-following.
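The reward model's pairwise objective can be sketched in a few lines (this is the Bradley-Terry-style loss commonly used; the reward values are illustrative):

```python
# Pairwise reward-model loss: given scalar rewards for a human-preferred
# ("chosen") and dispreferred ("rejected") response, minimize
# -log sigmoid(r_chosen - r_rejected).
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Loss shrinks as the reward model separates the pair correctly
print(pairwise_loss(2.0, 0.5))  # small loss: ranking already correct
print(pairwise_loss(0.5, 2.0))  # large loss: ranking inverted
```

Training on many such pairs turns raw human rankings into a scalar reward that the RL step can maximize.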
| Stage | Input | Goal | Cost | Typical data size |
|---|---|---|---|---|
| Pretraining | Raw internet text | Next-token prediction | Very high | Hundreds of billions of tokens |
| SFT | Curated (prompt, response) pairs | Instruction-following | Medium | 10k–100k examples |
| RLHF | Human preference rankings | Align to human preferences | Medium | 50k–200k examples |
💡
Why pretraining matters: a model trained only on internet text completes text rather than answering questions; SFT and RLHF are what turn that raw transformer into an assistant. But the capability those stages shape comes from pretraining at scale, which is inaccessible to most teams.
02 — Commercial
Frontier Commercial Models
Frontier models are state-of-the-art large language models from OpenAI, Anthropic, Google, and other large labs. They define the current capability ceiling. These models are expensive to run and available only via API, but they offer the best quality, reasoning, and multimodal capabilities.
The Frontier Lineup
1
GPT-4o — OpenAI's multimodal flagship
Fast, capable, sees images and text. Excellent for reasoning, code generation, and customer-facing applications.
- Training: Multimodal (text + vision)
- Context window: 128k tokens
- Strengths: Reasoning, code, math, structured output
- Cost: ~$0.005 per 1k input tokens; ~$0.015 per 1k output tokens
- API: https://platform.openai.com/docs/models/gpt-4o
2
Claude 3.5 Sonnet — Anthropic's balanced choice
Long context window (200k), excellent instruction-following, strong at nuance and safety. Best for document understanding and RAG systems.
- Training: Text-based, constitutional AI alignment
- Context window: 200k tokens (can fit entire books)
- Strengths: Long context, instruction-following, nuance
- Cost: ~$0.003 per 1k input tokens; ~$0.015 per 1k output tokens
- API: https://www.anthropic.com/claude
3
Gemini 1.5 Pro — Google's multimodal powerhouse
Sees video, audio, images, and text. Massive 2M token context. Excellent for multimodal understanding and search integration.
- Training: Multimodal (text, image, video, audio)
- Context window: 2M tokens (10x most competitors)
- Strengths: Multimodal, long context, search integration
- Cost: ~$0.00375 per 1k input tokens; ~$0.015 per 1k output tokens
- API: https://ai.google.dev/docs/gemini_api
4
o3 — OpenAI's reasoning specialist
Optimized for complex reasoning with test-time compute. Best-in-class on AIME, Codeforces, research tasks. Not for real-time applications.
- Training: Chain-of-thought reasoning optimization
- Context window: 128k tokens
- Strengths: Reasoning, STEM, competitive programming
- Cost: Premium pricing for thinking time (negotiated)
- API: Limited availability, research access only
⚠️
Model selection tip: GPT-4o for reasoning, Claude for long context, Gemini for multimodal. Cost roughly tracks capability: o3 > GPT-4o > Claude 3.5 Sonnet > open-source. Always benchmark on your own task; the state of the art shifts monthly.
03 — Open Source
Open-Source Models
Open-source LLMs can be downloaded and run locally or self-hosted. They offer privacy, no per-token costs (only compute), and full customization via fine-tuning. The tradeoff: lower capability, more engineering work, higher infrastructure costs.
Top Open-Source Models
1
Llama 3.1 — Meta's efficient powerhouse
8B and 70B variants. Fast, capable, widely deployed. Best open-source base model for fine-tuning.
- Sizes: 8B (fast), 70B (quality), 405B (frontier-class)
- Context: 128k tokens
- Strengths: Speed, code, instruction-following
- Use: Self-host, fine-tune, or via API (Together AI, Groq)
- Cost: ~$0.0002–0.001 per 1k tokens on inference APIs
2
Mixtral 8x22B — Mixture-of-Experts efficiency
MoE architecture: each token activates only 2 of the 8 experts, so inference touches a fraction of the total parameters. Fast inference, excellent quality per unit of compute.
- Architecture: Sparse mixture-of-experts (MoE)
- Context: 64k tokens
- Strengths: Inference speed, cost-efficiency
- Use: Self-host or via API (Mistral, Together AI)
- Cost: Lower than dense models for same quality
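The sparse-routing idea can be sketched in a few lines (dimensions and random weights here are illustrative, not the model's actual architecture):

```python
# Toy sparse MoE layer: a gate scores all 8 experts per token, but only
# the top-2 actually run, so per-token compute stays far below a dense
# layer holding the same total parameters.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model = 8, 16
gate_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                 # gate logits, one per expert
    top2 = np.argsort(scores)[-2:]      # indices of the 2 winning experts
    weights = np.exp(scores[top2])
    weights /= weights.sum()            # softmax over the top-2 only
    # Weighted sum of just the two selected experts' outputs
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top2))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # same output shape as a dense layer, ~1/4 the compute
```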
3
Qwen 2.5 — Alibaba's multilingual competitor
Bilingual (Chinese + English), excellent code understanding, strong on knowledge tasks. Popular in Asia.
- Sizes: 1.5B (tiny), 7B, 14B, 32B, 72B
- Context: 128k tokens (32B+)
- Strengths: Multilingual, code, knowledge
- Use: Self-host or via API (Alibaba Cloud, Together AI)
- Cost: Comparable to Llama
4
Phi-3 — Microsoft's distilled powerhouse
Tiny (3.8B–14B) but surprisingly capable. Optimized for mobile and edge devices.
- Sizes: 3.8B (mobile), 7B, 14B
- Context: 128k tokens
- Strengths: Efficiency, mobile deployment, reasoning
- Use: Edge devices, local-first applications
- Cost: Very low inference cost
✓
When to use open-source: privacy-critical data, high-volume inference where per-token costs dominate, a need to fine-tune on your own data, or strict compliance and data-residency requirements. When not to: you need frontier-level reasoning, or you lack the ML infrastructure to operate models and meet latency SLAs yourself.
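The cost argument is easy to sanity-check with back-of-the-envelope numbers (the figures below are hypothetical: an H100 at ~$2.50/hr serving ~1,500 tokens/s vs. an API at ~$0.0005 per 1k tokens; substitute your own):

```python
# Break-even sketch: self-hosted GPU cost per 1k tokens vs. an API rate.
gpu_cost_per_hr = 2.50      # assumed H100 rental price
gpu_tokens_per_s = 1500     # assumed sustained throughput, 8B-class model
api_cost_per_1k = 0.0005    # assumed inference-API price per 1k tokens

tokens_per_hr = gpu_tokens_per_s * 3600
gpu_cost_per_1k = gpu_cost_per_hr / (tokens_per_hr / 1000)

print(f"self-host: ${gpu_cost_per_1k:.6f}/1k tok  vs  API: ${api_cost_per_1k}/1k tok")
# Note: the self-host figure assumes the GPU is fully utilized; at low
# utilization the effective per-token cost rises in proportion to idle time.
```

Under these assumptions self-hosting edges out the API, but only at near-full utilization; that sensitivity to volume is exactly why open-source wins for high-volume workloads and loses for sporadic ones.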
04 — Selection
Choosing the Right Model
Model selection is a multidimensional optimization problem: capability (can it solve your task?), cost (how much per inference?), latency (how fast?), context window (can it fit your documents?), and multimodality (does it need to see images?). No single model dominates all dimensions.
Decision Framework
🚀 Capability
- Benchmark on your task (QA, summarization, code, etc.)
- Test reasoning: frontier > open-source
- Use evals: MMLU, HumanEval, HELM
💰 Cost
- API: ~$0.001–0.01 per 1k output tokens
- Self-host: GPU compute (H100 ~$2–3/hr)
- Volume discounts: negotiate for high volume
⚡ Latency
- API latency: 500ms–2s p95 typical
- Self-hosted: 50–200ms (H100, 8B model)
- Real-time: sub-100ms targets usually require a small or distilled model
📏 Context Window
- Document Q&A: 200k+ (Claude, Gemini)
- Chat: 4k–32k sufficient
- Code repository: 128k+ (Llama 3.1, Qwen)
```python
# Benchmarking LLMs: latency, cost, and quality in one script
# pip install openai anthropic
import time

import anthropic
import openai

def benchmark_llm(prompt: str, model_configs: list) -> list:
    results = []
    for cfg in model_configs:
        start = time.perf_counter()
        if cfg["provider"] == "openai":
            client = openai.OpenAI()
            resp = client.chat.completions.create(
                model=cfg["model"],
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            text = resp.choices[0].message.content
            in_tok, out_tok = resp.usage.prompt_tokens, resp.usage.completion_tokens
        elif cfg["provider"] == "anthropic":
            client = anthropic.Anthropic()
            resp = client.messages.create(
                model=cfg["model"],
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            text = resp.content[0].text
            in_tok, out_tok = resp.usage.input_tokens, resp.usage.output_tokens
        else:
            raise ValueError(f"unknown provider: {cfg['provider']}")
        latency = time.perf_counter() - start
        cost = in_tok / 1e6 * cfg["in_$/1m"] + out_tok / 1e6 * cfg["out_$/1m"]
        results.append({"model": cfg["model"], "latency_s": round(latency, 2),
                        "cost_$": round(cost, 5), "tokens": out_tok, "output": text[:80]})
    return results

configs = [
    {"provider": "openai", "model": "gpt-4o-mini", "in_$/1m": 0.15, "out_$/1m": 0.60},
    {"provider": "openai", "model": "gpt-4o", "in_$/1m": 2.50, "out_$/1m": 10.0},
    {"provider": "anthropic", "model": "claude-haiku-4-5-20251001", "in_$/1m": 0.25, "out_$/1m": 1.25},
]

rows = benchmark_llm("Explain the transformer attention mechanism in 3 sentences.", configs)
for r in rows:
    print(f"{r['model']:30s} {r['latency_s']:.2f}s  ${r['cost_$']:.5f}  {r['tokens']} tok")
```
Quick Decision Tree
Need reasoning (math, code)? → GPT-4o or o3. Long documents (RAG)? → Claude 3.5 Sonnet. Multimodal (images, video)? → Gemini 1.5 Pro or GPT-4o. Cost-sensitive, high volume? → Llama 3.1 70B (self-hosted) or Mixtral 8x22B. Mobile / edge? → Phi-3. Bilingual (Chinese/English)? → Qwen 2.5.
| Use case | Best model | Runner-up | Approximate cost |
|---|---|---|---|
| Complex reasoning | GPT-4o / o3 | Claude 3.5 Sonnet | $0.01–0.05 per query |
| Long document Q&A | Claude 3.5 Sonnet | Gemini 1.5 Pro | $0.01–0.02 per query |
| Real-time chat | Llama 8B or Phi-3 | Mixtral 8x22B | ~$0.0001 per query (self-hosted) |
| Multimodal tasks | Gemini 1.5 Pro | GPT-4o | $0.01–0.02 per query |
| Cost-sensitive | Llama 3.1 70B | Qwen 2.5 32B | ~$0.0002 per 1k tokens |
05 — Code
Working Code Example
Here's how to call multiple LLM providers with a unified interface. This example uses Anthropic Claude, OpenAI GPT-4o, and LiteLLM (a wrapper library that unifies all APIs).
```python
import anthropic
import openai

# Anthropic Claude
anthropic_client = anthropic.Anthropic()
claude_resp = anthropic_client.messages.create(
    model="claude-opus-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("Claude:", claude_resp.content[0].text)

# OpenAI GPT
openai_client = openai.OpenAI()
gpt_resp = openai_client.chat.completions.create(
    model="gpt-4o",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("GPT-4o:", gpt_resp.choices[0].message.content)

# LiteLLM for a unified interface across providers
from litellm import completion

resp = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
    max_tokens=256,
)
print("Gemini:", resp.choices[0].message.content)
```
Setup
Install dependencies: `pip install anthropic openai litellm google-generativeai`. Set environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`.
💡
LiteLLM benefit: Swap models without changing code. Use it for A/B testing different models, automatic fallback on API failure, and unified logging across providers.
```python
# LLM routing: send queries to the cheapest capable model
# pip install openai
import openai

client = openai.OpenAI()

COMPLEXITY_CLASSIFIER = """Classify this user query as SIMPLE or COMPLEX.
SIMPLE: factual lookups, short answers, basic rewrites.
COMPLEX: multi-step reasoning, code generation, analysis, synthesis.
Reply with one word: SIMPLE or COMPLEX."""

def route_query(query: str) -> tuple[str, str]:
    """Returns (model, reason)."""
    classification = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model classifies first
        messages=[{"role": "system", "content": COMPLEXITY_CLASSIFIER},
                  {"role": "user", "content": query}],
        max_tokens=5,
    ).choices[0].message.content.strip().upper()
    if classification == "SIMPLE":
        return "gpt-4o-mini", "Simple query → cheap model"
    return "gpt-4o", "Complex query → powerful model"

def smart_complete(query: str) -> str:
    model, reason = route_query(query)
    print(f"  [{reason}] using {model}")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

print(smart_complete("What year was Python created?"))         # simple → cheap
print(smart_complete("Design a rate limiter for a REST API"))  # complex → powerful
```
06 — Next Steps
What to Explore Next
Large Language Models are the foundation, but you need to know the underlying mechanisms and how different approaches (frontier vs open-source) solve different problems. Explore these child concept pages:
1
Deep dive into GPT-4o, Claude 3.5, Gemini 1.5 Pro, and o3: their capabilities, pricing, context windows, and when to use each. Includes model cards and SOTA benchmarks.
07 — Further Reading
References
Foundational Papers
- Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165.
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
- Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
Learning Path
LLMs build on transformers, pre-training, and alignment. Here's the recommended progression:
Transformers (architecture) → Pre-training (next-token loss) → SFT (instruction following) → RLHF / DPO (alignment) → Prompting (using LLMs)
1
Understand the transformer first
LLMs are autoregressive transformers. If you haven't implemented attention yourself, start with Transformers before diving into LLM training specifics.
2
Grasp scaling laws
The Chinchilla paper (Hoffmann et al., 2022) shows that model size and training tokens should scale together: compute-optimal training uses roughly 20 tokens per parameter. Modern open models deliberately overshoot this; Llama 3 8B was trained on about 15T tokens, far beyond compute-optimal, trading extra training compute for a small model that is cheap to serve.
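The 20-tokens-per-parameter rule makes the arithmetic easy to check:

```python
# Chinchilla rule of thumb applied to Llama 3 8B.
params = 8e9
chinchilla_tokens = 20 * params   # ~160B tokens would be compute-optimal
actual_tokens = 15e12             # Llama 3 reportedly trained on ~15T

print(f"compute-optimal budget: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"actual / optimal: {actual_tokens / chinchilla_tokens:.0f}x")
# ~94x past compute-optimal: the model is deliberately "over-trained"
# so that a small, cheap-to-serve model performs as well as possible.
```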
3
Learn the alignment stack
SFT makes models follow instructions. RLHF (or DPO) makes them follow human preferences. Constitutional AI adds principle-based self-critique. Each is layered on the pre-trained base.
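The DPO variant mentioned above skips the separate reward model entirely, working directly from log-probability ratios against a frozen reference policy. A sketch of its loss (β and the log-probs here are toy numbers):

```python
# DPO loss sketch: preference alignment from log-prob ratios, no reward
# model. The policy is pushed to raise p(chosen) and lower p(rejected)
# relative to the frozen reference.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy drifted toward the chosen response → lower loss
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # preference respected
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))  # preference inverted → higher
```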
4
Practice prompting before fine-tuning
Most production LLM tasks can be solved with good prompting alone. Fine-tune only when you have 500+ high-quality examples and prompting has hit a demonstrable ceiling. See Prompting.