
Large Language Models

The foundation of every GenAI system — how they're built and how to choose between them

pretrain → SFT → RLHF (the training stack)
frontier vs open-source (the key choice)
cost · speed · capability (the tradeoffs)
Contents
  1. How LLMs are built
  2. Frontier commercial models
  3. Open-source models
  4. Choosing the right model
  5. Working code example
  6. What to explore next
  7. References
01 — Foundation

How LLMs Are Built

Large Language Models are not trained end-to-end for your specific task. Instead, they follow a three-stage pipeline: pretraining (self-supervised learning on vast text), supervised fine-tuning (SFT, instruction-following), and reinforcement learning from human feedback (RLHF, preference alignment).

Stage 1: Pretraining

Models are trained on hundreds of billions to trillions of tokens of diverse internet text using next-token prediction. This is self-supervised learning: the next token itself serves as the label, so no human annotation is needed. The goal is to absorb world knowledge, reasoning patterns, and language structure. Pretraining is expensive (millions of dollars of compute) and done by only a few organizations.
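Next-token prediction is easiest to see at toy scale. The sketch below stands in for pretraining with a bigram count model; this is purely an illustration, not how real LLMs are trained, but the objective is the same: predict the next token from what came before.

```python
from collections import defaultdict

def train_bigram(corpus: str) -> dict:
    """'Pretrain' a toy next-token model: count which token follows which."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    """Greedy decoding: return the most frequent continuation."""
    followers = counts.get(token)
    if not followers:
        return "<unk>"
    return max(followers, key=followers.get)

model = train_bigram("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

A real LLM replaces the count table with a transformer and the greedy lookup with a learned probability distribution, but the training signal is the same self-supervised one.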

Stage 2: Supervised Fine-Tuning (SFT)

After pretraining, models merely continue text: they do not reliably follow instructions and often produce unhelpful completions. SFT teaches instruction-following by fine-tuning on thousands of (prompt, response) pairs, where each response demonstrates the ideal behavior: helpful, harmless, honest. SFT anchors the model to human intent.
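What one SFT training example looks like depends on the model's chat template. The tags below (`<|user|>`, `<|assistant|>`, `<|end|>`) are illustrative placeholders, not any specific model's format:

```python
def format_sft_example(prompt: str, response: str) -> str:
    """Serialize one (prompt, response) pair into a training string.
    Real models use their own chat templates; during training, the loss
    is typically computed only on the response tokens."""
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}<|end|>"

pairs = [
    ("What is SFT?", "Supervised fine-tuning on curated prompt-response pairs."),
]
examples = [format_sft_example(p, r) for p, r in pairs]
print(examples[0])
```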

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

SFT still leaves a gap: the model learns to imitate demonstrations, but not which of several plausible outputs humans prefer. RLHF fixes this. Humans rank model outputs pairwise, and a reward model learns to predict those rankings. The language model is then fine-tuned with RL to maximize the reward, optimizing for human preference rather than mere imitation.
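The reward model at the heart of RLHF is commonly trained with a Bradley-Terry pairwise loss on those human rankings. A minimal sketch in pure Python:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Low when the reward model scores the human-preferred output higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A reward model that agrees with the human ranking incurs low loss...
print(preference_loss(2.0, -1.0))
# ...and one that disagrees incurs high loss.
print(preference_loss(-1.0, 2.0))
```

Minimizing this loss over many ranked pairs gives a scalar reward function; the policy model is then optimized against it with an RL algorithm such as PPO (DPO folds both steps into one direct objective).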

Stage       | Input                            | Goal                  | Cost      | Typical size
Pretraining | Raw internet text (tokens)       | Next-token prediction | Very high | Billions of tokens
SFT         | Curated (prompt, response) pairs | Instruction-following | Medium    | 10k–100k examples
RLHF        | Human preference rankings        | Align to human values | Medium    | 50k–200k examples
💡 Why the full stack matters: A model trained only on internet text completes text; it does not answer questions, so without SFT + RLHF it is nearly useless as an assistant. Alignment makes the model usable, but the underlying capability comes from pretraining at scale, and that scale remains inaccessible to most teams.
02 — Commercial

Frontier Commercial Models

Frontier models are state-of-the-art large language models from OpenAI, Anthropic, Google, and other large labs. They define the current capability ceiling. These models are expensive to run and available only via API, but they offer the best quality, reasoning, and multimodal capabilities.

The Frontier Lineup

1. GPT-4o — OpenAI's multimodal flagship

Fast, capable, sees images and text. Excellent for reasoning, code generation, and customer-facing applications.

  • Training: Multimodal (text + vision)
  • Context window: 128k tokens
  • Strengths: Reasoning, code, math, structured output
  • Cost: ~$0.005 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://platform.openai.com/docs/models/gpt-4o
2. Claude 3.5 Sonnet — Anthropic's balanced choice

Long context window (200k), excellent instruction-following, strong at nuance and safety. Best for document understanding and RAG systems.

  • Training: Text-based, constitutional AI alignment
  • Context window: 200k tokens (can fit entire books)
  • Strengths: Long context, instruction-following, nuance
  • Cost: ~$0.003 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://www.anthropic.com/claude
3. Gemini 1.5 Pro — Google's multimodal powerhouse

Sees video, audio, images, and text. Massive 2M token context. Excellent for multimodal understanding and search integration.

  • Training: Multimodal (text, image, video, audio)
  • Context window: 2M tokens (10x most competitors)
  • Strengths: Multimodal, long context, search integration
  • Cost: ~$0.00375 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://ai.google.dev/docs/gemini_api
4. o3 — OpenAI's reasoning specialist

Optimized for complex reasoning with test-time compute. Best-in-class on AIME, Codeforces, research tasks. Not for real-time applications.

  • Training: Chain-of-thought reasoning optimization
  • Context window: 128k tokens
  • Strengths: Reasoning, STEM, competitive programming
  • Cost: Premium pricing for thinking time (negotiated)
  • API: Limited availability, research access only
⚠️ Model selection tip: GPT-4o for reasoning, Claude for long context, Gemini for multimodality. On the cost-vs-quality spectrum, o3 > GPT-4o > Claude > open-source, with price roughly tracking capability. Always benchmark on your own task: the state of the art shifts monthly.
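The per-1k-token prices listed above make back-of-envelope estimates easy. The numbers below are copied from those lists and drift over time, so treat them as illustrative rather than current:

```python
# Prices per 1k tokens, taken from the lineup above (illustrative; verify current pricing)
PRICES = {
    "gpt-4o":            {"in": 0.005,   "out": 0.015},
    "claude-3.5-sonnet": {"in": 0.003,   "out": 0.015},
    "gemini-1.5-pro":    {"in": 0.00375, "out": 0.015},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one API call."""
    p = PRICES[model]
    return input_tokens / 1000 * p["in"] + output_tokens / 1000 * p["out"]

# A 2k-token prompt with a 500-token answer:
for m in PRICES:
    print(f"{m:20s} ${query_cost(m, 2000, 500):.4f}")
```

Note that output tokens cost 3–5x more than input tokens on every model here, so capping `max_tokens` is often the easiest cost lever.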
03 — Open Source

Open-Source Models

Open-source LLMs can be downloaded and run locally or self-hosted. They offer privacy, no per-token costs (only compute), and full customization via fine-tuning. The tradeoff: lower capability, more engineering work, higher infrastructure costs.

Top Open-Source Models

1. Llama 3.1 — Meta's efficient powerhouse

8B, 70B, and 405B variants. Fast, capable, widely deployed. The best open-source base family for fine-tuning.

  • Sizes: 8B (fast), 70B (quality), 405B (frontier-class)
  • Context: 128k tokens
  • Strengths: Speed, code, instruction-following
  • Use: Self-host, fine-tune, or via API (Together AI, Groq)
  • Cost: ~$0.0002–0.001 per 1k tokens on inference APIs
2. Mixtral 8x22B — Mixture-of-Experts efficiency

MoE architecture: each token is routed to 2 of 8 experts, so only a fraction of the parameters is active per token. Fast inference with strong quality for the active compute.

  • Architecture: Sparse mixture-of-experts (MoE)
  • Context: 64k tokens
  • Strengths: Inference speed, cost-efficiency
  • Use: Self-host or via API (Mistral, Together AI)
  • Cost: Lower than dense models for same quality
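The top-2-of-8 routing described above can be sketched with a softmax gate. This is a simplification (real routers are learned linear layers applied to each token's hidden state inside every MoE block), but the selection-and-renormalize step looks like this:

```python
import math

def top2_route(gate_logits: list[float]) -> list[tuple[int, float]]:
    """Pick the top-2 experts and renormalize their softmax weights,
    Mixtral-style: the other experts do no work for this token."""
    shifted = [g - max(gate_logits) for g in gate_logits]  # numerical stability
    exp = [math.exp(s) for s in shifted]
    probs = [e / sum(exp) for e in exp]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

# 8 experts; only 2 receive this token
routing = top2_route([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3])
print(routing)  # experts 1 and 4 carry the token, weights summing to 1
```

The token's output is then the weighted sum of just those two experts' outputs, which is why an MoE model can match a much larger dense model at a fraction of the per-token compute.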
3. Qwen 2.5 — Alibaba's multilingual competitor

Bilingual (Chinese + English), excellent code understanding, strong on knowledge tasks. Popular in Asia.

  • Sizes: 1.5B (tiny), 7B, 14B, 32B, 72B
  • Context: 128k tokens (32B+)
  • Strengths: Multilingual, code, knowledge
  • Use: Self-host or via API (Alibaba Cloud, Together AI)
  • Cost: Comparable to Llama
4. Phi-3 — Microsoft's distilled powerhouse

Tiny (3.8B–14B) but surprisingly capable. Optimized for mobile and edge devices.

  • Sizes: 3.8B (mobile), 7B, 14B
  • Context: 128k tokens
  • Strengths: Efficiency, mobile deployment, reasoning
  • Use: Edge devices, local-first applications
  • Cost: Very low inference cost
When to use open-source: privacy-critical data, high-volume inference where per-token cost dominates, a need to fine-tune, or strict compliance requirements. When not to: you need frontier reasoning, have tight latency SLAs, or lack ML infrastructure.
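The cost-sensitivity point can be made concrete with a break-even estimate. The GPU and token prices below are assumptions drawn from figures elsewhere on this page, and the model ignores engineering time, utilization gaps, and redundancy:

```python
def breakeven_tokens_per_hour(api_price_per_1k: float, gpu_price_per_hour: float) -> float:
    """Sustained tokens/hour at which a self-hosted GPU matches API spend.
    Below this volume, the API is cheaper; above it, self-hosting wins."""
    return gpu_price_per_hour / api_price_per_1k * 1000

# Assumed figures: H100 at ~$2.50/hr, Llama on an inference API at ~$0.0005/1k tokens
tokens = breakeven_tokens_per_hour(0.0005, 2.50)
print(f"Self-hosting breaks even above ~{tokens:,.0f} tokens/hour")
```

At millions of tokens per hour the GPU pays for itself, which is why "high-volume inference" leads the when-to-self-host list above.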
04 — Selection

Choosing the Right Model

Model selection is a multidimensional optimization problem: capability (can it solve your task?), cost (how much per inference?), latency (how fast?), context window (can it fit your documents?), and multimodality (does it need to see images?). No single model dominates all dimensions.

Decision Framework

🚀 Capability

  • Benchmark on your task (QA, summarization, code, etc.)
  • Test reasoning: frontier > open-source
  • Use evals: MMLU, HumanEval, HELM

💰 Cost

  • API: ~$0.001–0.01 per 1k output tokens
  • Self-host: GPU compute (H100 ~$2–3/hr)
  • Volume discounts: negotiate for high volume

⚡ Latency

  • API latency: 500ms–2s p95 typical
  • Self-hosted: 50–200ms (H100, 8B model)
  • Real-time: sub-100ms requires a small or distilled model

📏 Context Window

  • Document Q&A: 200k+ (Claude, Gemini)
  • Chat: 4k–32k sufficient
  • Code repository: 128k+ (Llama 3.1, Qwen)
# Benchmarking LLMs: latency, cost, and quality in one script
# pip install openai anthropic
import time

import anthropic
import openai

def benchmark_llm(prompt: str, model_configs: list) -> list:
    results = []
    for cfg in model_configs:
        start = time.perf_counter()
        if cfg["provider"] == "openai":
            client = openai.OpenAI()
            resp = client.chat.completions.create(
                model=cfg["model"],
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            text = resp.choices[0].message.content
            usage = resp.usage
            in_tok, out_tok = usage.prompt_tokens, usage.completion_tokens
        elif cfg["provider"] == "anthropic":
            client = anthropic.Anthropic()
            resp = client.messages.create(
                model=cfg["model"],
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            text = resp.content[0].text
            in_tok, out_tok = resp.usage.input_tokens, resp.usage.output_tokens
        latency = time.perf_counter() - start
        cost = in_tok / 1e6 * cfg["in_$/1m"] + out_tok / 1e6 * cfg["out_$/1m"]
        results.append({
            "model": cfg["model"],
            "latency_s": round(latency, 2),
            "cost_$": round(cost, 5),
            "tokens": out_tok,
            "output": text[:80],
        })
    return results

configs = [
    {"provider": "openai",    "model": "gpt-4o-mini",               "in_$/1m": 0.15, "out_$/1m": 0.60},
    {"provider": "openai",    "model": "gpt-4o",                    "in_$/1m": 2.50, "out_$/1m": 10.0},
    {"provider": "anthropic", "model": "claude-haiku-4-5-20251001", "in_$/1m": 0.25, "out_$/1m": 1.25},
]

rows = benchmark_llm("Explain the transformer attention mechanism in 3 sentences.", configs)
for r in rows:
    print(f"{r['model']:30s} {r['latency_s']:.2f}s  ${r['cost_$']:.5f}  {r['tokens']}tok")

Quick Decision Tree

Need reasoning (math, code)? → GPT-4o or o3. Long documents (RAG)? → Claude 3.5 Sonnet. Multimodal (images, video)? → Gemini 1.5 or GPT-4o. Cost-sensitive, high volume? → Llama 3.1 70B (self-hosted) or Mixtral. Mobile / edge? → Phi-3. Bilingual? → Qwen 2.5.
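The decision tree above can be captured as a small rule-based selector. The model names are the ones this page recommends, the rule order follows the tree, and the final default is an assumption for illustration:

```python
def choose_model(needs_reasoning=False, long_documents=False, multimodal=False,
                 cost_sensitive=False, edge_device=False, bilingual=False) -> str:
    """Rule-based model selection mirroring the decision tree above.
    Rules are checked in the same priority order as the tree."""
    if needs_reasoning:
        return "GPT-4o / o3"
    if long_documents:
        return "Claude 3.5 Sonnet"
    if multimodal:
        return "Gemini 1.5 Pro / GPT-4o"
    if cost_sensitive:
        return "Llama 3.1 70B (self-hosted) / Mixtral"
    if edge_device:
        return "Phi-3"
    if bilingual:
        return "Qwen 2.5"
    return "gpt-4o-mini"  # default fallback (an assumption, not from the tree)

print(choose_model(long_documents=True))
print(choose_model(needs_reasoning=True, multimodal=True))  # reasoning wins: rules are ordered
```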

Use case          | Best model        | Runner-up      | Approximate cost
Complex reasoning | GPT-4o / o3       | Claude 3.5     | $0.01–0.05 per query
Long document Q&A | Claude 3.5        | Gemini 1.5 Pro | $0.01–0.02 per query
Real-time chat    | Llama 8B or Phi-3 | Mixtral 8x22B  | ~$0.0001 per query (self-hosted)
Multimodal tasks  | Gemini 1.5 Pro    | GPT-4o         | $0.01–0.02 per query
Cost-sensitive    | Llama 3.1 70B     | Qwen 2.5 32B   | ~$0.0002 per 1k tokens
05 — Code

Working Code Example

Here's how to call multiple LLM providers with a unified interface. This example uses Anthropic Claude, OpenAI GPT-4o, and LiteLLM (a wrapper library that unifies all APIs).

import anthropic
import openai

# Anthropic Claude
anthropic_client = anthropic.Anthropic()
claude_resp = anthropic_client.messages.create(
    model="claude-opus-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("Claude:", claude_resp.content[0].text)

# OpenAI GPT
openai_client = openai.OpenAI()
gpt_resp = openai_client.chat.completions.create(
    model="gpt-4o",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("GPT-4o:", gpt_resp.choices[0].message.content)

# LiteLLM for a unified interface across providers
from litellm import completion
resp = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
    max_tokens=256,
)
print("Gemini:", resp.choices[0].message.content)

Setup

Install dependencies: pip install anthropic openai litellm google-generativeai. Set environment variables: ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY.

💡 LiteLLM benefit: Swap models without changing code. Use it for A/B testing different models, automatic fallback on API failure, and unified logging across providers.
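The automatic-fallback pattern mentioned above is provider-agnostic and worth isolating. In this sketch, `call_model` is a stand-in for whatever client function you use (litellm's `completion`, for instance); the demo stubs it out so the logic is testable without API keys:

```python
def with_fallback(models: list[str], call_model, prompt: str) -> tuple[str, str]:
    """Try models in order; return (model_used, response).
    `call_model(model, prompt)` is any callable that raises on failure."""
    last_err = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as err:  # in production, catch provider-specific errors
            last_err = err
    raise RuntimeError(f"All models failed: {last_err}")

# Demo with a stub that simulates the primary provider being down
def flaky_call(model, prompt):
    if model == "gpt-4o":
        raise TimeoutError("provider timeout")
    return f"[{model}] answer"

used, resp = with_fallback(["gpt-4o", "claude-3-5-sonnet-latest"], flaky_call, "hi")
print(used, resp)
```

Because the fallback chain is just a list of model names, swapping the order (or inserting a cheap model first) changes routing behavior without touching calling code.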
# LLM routing: send queries to the cheapest capable model
# pip install openai anthropic
import openai

client = openai.OpenAI()

COMPLEXITY_CLASSIFIER = """Classify this user query as SIMPLE or COMPLEX.
SIMPLE: factual lookups, short answers, basic rewrites.
COMPLEX: multi-step reasoning, code generation, analysis, synthesis.
Reply with one word: SIMPLE or COMPLEX."""

def route_query(query: str) -> tuple[str, str]:
    """Returns (model, reason)."""
    classification = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model classifies first
        messages=[{"role": "system", "content": COMPLEXITY_CLASSIFIER},
                  {"role": "user", "content": query}],
        max_tokens=5,
    ).choices[0].message.content.strip().upper()
    if classification == "SIMPLE":
        return "gpt-4o-mini", "Simple query → cheap model"
    else:
        return "gpt-4o", "Complex query → powerful model"

def smart_complete(query: str) -> str:
    model, reason = route_query(query)
    print(f"  [{reason}] using {model}")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

print(smart_complete("What year was Python created?"))         # → cheap
print(smart_complete("Design a rate limiter for a REST API"))  # → powerful
06 — Next Steps

What to Explore Next

Large Language Models are the foundation, but you also need to understand the underlying mechanisms and how the two approaches (frontier vs open-source) solve different problems. Explore these child concept pages:

1. Frontier Models — State-of-the-art commercial

Deep dive into GPT-4o, Claude 3.5, Gemini 1.5 Pro, and o3: their capabilities, pricing, context windows, and when to use each. Includes model cards and SOTA benchmarks.

2. Open-Source LLMs — Run locally

Llama 3.1, Mistral, Qwen 2.5, Phi-3: how to download, quantize, fine-tune, and deploy. Infrastructure tradeoffs, inference optimization, and cost analysis for self-hosted models.

3. LLM Internals — Under the hood

Tokenization, attention mechanism, context window limits, scaling laws, and emergent abilities. Why are transformers effective? What are the theoretical limits?

07 — Further Reading

References

Foundational Papers
Model Cards & Documentation
Benchmarks & Leaderboards

Learning Path

LLMs build on transformers, pre-training, and alignment. Here's the recommended progression:

Transformers (architecture) → Pre-training (next-token loss) → SFT (instruction following) → RLHF / DPO (alignment) → Prompting (using LLMs)
1. Understand the transformer first

LLMs are autoregressive transformers. If you haven't implemented attention yourself, start with Transformers before diving into LLM training specifics.

2. Grasp scaling laws

The Chinchilla paper (Hoffmann et al., 2022) shows that model size and training tokens should scale together: compute-optimal training uses ~20 tokens per parameter. Modern models deliberately overshoot this; Llama 3 8B was trained on 15T tokens, far beyond Chinchilla-optimal, because extra pretraining keeps improving quality at a fixed inference cost.
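The ~20 tokens-per-parameter rule is easy to sanity-check with arithmetic:

```python
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens per Hoffmann et al. (2022)."""
    return params * tokens_per_param

optimal = chinchilla_optimal_tokens(8e9)  # 160B tokens for an 8B-parameter model
print(f"Chinchilla-optimal for 8B params: {optimal / 1e9:.0f}B tokens")
print(f"Llama 3 8B's 15T tokens is ~{15e12 / optimal:.0f}x that (overtrained for inference quality)")
```

The gap between 160B and 15T tokens is the point: Chinchilla optimizes training compute, while Llama-style overtraining trades extra training compute for a better model at the same serving cost.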

3. Learn the alignment stack

SFT makes models follow instructions. RLHF (or DPO) makes them follow human preferences. Constitutional AI adds principle-based self-critique. Each is layered on the pre-trained base.

4. Practice prompting before fine-tuning

95% of production LLM tasks can be solved with good prompting. Fine-tune only when you have 500+ high-quality examples and prompting has hit a demonstrable ceiling. See Prompting.