
Large Language Models

The foundation of every GenAI system — how they're built and how to choose between them

pretrain → SFT → RLHF (the training stack)
frontier vs open-source (the key choice)
cost · speed · capability (the tradeoffs)
Contents
  1. How LLMs are built
  2. Frontier commercial models
  3. Open-source models
  4. Choosing the right model
  5. Working code example
  6. What to explore next
  7. References
01 — Foundation

How LLMs Are Built

Large Language Models are not trained end-to-end for your specific task. Instead, they follow a three-stage pipeline: pretraining (self-supervised learning on vast text), supervised fine-tuning (SFT, instruction-following), and reinforcement learning from human feedback (RLHF, preference alignment).

Stage 1: Pretraining

Models are trained on hundreds of billions to trillions of tokens of diverse internet text using next-token prediction. This is self-supervised learning: the next token itself serves as the label, so no human annotation is needed. The goal is to absorb world knowledge, reasoning patterns, and language structure. Pretraining is expensive (millions of dollars of compute) and done by only a few organizations.
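Next-token prediction is easiest to see at toy scale. The sketch below stands in for pretraining with a bigram count model; this is purely an illustration, not how real LLMs are trained, but the objective is the same: predict the next token from what came before.

```python
from collections import defaultdict

def train_bigram(corpus: str) -> dict:
    """'Pretrain' a toy next-token model: count which token follows which."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    """Greedy decoding: return the most frequent continuation."""
    followers = counts.get(token)
    if not followers:
        return "<unk>"
    return max(followers, key=followers.get)

model = train_bigram("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

A real LLM replaces the count table with a transformer and the greedy lookup with a learned probability distribution, but the training signal is the same self-supervised one.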

Stage 2: Supervised Fine-Tuning (SFT)

After pretraining, models merely continue text: they do not reliably follow instructions and often produce unhelpful completions. SFT teaches instruction-following by fine-tuning on thousands of (prompt, response) pairs, where each response demonstrates the ideal behavior: helpful, harmless, honest. SFT anchors the model to human intent.
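What one SFT training example looks like depends on the model's chat template. The tags below (`<|user|>`, `<|assistant|>`, `<|end|>`) are illustrative placeholders, not any specific model's format:

```python
def format_sft_example(prompt: str, response: str) -> str:
    """Serialize one (prompt, response) pair into a training string.
    Real models use their own chat templates; during training, the loss
    is typically computed only on the response tokens."""
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}<|end|>"

pairs = [
    ("What is SFT?", "Supervised fine-tuning on curated prompt-response pairs."),
]
examples = [format_sft_example(p, r) for p, r in pairs]
print(examples[0])
```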

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

SFT still leaves a gap: the model learns to imitate demonstrations, but not which of several plausible outputs humans prefer. RLHF fixes this. Humans rank model outputs pairwise, and a reward model learns to predict those rankings. The language model is then fine-tuned with RL to maximize the reward, optimizing for human preference rather than mere imitation.
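The reward model at the heart of RLHF is commonly trained with a Bradley-Terry pairwise loss on those human rankings. A minimal sketch in pure Python:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Low when the reward model scores the human-preferred output higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A reward model that agrees with the human ranking incurs low loss...
print(preference_loss(2.0, -1.0))
# ...and one that disagrees incurs high loss.
print(preference_loss(-1.0, 2.0))
```

Minimizing this loss over many ranked pairs gives a scalar reward function; the policy model is then optimized against it with an RL algorithm such as PPO (DPO folds both steps into one direct objective).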

Stage       | Input                            | Goal                  | Cost      | Typical size
Pretraining | Raw internet text (tokens)       | Next-token prediction | Very high | Billions of tokens
SFT         | Curated (prompt, response) pairs | Instruction-following | Medium    | 10k–100k examples
RLHF        | Human preference rankings        | Align to human values | Medium    | 50k–200k examples
💡 Why the full stack matters: A model trained only on internet text completes text; it does not answer questions, so without SFT + RLHF it is nearly useless as an assistant. Alignment makes the model usable, but the underlying capability comes from pretraining at scale, and that scale remains inaccessible to most teams.
02 — Commercial

Frontier Commercial Models

Frontier models are state-of-the-art large language models from OpenAI, Anthropic, Google, and other large labs. They define the current capability ceiling. These models are expensive to run and available only via API, but they offer the best quality, reasoning, and multimodal capabilities.

The Frontier Lineup

1. GPT-4o — OpenAI's multimodal flagship

Fast, capable, sees images and text. Excellent for reasoning, code generation, and customer-facing applications.

  • Training: Multimodal (text + vision)
  • Context window: 128k tokens
  • Strengths: Reasoning, code, math, structured output
  • Cost: ~$0.005 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://platform.openai.com/docs/models/gpt-4o
2. Claude 3.5 Sonnet — Anthropic's balanced choice

Long context window (200k), excellent instruction-following, strong at nuance and safety. Best for document understanding and RAG systems.

  • Training: Text-based, constitutional AI alignment
  • Context window: 200k tokens (can fit entire books)
  • Strengths: Long context, instruction-following, nuance
  • Cost: ~$0.003 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://www.anthropic.com/claude
3. Gemini 1.5 Pro — Google's multimodal powerhouse

Sees video, audio, images, and text. Massive 2M token context. Excellent for multimodal understanding and search integration.

  • Training: Multimodal (text, image, video, audio)
  • Context window: 2M tokens (10x most competitors)
  • Strengths: Multimodal, long context, search integration
  • Cost: ~$0.00375 per 1k input tokens; ~$0.015 per 1k output tokens
  • API: https://ai.google.dev/docs/gemini_api
4. o3 — OpenAI's reasoning specialist

Optimized for complex reasoning with test-time compute. Best-in-class on AIME, Codeforces, research tasks. Not for real-time applications.

  • Training: Chain-of-thought reasoning optimization
  • Context window: 128k tokens
  • Strengths: Reasoning, STEM, competitive programming
  • Cost: Premium pricing for thinking time (negotiated)
  • API: Limited availability, research access only
⚠️ Model selection tip: GPT-4o for reasoning, Claude for long context, Gemini for multimodality. On the cost-vs-quality spectrum, o3 > GPT-4o > Claude > open-source, with price roughly tracking capability. Always benchmark on your own task: the state of the art shifts monthly.
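The per-1k-token prices listed above make back-of-envelope estimates easy. The numbers below are copied from those lists and drift over time, so treat them as illustrative rather than current:

```python
# Prices per 1k tokens, taken from the lineup above (illustrative; verify current pricing)
PRICES = {
    "gpt-4o":            {"in": 0.005,   "out": 0.015},
    "claude-3.5-sonnet": {"in": 0.003,   "out": 0.015},
    "gemini-1.5-pro":    {"in": 0.00375, "out": 0.015},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one API call."""
    p = PRICES[model]
    return input_tokens / 1000 * p["in"] + output_tokens / 1000 * p["out"]

# A 2k-token prompt with a 500-token answer:
for m in PRICES:
    print(f"{m:20s} ${query_cost(m, 2000, 500):.4f}")
```

Note that output tokens cost 3–5x more than input tokens on every model here, so capping `max_tokens` is often the easiest cost lever.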
03 — Open Source

Open-Source Models

Open-source LLMs can be downloaded and run locally or self-hosted. They offer privacy, no per-token costs (only compute), and full customization via fine-tuning. The tradeoff: lower capability, more engineering work, higher infrastructure costs.

Top Open-Source Models

1. Llama 3.1 — Meta's efficient powerhouse

8B, 70B, and 405B variants. Fast, capable, widely deployed. The best open-source base family for fine-tuning.

  • Sizes: 8B (fast), 70B (quality), 405B (frontier-class)
  • Context: 128k tokens
  • Strengths: Speed, code, instruction-following
  • Use: Self-host, fine-tune, or via API (Together AI, Groq)
  • Cost: ~$0.0002–0.001 per 1k tokens on inference APIs
2. Mixtral 8x22B — Mixture-of-Experts efficiency

MoE architecture: each token is routed to 2 of 8 experts, so only a fraction of the parameters is active per token. Fast inference with strong quality for the active compute.

  • Architecture: Sparse mixture-of-experts (MoE)
  • Context: 64k tokens
  • Strengths: Inference speed, cost-efficiency
  • Use: Self-host or via API (Mistral, Together AI)
  • Cost: Lower than dense models for same quality
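The top-2-of-8 routing described above can be sketched with a softmax gate. This is a simplification (real routers are learned linear layers applied to each token's hidden state inside every MoE block), but the selection-and-renormalize step looks like this:

```python
import math

def top2_route(gate_logits: list[float]) -> list[tuple[int, float]]:
    """Pick the top-2 experts and renormalize their softmax weights,
    Mixtral-style: the other experts do no work for this token."""
    shifted = [g - max(gate_logits) for g in gate_logits]  # numerical stability
    exp = [math.exp(s) for s in shifted]
    probs = [e / sum(exp) for e in exp]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

# 8 experts; only 2 receive this token
routing = top2_route([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3])
print(routing)  # experts 1 and 4 carry the token, weights summing to 1
```

The token's output is then the weighted sum of just those two experts' outputs, which is why an MoE model can match a much larger dense model at a fraction of the per-token compute.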
3. Qwen 2.5 — Alibaba's multilingual competitor

Bilingual (Chinese + English), excellent code understanding, strong on knowledge tasks. Popular in Asia.

  • Sizes: 1.5B (tiny), 7B, 14B, 32B, 72B
  • Context: 128k tokens (32B+)
  • Strengths: Multilingual, code, knowledge
  • Use: Self-host or via API (Alibaba Cloud, Together AI)
  • Cost: Comparable to Llama
4. Phi-3 — Microsoft's distilled powerhouse

Tiny (3.8B–14B) but surprisingly capable. Optimized for mobile and edge devices.

  • Sizes: 3.8B (mobile), 7B, 14B
  • Context: 128k tokens
  • Strengths: Efficiency, mobile deployment, reasoning
  • Use: Edge devices, local-first applications
  • Cost: Very low inference cost
When to use open-source: privacy-critical data, high-volume inference where per-token cost dominates, a need to fine-tune, or strict compliance requirements. When not to: you need frontier reasoning, have tight latency SLAs, or lack ML infrastructure.
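The cost-sensitivity point can be made concrete with a break-even estimate. The GPU and token prices below are assumptions drawn from figures elsewhere on this page, and the model ignores engineering time, utilization gaps, and redundancy:

```python
def breakeven_tokens_per_hour(api_price_per_1k: float, gpu_price_per_hour: float) -> float:
    """Sustained tokens/hour at which a self-hosted GPU matches API spend.
    Below this volume, the API is cheaper; above it, self-hosting wins."""
    return gpu_price_per_hour / api_price_per_1k * 1000

# Assumed figures: H100 at ~$2.50/hr, Llama on an inference API at ~$0.0005/1k tokens
tokens = breakeven_tokens_per_hour(0.0005, 2.50)
print(f"Self-hosting breaks even above ~{tokens:,.0f} tokens/hour")
```

At millions of tokens per hour the GPU pays for itself, which is why "high-volume inference" leads the when-to-self-host list above.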
04 — Selection

Choosing the Right Model

Model selection is a multidimensional optimization problem: capability (can it solve your task?), cost (how much per inference?), latency (how fast?), context window (can it fit your documents?), and multimodality (does it need to see images?). No single model dominates all dimensions.

Decision Framework

🚀 Capability

  • Benchmark on your task (QA, summarization, code, etc.)
  • Test reasoning: frontier > open-source
  • Use evals: MMLU, HumanEval, HELM

💰 Cost

  • API: ~$0.001–0.01 per 1k output tokens
  • Self-host: GPU compute (H100 ~$2–3/hr)
  • Volume discounts: negotiate for high volume

⚡ Latency

  • API latency: 500ms–2s p95 typical
  • Self-hosted: 50–200ms (H100, 8B model)
  • Real-time: sub-100ms requires a small or distilled model

📏 Context Window

  • Document Q&A: 200k+ (Claude, Gemini)
  • Chat: 4k–32k sufficient
  • Code repository: 128k+ (Llama 3.1, Qwen)
# Benchmarking LLMs: latency, cost, and quality in one script
# pip install openai anthropic
import time

import anthropic
import openai

def benchmark_llm(prompt: str, model_configs: list) -> list:
    results = []
    for cfg in model_configs:
        start = time.perf_counter()
        if cfg["provider"] == "openai":
            client = openai.OpenAI()
            resp = client.chat.completions.create(
                model=cfg["model"],
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            text = resp.choices[0].message.content
            usage = resp.usage
            in_tok, out_tok = usage.prompt_tokens, usage.completion_tokens
        elif cfg["provider"] == "anthropic":
            client = anthropic.Anthropic()
            resp = client.messages.create(
                model=cfg["model"],
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            text = resp.content[0].text
            in_tok, out_tok = resp.usage.input_tokens, resp.usage.output_tokens
        latency = time.perf_counter() - start
        cost = in_tok / 1e6 * cfg["in_$/1m"] + out_tok / 1e6 * cfg["out_$/1m"]
        results.append({
            "model": cfg["model"],
            "latency_s": round(latency, 2),
            "cost_$": round(cost, 5),
            "tokens": out_tok,
            "output": text[:80],
        })
    return results

configs = [
    {"provider": "openai",    "model": "gpt-4o-mini",               "in_$/1m": 0.15, "out_$/1m": 0.60},
    {"provider": "openai",    "model": "gpt-4o",                    "in_$/1m": 2.50, "out_$/1m": 10.0},
    {"provider": "anthropic", "model": "claude-haiku-4-5-20251001", "in_$/1m": 0.25, "out_$/1m": 1.25},
]

rows = benchmark_llm("Explain the transformer attention mechanism in 3 sentences.", configs)
for r in rows:
    print(f"{r['model']:30s} {r['latency_s']:.2f}s  ${r['cost_$']:.5f}  {r['tokens']}tok")

Quick Decision Tree

Need reasoning (math, code)? → GPT-4o or o3. Long documents (RAG)? → Claude 3.5 Sonnet. Multimodal (images, video)? → Gemini 1.5 or GPT-4o. Cost-sensitive, high volume? → Llama 3.1 70B (self-hosted) or Mixtral. Mobile / edge? → Phi-3. Bilingual? → Qwen 2.5.
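The decision tree above can be captured as a small rule-based selector. The model names are the ones this page recommends, the rule order follows the tree, and the final default is an assumption for illustration:

```python
def choose_model(needs_reasoning=False, long_documents=False, multimodal=False,
                 cost_sensitive=False, edge_device=False, bilingual=False) -> str:
    """Rule-based model selection mirroring the decision tree above.
    Rules are checked in the same priority order as the tree."""
    if needs_reasoning:
        return "GPT-4o / o3"
    if long_documents:
        return "Claude 3.5 Sonnet"
    if multimodal:
        return "Gemini 1.5 Pro / GPT-4o"
    if cost_sensitive:
        return "Llama 3.1 70B (self-hosted) / Mixtral"
    if edge_device:
        return "Phi-3"
    if bilingual:
        return "Qwen 2.5"
    return "gpt-4o-mini"  # default fallback (an assumption, not from the tree)

print(choose_model(long_documents=True))
print(choose_model(needs_reasoning=True, multimodal=True))  # reasoning wins: rules are ordered
```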

Use case          | Best model        | Runner-up      | Approximate cost
Complex reasoning | GPT-4o / o3       | Claude 3.5     | $0.01–0.05 per query
Long document Q&A | Claude 3.5        | Gemini 1.5 Pro | $0.01–0.02 per query
Real-time chat    | Llama 8B or Phi-3 | Mixtral 8x22B  | ~$0.0001 per query (self-hosted)
Multimodal tasks  | Gemini 1.5 Pro    | GPT-4o         | $0.01–0.02 per query
Cost-sensitive    | Llama 3.1 70B     | Qwen 2.5 32B   | ~$0.0002 per 1k tokens
05 — Code

Working Code Example

Here's how to call multiple LLM providers with a unified interface. This example uses Anthropic Claude, OpenAI GPT-4o, and LiteLLM (a wrapper library that unifies all APIs).

import anthropic
import openai

# Anthropic Claude
anthropic_client = anthropic.Anthropic()
claude_resp = anthropic_client.messages.create(
    model="claude-opus-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("Claude:", claude_resp.content[0].text)

# OpenAI GPT
openai_client = openai.OpenAI()
gpt_resp = openai_client.chat.completions.create(
    model="gpt-4o",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
)
print("GPT-4o:", gpt_resp.choices[0].message.content)

# LiteLLM for a unified interface across providers
from litellm import completion
resp = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}],
    max_tokens=256,
)
print("Gemini:", resp.choices[0].message.content)

Setup

Install dependencies: pip install anthropic openai litellm google-generativeai. Set environment variables: ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY.

💡 LiteLLM benefit: Swap models without changing code. Use it for A/B testing different models, automatic fallback on API failure, and unified logging across providers.
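The automatic-fallback pattern mentioned above is provider-agnostic and worth isolating. In this sketch, `call_model` is a stand-in for whatever client function you use (litellm's `completion`, for instance); the demo stubs it out so the logic is testable without API keys:

```python
def with_fallback(models: list[str], call_model, prompt: str) -> tuple[str, str]:
    """Try models in order; return (model_used, response).
    `call_model(model, prompt)` is any callable that raises on failure."""
    last_err = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as err:  # in production, catch provider-specific errors
            last_err = err
    raise RuntimeError(f"All models failed: {last_err}")

# Demo with a stub that simulates the primary provider being down
def flaky_call(model, prompt):
    if model == "gpt-4o":
        raise TimeoutError("provider timeout")
    return f"[{model}] answer"

used, resp = with_fallback(["gpt-4o", "claude-3-5-sonnet-latest"], flaky_call, "hi")
print(used, resp)
```

Because the fallback chain is just a list of model names, swapping the order (or inserting a cheap model first) changes routing behavior without touching calling code.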
# LLM routing: send queries to the cheapest capable model
# pip install openai anthropic
import openai

client = openai.OpenAI()

COMPLEXITY_CLASSIFIER = """Classify this user query as SIMPLE or COMPLEX.
SIMPLE: factual lookups, short answers, basic rewrites.
COMPLEX: multi-step reasoning, code generation, analysis, synthesis.
Reply with one word: SIMPLE or COMPLEX."""

def route_query(query: str) -> tuple[str, str]:
    """Returns (model, reason)."""
    classification = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model classifies first
        messages=[{"role": "system", "content": COMPLEXITY_CLASSIFIER},
                  {"role": "user", "content": query}],
        max_tokens=5,
    ).choices[0].message.content.strip().upper()
    if classification == "SIMPLE":
        return "gpt-4o-mini", "Simple query → cheap model"
    else:
        return "gpt-4o", "Complex query → powerful model"

def smart_complete(query: str) -> str:
    model, reason = route_query(query)
    print(f"  [{reason}] using {model}")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

print(smart_complete("What year was Python created?"))         # → cheap
print(smart_complete("Design a rate limiter for a REST API"))  # → powerful
06 — Next Steps

What to Explore Next

Large Language Models are the foundation, but you also need to understand the underlying mechanisms and how the two approaches (frontier vs open-source) solve different problems. Explore these child concept pages:

1. Frontier Models — State-of-the-art commercial

Deep dive into GPT-4o, Claude 3.5, Gemini 1.5 Pro, and o3: their capabilities, pricing, context windows, and when to use each. Includes model cards and SOTA benchmarks.

2. Open-Source LLMs — Run locally

Llama 3.1, Mistral, Qwen 2.5, Phi-3: how to download, quantize, fine-tune, and deploy. Infrastructure tradeoffs, inference optimization, and cost analysis for self-hosted models.

3. LLM Internals — Under the hood

Tokenization, attention mechanism, context window limits, scaling laws, and emergent abilities. Why are transformers effective? What are the theoretical limits?

07 — Further Reading

References

Foundational Papers
Model Cards & Documentation
Benchmarks & Leaderboards

Learning Path

LLMs build on transformers, pre-training, and alignment. Here's the recommended progression:

Transformers (architecture) → Pre-training (next-token loss) → SFT (instruction following) → RLHF / DPO (alignment) → Prompting (using LLMs)
1. Understand the transformer first

LLMs are autoregressive transformers. If you haven't implemented attention yourself, start with Transformers before diving into LLM training specifics.

2. Grasp scaling laws

The Chinchilla paper (Hoffmann et al., 2022) shows that model size and training tokens should scale together: compute-optimal training uses ~20 tokens per parameter. Modern models deliberately overshoot this; Llama 3 8B was trained on 15T tokens, far beyond Chinchilla-optimal, because extra pretraining keeps improving quality at a fixed inference cost.
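The ~20 tokens-per-parameter rule is easy to sanity-check with arithmetic:

```python
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens per Hoffmann et al. (2022)."""
    return params * tokens_per_param

optimal = chinchilla_optimal_tokens(8e9)  # 160B tokens for an 8B-parameter model
print(f"Chinchilla-optimal for 8B params: {optimal / 1e9:.0f}B tokens")
print(f"Llama 3 8B's 15T tokens is ~{15e12 / optimal:.0f}x that (overtrained for inference quality)")
```

The gap between 160B and 15T tokens is the point: Chinchilla optimizes training compute, while Llama-style overtraining trades extra training compute for a better model at the same serving cost.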

3. Learn the alignment stack

SFT makes models follow instructions. RLHF (or DPO) makes them follow human preferences. Constitutional AI adds principle-based self-critique. Each is layered on the pre-trained base.

4. Practice prompting before fine-tuning

95% of production LLM tasks can be solved with good prompting. Fine-tune only when you have 500+ high-quality examples and prompting has hit a demonstrable ceiling. See Prompting.