LLM Capabilities

Context Window

The context window defines how much text an LLM can process in one forward pass. From 4K to 1M tokens — understanding context length, the KV-cache cost, position encoding limits, and the 'lost in the middle' failure mode.

4K → 1M: context range today
O(n²): attention compute cost
'Lost in the middle': key failure mode

SECTION 01

What the context window is

The context window (also called context length) is the maximum number of tokens an LLM can attend to in a single forward pass. Input tokens (your prompt + conversation history) and output tokens (the generated response) together must fit within this limit.

Tokens are roughly 3/4 of a word in English: "hello world" = 2 tokens, a typical paragraph = 100–150 tokens, a typical A4 page = 400–600 tokens. Common benchmarks: 4K tokens ≈ 3,000 words, 128K tokens ≈ 96,000 words ≈ a short novel, 1M tokens ≈ 750,000 words ≈ the full Harry Potter series.

The context window is not just about length — it's about what the model can reason over. Everything in the context is attended to simultaneously (O(n²) attention), and positions early in the context can affect positions late in the context. This is fundamentally different from a database query or search index.

SECTION 02

KV-cache and memory cost

During autoregressive generation, the model computes key and value tensors for every token in the context at every layer. These are cached to avoid recomputation — the KV-cache. Memory cost scales as:

KV-cache bytes = 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element

For Llama 3.1 70B (fp16): 2 (K+V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = 327,680 bytes (≈328 KB) per token. At 128K tokens: 327,680 × 128,000 ≈ 42 GB of KV-cache alone, a substantial fraction of the ≈140 GB the fp16 weights themselves occupy.

def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Returns KV-cache memory in GB
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return kv_per_token * seq_len / 1e9

# Llama 3.1 70B
print(f"128K context: {kv_cache_gb(80, 8, 128, 128_000):.1f} GB")
print(f"8K context:   {kv_cache_gb(80, 8, 128, 8_000):.1f} GB")
# 128K context: 41.9 GB
# 8K context:    2.6 GB

This is why serving infrastructure matters: long context requests are memory-hungry and require careful batching to avoid OOM.
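As a rough sizing sketch: whatever GPU memory is left after the weights must be shared among the KV-caches of all in-flight sequences. The node size and weight figures below are illustrative assumptions, not measured values:

```python
def max_concurrent_seqs(gpu_mem_gb, weights_gb, seq_len, kv_gb_per_token):
    # Upper bound on in-flight sequences: memory left after weights,
    # divided by the KV-cache of one full-length sequence.
    # Ignores activations, fragmentation, and paged-attention overhead.
    kv_budget_gb = gpu_mem_gb - weights_gb
    kv_per_seq_gb = kv_gb_per_token * seq_len
    return int(kv_budget_gb // kv_per_seq_gb)

# Hypothetical 8x80GB node serving Llama 3.1 70B in fp16
# (327,680 bytes of KV-cache per token, from the formula above):
print(max_concurrent_seqs(640, 140, 128_000, 327_680 / 1e9))  # 11
print(max_concurrent_seqs(640, 140, 8_000, 327_680 / 1e9))    # 190
```

The 16x gap between the 8K and 128K batch sizes is the core serving problem long context creates.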

SECTION 03

Position encoding limits

LLMs learn position information through positional encodings (RoPE, ALiBi, or absolute). These encodings are trained on sequences up to the training context length. At inference time, extrapolating beyond the training length can degrade quality.

RoPE (Rotary Position Embedding): used by Llama and Mistral. Can be extended beyond the training length with scaling techniques such as YaRN and LongRoPE, which rescale the rotary frequencies. Llama 3.1, for example, uses RoPE scaling to extend an 8K training length to a 128K inference length.
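The simplest form of this trick, linear position interpolation, compresses inference positions into the trained range by scaling the rotary angles; YaRN and LongRoPE are more sophisticated refinements of the same idea. A minimal sketch:

```python
def rope_angles(pos, dim, base=10000.0, scale=1.0):
    # Rotary angles for one position across a head dimension.
    # scale < 1 implements linear position interpolation:
    # position `pos` is treated as `pos * scale`.
    return [pos * scale / base ** (2 * i / dim) for i in range(dim // 2)]

# With scale = trained_len / target_len, position 16_000 in a 16K
# inference context reuses the angles of position 8_000 from an
# 8K training context, so the model never sees unseen angle ranges:
assert rope_angles(16_000, 128, scale=8_000 / 16_000) == rope_angles(8_000, 128)
```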

ALiBi: Uses attention biases that scale linearly with distance. Better extrapolation than learned absolute positions, but still degrades for very long contexts.

The practical implication: a model advertised as "128K context" may actually be extrapolating from a shorter training context. Check whether the model card includes quality benchmarks at long context lengths.

SECTION 04

Lost in the middle

A surprising finding (Liu et al. 2023): LLMs perform best when relevant information is at the start or end of the context, and worst when it's in the middle. Performance degrades roughly in a U-shape across position.

# Demonstrate "lost in the middle" with a simple test
import openai

client = openai.OpenAI()

def ask_with_position(relevant_doc, irrelevant_docs_before, irrelevant_docs_after, question):
    # Place the relevant document at a chosen position among distractors
    context = "\n\n".join(
        irrelevant_docs_before + [relevant_doc] + irrelevant_docs_after
    )
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{context}\n\nQ: {question}"}],
    ).choices[0].message.content

# Quality drops when relevant_doc is buried in the middle:
# Position 0/10 docs: ~90% accuracy
# Position 5/10 docs: ~65% accuracy (middle = worst)
# Position 10/10 docs: ~85% accuracy

Mitigations: put the most important context at the start or end, use retrieval to return only the most relevant chunks, or re-rank retrieved chunks to place the best ones at the boundaries.
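The re-ranking mitigation can be sketched in a few lines, assuming chunks arrive sorted by descending relevance:

```python
def order_for_boundaries(chunks_by_relevance):
    # Alternate chunks between the front and the back of the context,
    # so the most relevant ones sit at the positions the model attends
    # to best and the least relevant ones land in the middle.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_boundaries(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']  -- r1 and r2 end up at the boundaries
```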

SECTION 05

Long context strategies

Given the cost and quality trade-offs, here's when to use different approaches:

Use long context directly when: you need to reason over the relationships between many parts of a document (not just retrieve a fact), the document is <50K tokens (cost manageable), or you're prototyping and don't want to build a retrieval pipeline.

Use RAG instead when: the total corpus is too large to fit in context, you need to search across many documents, cost is a concern (>50K tokens per query gets expensive), or you need to cite specific sources.

Hybrid approach: retrieve the top-K relevant chunks, then pass them in the context. This combines the scalability of retrieval with the reasoning power of in-context processing.

Chunk-and-summarise: For very long documents, recursively summarise chunks and query the summaries. Loses detail but scales to arbitrarily long inputs.
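A recursive sketch of chunk-and-summarise, where `summarise` stands in for any LLM summarisation call; lengths here are characters for simplicity, whereas a real implementation would count tokens:

```python
def recursive_summarise(text, summarise, max_len=4_000):
    # Base case: the text already fits.
    if len(text) <= max_len:
        return text
    # Split, summarise each chunk, then recurse on the concatenation.
    chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    combined = "\n\n".join(summarise(chunk) for chunk in chunks)
    return recursive_summarise(combined, summarise, max_len)
```

This terminates as long as each summary is meaningfully shorter than its chunk; each level of recursion shrinks the text by roughly the compression ratio of `summarise`.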

SECTION 06

Model comparison

Model               Context   Notes
GPT-4o              128K      OpenAI flagship
Claude 3.5 Sonnet   200K      Best long-context quality
Gemini 1.5 Pro      1M        Largest available context
Llama 3.1 70B       128K      Best open-weight long context
Mistral 7B v0.3     32K       Long context in a small model
Qwen 2.5 72B        128K      Strong multilingual long context
SECTION 07

Gotchas

Input vs output tokens: the context window includes both input and output. If your input already occupies 100K tokens, little room is left for the response. Always reserve capacity: with a 128K window and a 120K-token input, the maximum output is 8K tokens.
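This arithmetic is worth wrapping in a small budgeting helper; the safety margin here is an arbitrary illustrative choice:

```python
def max_output_tokens(context_window, input_tokens, safety_margin=256):
    # Tokens left for generation after the input and a small buffer
    # (never negative, even if the input already exceeds the window).
    return max(0, context_window - input_tokens - safety_margin)

print(max_output_tokens(128_000, 120_000))  # 7744
print(max_output_tokens(128_000, 130_000))  # 0 -- request would fail
```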

Counting tokens: Use the model's tokenizer to count before sending. A request that exceeds the context limit will fail with an API error. For OpenAI models, use the tiktoken library. For others, use the HF tokenizer.

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
your_text = "How many tokens does this sentence use?"
token_count = len(enc.encode(your_text))
print(f"Token count: {token_count}")

Billing: you pay for the full context on every API call. With a 50K-token system prompt on a 128K-context model, every message costs at least 50K input tokens even if the user typed two words.

Context Window Utilization Strategies

Effective context window utilization balances the need to include relevant information with the practical limits of attention quality degradation at long contexts and the linear cost scaling of most LLM APIs. Strategic placement and selective inclusion of content maximizes useful reasoning output per token spent.

Strategy                 Mechanism                      Token Efficiency   Quality Impact
Front-loading key info   Put critical content first     Neutral            High (attention primacy)
RAG retrieval            Include only relevant chunks   High               High if precision is good
Summarization            Compress long inputs           High               Medium (information loss)
Prompt caching           Reuse a static prefix          High (cost)        None
Full context             Include everything             Low                Depends on model

The "lost in the middle" phenomenon, documented in research on long-context models, describes how LLMs tend to give less weight to information positioned in the middle of a long context compared to information at the beginning or end. For RAG systems that retrieve multiple documents, ordering retrieved chunks so that the most relevant content appears at the beginning or end of the context — rather than buried in the middle — can improve answer quality on the same retrieved content without any retrieval or model changes.

Prompt caching exploits the fact that many production applications use the same system prompt, instructions, or background document for many different user queries. Providers like Anthropic and Google cache the KV computation for repeated prompt prefixes, charging reduced rates for cache hits. Structuring prompts to place the stable content (system prompt, background documents) before the dynamic content (user query) maximizes cache hit rates and reduces both latency and cost proportionally to the length of the reused prefix.
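Provider cache APIs differ, but the structural rule is the same: stable content first. A provider-agnostic sketch (the function name and message shape are illustrative):

```python
def build_messages(system_prompt, background_docs, user_query):
    # Everything static goes into one prefix so that repeated calls
    # share the longest possible cacheable span; only the final
    # user message varies between requests.
    static_prefix = system_prompt + "\n\n" + "\n\n".join(background_docs)
    return [
        {"role": "system", "content": static_prefix},
        {"role": "user", "content": user_query},
    ]

m1 = build_messages("You are a helpful assistant.", ["<doc>"], "first question")
m2 = build_messages("You are a helpful assistant.", ["<doc>"], "second question")
assert m1[0] == m2[0]  # identical prefix -> eligible for a cache hit
```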

Needle-in-a-haystack tests evaluate whether a model can retrieve specific facts from various positions within a long context. These tests insert a short target fact (the "needle") at different depths in a large filler document (the "haystack") and ask the model to retrieve the fact, generating a heat map of retrieval accuracy versus position and context length. Models that appear strong on standard benchmarks can show significant accuracy degradation at specific context positions, providing a practical calibration tool for deciding what context lengths to trust for production use cases that require reliable information retrieval from long contexts.
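Generating the test prompts is straightforward; the filler text, needle, and depth grid below are illustrative:

```python
def build_haystack(needle, filler_sentences, depth):
    # Insert the needle at a fractional depth:
    # 0.0 = very start of the context, 1.0 = very end.
    idx = round(depth * len(filler_sentences))
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

filler = ["The sky was grey that morning."] * 100
needle = "The secret code is 7421."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(needle, filler, depth)
    # ...send `prompt` plus "What is the secret code?" to the model
    # and record accuracy per (depth, context_length) cell
```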

Multi-turn conversation context management requires deciding how much history to include in each turn's context window. Including all history preserves perfect conversation coherence but grows the context linearly with conversation length, eventually hitting the window limit and incurring increasing costs. Sliding window approaches include only the most recent N turns; summary-augmented approaches condense old turns into a summary that is prepended to the recent turns. For most conversational applications, 8–16 turns of verbatim history plus a brief summary of earlier conversation provides adequate coherence without unbounded context growth.
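A summary-augmented sliding window can be sketched as follows, with `summarise` again standing in for an LLM call:

```python
def build_history(turns, summarise, keep=12):
    # Keep the last `keep` turns verbatim; condense everything
    # older into a single summary message at the front.
    recent = turns[-keep:]
    older = turns[:-keep]
    if not older:
        return recent
    summary = summarise(older)
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

Until the conversation exceeds `keep` turns this is a no-op, so short conversations pay no summarisation cost.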

Context window utilization metrics help quantify whether a RAG pipeline is providing sufficient or excessive context. Tracking the token count distribution of the context passed to the LLM — and correlating it with answer quality scores — identifies whether quality improvements come from adding more context or whether the current context is already sufficient and quality improvements require better retrieval precision instead. Applications consistently using less than 20% of the available context window may benefit from increasing retrieved chunk count; applications consistently near the context limit should focus on retrieval quality rather than quantity.

Effective context window management in agentic pipelines requires pruning tool call results that are no longer relevant to the current task. As an agent executes multiple tool calls over a long task, the accumulated results of early calls may occupy significant context space while providing little value for the remaining steps. Summarizing or dropping stale tool results once their information has been incorporated into the agent's plan frees context capacity for fresh tool outputs and maintains the signal-to-noise ratio of the context window throughout a long execution.
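A pruning sketch over a chat-style message list; the `tool` role and placeholder text are assumptions about the message schema, not a specific provider's API:

```python
def prune_tool_results(messages, keep_last=3, placeholder="[tool result pruned]"):
    # Replace the content of all but the most recent `keep_last` tool
    # results; the messages themselves stay in place so the
    # call/response structure of the transcript is preserved.
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

A refinement would summarise each stale result instead of replacing it outright, trading a little context space for recoverable detail.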