Frontier Models

Gemini 1.5 Pro

Google's Gemini 1.5 Pro features a 1M token context window — the largest of any production model. Mixture-of-Experts architecture, native multimodal (text + image + audio + video), and strong coding and reasoning performance.

Context window: 1M tokens
Architecture: Mixture-of-Experts (MoE)
Multimodality: native (text, image, audio, video)

SECTION 01

Gemini 1.5 Pro overview

Gemini 1.5 Pro (released February 2024) is Google DeepMind's most capable publicly available model as of mid-2024. Its defining feature is a 1 million token context window — enough to fit over 700,000 words (most of the Harry Potter series), a full codebase, or about 10 hours of audio in a single prompt.

Architecture: a sparse Mixture-of-Experts transformer, similar in concept to Mixtral but at frontier scale. MoE allows high quality with lower inference cost per token than a comparably capable dense model. The architecture details are not fully disclosed.

Capabilities: text, images, audio, video, and documents are all native input modalities in a single unified model, rather than separate vision or audio models stitched together behind the scenes.

SECTION 02

The 1M context window

The 1M context window enables qualitatively new use cases that smaller windows don't support: question-answering over an entire codebase, analysis of full books or long legal documents, and reasoning over hours of audio or video in a single prompt.

In practice, "needle-in-a-haystack" benchmarks show Gemini 1.5 Pro maintains ~98% recall across the full 1M context, though reasoning quality on very long contexts can still degrade for complex multi-hop questions.
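Before relying on that window, it helps to check whether a document will fit at all. A minimal sketch, using a rough 4-characters-per-token heuristic for English text (for billable-accurate counts, the API's model.count_tokens() is the authoritative source):

```python
# Rough pre-flight check that a document fits the 1M-token window.
# The ~4 chars/token ratio is a heuristic for English prose, not exact.

CONTEXT_WINDOW = 1_000_000
CHARS_PER_TOKEN = 4  # rough heuristic

def fits_in_context(text: str, reserve_for_output: int = 8_192) -> bool:
    """Estimate whether `text` plus an output budget fits in the window."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOW

novel = "word " * 700_000  # ~700K words of filler, ~875K estimated tokens
print(fits_in_context(novel))  # True
```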

SECTION 03

Using the Gemini API

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Text generation
response = model.generate_content(
    "Explain the difference between RAG and fine-tuning in 3 bullet points."
)
print(response.text)

# With system instruction
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    system_instruction="You are a concise technical writer. Always use code examples.",
)

# Multi-turn chat
chat = model.start_chat()
r1 = chat.send_message("What is attention in transformers?")
r2 = chat.send_message("Now show me a NumPy implementation.")
print(r2.text)

# Generation config
response = model.generate_content(
    "Write a haiku about gradient descent.",
    generation_config=genai.types.GenerationConfig(
        temperature=0.9, max_output_tokens=100, top_p=0.95),
)

SECTION 04

Multimodal capabilities

import google.generativeai as genai
from PIL import Image
import requests
from io import BytesIO

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Image understanding
img = Image.open("architecture_diagram.png")
response = model.generate_content([
    img,
    "Describe this architecture diagram and identify any potential bottlenecks.",
])
print(response.text)

# PDF analysis (send raw bytes)
with open("research_paper.pdf", "rb") as f:
    pdf_data = f.read()

response = model.generate_content([
    {"mime_type": "application/pdf", "data": pdf_data},
    "What are the 3 main contributions of this paper?",
])

# Audio transcription + analysis
with open("meeting.mp3", "rb") as f:
    audio_data = f.read()

response = model.generate_content([
    {"mime_type": "audio/mp3", "data": audio_data},
    "Summarise this meeting and list all action items.",
])

SECTION 05

Gemini 1.5 Flash

Gemini 1.5 Flash is a smaller, faster, cheaper variant optimised for high-throughput tasks where cost matters more than maximum quality. It maintains the 1M context window and multimodal capabilities but runs faster and costs ~10× less than Pro.

model = genai.GenerativeModel("gemini-1.5-flash")

# Flash is good for:
# - Classification tasks (sentiment, routing)
# - Document summarisation
# - Quick Q&A over documents
# - High-volume pipelines

# Pricing comparison (approximate):
# gemini-1.5-pro:   $3.50/1M input tokens, $10.50/1M output
# gemini-1.5-flash: $0.35/1M input tokens, $1.05/1M output
# gemini-1.5-flash-8b: $0.037/1M input (cheapest in family)

response = model.generate_content("Classify this review as positive/negative: 'Great product!'")
print(response.text)  # Positive

SECTION 06

Context caching for cost efficiency

With a 1M context window, sending a large document with every request becomes expensive. Gemini's context caching lets you upload a document once, get a cache handle, and reuse it across many queries at a fraction of the cost. Note that cached content must meet a minimum size (32,768 tokens for the 1.5 models), so caching only pays off for genuinely large contexts.

import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key="your-api-key")

# Upload and cache a large document
with open("large_codebase.txt") as f:
    codebase = f.read()

cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro",
    contents=[{"role": "user", "parts": [{"text": codebase}]}],
    ttl=datetime.timedelta(hours=1),
    display_name="my-codebase-cache",
)

# Use the cache for multiple queries at reduced cost
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
r1 = model.generate_content("Find all TODO comments in this codebase.")
r2 = model.generate_content("What are the main entry points?")
r3 = model.generate_content("List all database queries.")
# The codebase tokens are only billed once

SECTION 07

Gotchas

Rate limits on long contexts: The 1M context is available but rate-limited heavily. Free tier allows only a few requests per minute with long contexts. Production use requires paid tier.

Context window ≠ context quality: Gemini 1.5 Pro can recall facts from 1M tokens, but complex reasoning over very long contexts still degrades. For multi-hop questions requiring information from many scattered locations, retrieval augmentation often beats raw long context.

Pricing with long contexts: At $3.50 per 1M input tokens, sending a 100K-token document with every query costs $0.35 per call before output tokens are counted. Budget this before building long-context pipelines. Context caching (see above) is essential for cost control.
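It is worth wiring that arithmetic into a helper so per-call costs are visible before a pipeline ships. A sketch using the approximate Pro prices quoted in this section:

```python
# Approximate gemini-1.5-pro prices from this section (USD per 1M tokens).
INPUT_PRICE = 3.50
OUTPUT_PRICE = 10.50

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a single generate_content call, in USD."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# A 100K-token document plus a short query, ~500 tokens of output:
print(call_cost(100_000, 500))  # ~0.355 USD per call
```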

API vs Vertex AI: The Google AI Studio API (generativelanguage.googleapis.com) is simpler for prototyping. Vertex AI is recommended for production: better SLAs, IAM-based auth, data residency controls, and enterprise support.
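For the production path, the Vertex AI SDK serves the same models behind IAM auth instead of API keys. A minimal sketch, assuming the google-cloud-aiplatform package is installed and credentials are configured; the project ID and region below are placeholders:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project/region. Auth comes from Application Default
# Credentials (gcloud auth application-default login), not an API key.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content("Summarise the benefits of IAM-based auth.")
print(response.text)
```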

Gemini 1.5 Family Comparison

Google's Gemini 1.5 family introduced native 1M+ token context windows as a flagship capability, enabling use cases like full codebase analysis, complete book processing, and hour-long video understanding that were impractical with context-limited predecessors. The family balances context capability with deployment flexibility through Pro and Flash variants. (Pro launched with a 1M-token window and was later expanded to 2M; the sections above describe the 1M configuration.)

Model                  Context window   Modalities                  Speed     Best for
Gemini 1.5 Pro         2M tokens        Text, image, video, audio   Medium    Complex long-context tasks
Gemini 1.5 Flash       1M tokens        Text, image, video, audio   Fast      High-volume production
Gemini 1.5 Flash-8B    1M tokens        Text, image                 Fastest   Cost-sensitive workloads

Gemini 1.5's multimodal long-context capability extends beyond text to native video and audio understanding within the same context window. A one-hour video submitted as input is processed as approximately 1 million tokens (at ~1 frame per second plus audio transcription), enabling questions that require synthesizing information across the full video timeline. This native multimodal long context eliminates the need for separate video segmentation, transcription, and retrieval pipelines that were necessary to make long-form video queryable with previous generation models.
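The per-hour figure can be sanity-checked from the sampling rates in Gemini's documentation: roughly 258 tokens per sampled frame at 1 frame per second, plus about 32 tokens per second of audio. Treat both figures as approximations; exact tokenisation is model-internal:

```python
# Approximate per-second token costs for video input (estimates).
TOKENS_PER_FRAME = 258     # one frame sampled per second of video
AUDIO_TOKENS_PER_SEC = 32  # accompanying audio track

def video_tokens(seconds: int) -> int:
    """Estimated input tokens for a video of the given duration."""
    return seconds * (TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SEC)

one_hour = video_tokens(3600)
print(one_hour)  # 1044000, roughly the 1M-tokens-per-hour figure above
```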

Context caching in Gemini 1.5 provides significant cost savings for applications that repeatedly use the same large context — a full codebase, a long legal document, a complete product catalog. The cache stores the processed KV state for a specified prefix, charged at a reduced rate for subsequent requests that reuse the cached prefix. For applications where the document corpus is static but queries change frequently, context caching can reduce per-query costs by 75–90% compared to re-processing the full context on every request.

Gemini 1.5's performance on needle-in-a-haystack retrieval tasks at 1M+ token contexts established a new benchmark for long-context recall quality. While earlier long-context models showed significant accuracy degradation when the target information was located in the middle of a very long context, Gemini 1.5 Pro maintained high recall accuracy across the full context length. This capability enables use cases like "find all instances of compliance violations in this 500-page contract" or "trace every reference to this function across the full codebase" that require reliable retrieval from contexts too long for any chunking-based RAG approach.
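A home-grown version of that benchmark is easy to sketch: plant a known fact at a controlled depth inside filler text, then ask the model to retrieve it. The filler sentence and needle below are arbitrary test fixtures:

```python
def build_haystack(needle: str, depth: float, target_chars: int = 400_000) -> str:
    """Embed `needle` at a fractional depth (0.0 = start, 1.0 = end) of filler text."""
    filler_unit = "The quick brown fox jumps over the lazy dog. "
    n_units = target_chars // len(filler_unit)
    units = [filler_unit] * n_units
    units.insert(int(depth * n_units), needle + " ")
    return "".join(units)

haystack = build_haystack("The secret launch code is 7-4-1-9.", depth=0.5)

# Retrieval check (requires a configured client):
# response = model.generate_content([haystack, "What is the secret launch code?"])
# Correct answers across many depth/length pairs are what ~98% recall measures.
```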

Gemini 1.5's function calling and code execution capabilities, combined with its long context window, enable a powerful workflow for data analysis on large datasets. A full CSV file (up to millions of rows) can be uploaded as context, and the model can write and execute Python analysis code, interpret the results, and iterate on the analysis in a single session without external tool calls. This inline code execution approach, supported through the Gemini API's code interpreter tool, significantly reduces the infrastructure complexity of building AI-driven data analysis products compared to architectures requiring separate code execution sandboxes and data transfer pipelines.
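A minimal sketch of that inline workflow using the google-generativeai SDK's built-in code-execution tool; the CSV filename is a placeholder, and tool availability depends on SDK version:

```python
import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Enable the hosted Python sandbox via the built-in code-execution tool.
model = genai.GenerativeModel("gemini-1.5-pro", tools="code_execution")

with open("sales_data.csv") as f:  # placeholder dataset
    csv_text = f.read()

response = model.generate_content([
    csv_text,
    "Compute monthly revenue totals, then write and run Python to find "
    "the three months with the highest growth rate.",
])
print(response.text)  # interleaves generated code with its executed output
```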

Pricing for Gemini 1.5 Pro at the 2M token context scale requires careful cost modeling. Long context requests are priced per token including all input tokens, meaning a 2M token context submitted with a 100-token query incurs the cost of 2,000,100 input tokens. Applications that repeatedly query the same large document benefit enormously from context caching, which charges the initial context at full price but subsequent queries reusing the same context at approximately 25% of the standard input token price. Modeling the query frequency against the cache invalidation rate determines the optimal caching strategy for a given application pattern.
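That trade-off is easy to model. A sketch comparing re-sending a 2M-token context on every call against caching it at the ~25% reuse rate described above (cache storage fees, billed per token-hour, are omitted for simplicity):

```python
INPUT_PRICE = 3.50 / 1_000_000  # USD per input token (gemini-1.5-pro)
CACHED_RATE = 0.25              # cached tokens billed at ~25% of standard

def uncached_cost(context_tokens: int, query_tokens: int, n_queries: int) -> float:
    """Re-send the full context with every query."""
    return n_queries * (context_tokens + query_tokens) * INPUT_PRICE

def cached_cost(context_tokens: int, query_tokens: int, n_queries: int) -> float:
    """First query pays full price; later queries reuse the cached context."""
    first = (context_tokens + query_tokens) * INPUT_PRICE
    rest = (n_queries - 1) * (context_tokens * CACHED_RATE + query_tokens) * INPUT_PRICE
    return first + rest

ctx, q, n = 2_000_000, 100, 50
print(uncached_cost(ctx, q, n))  # ~350 USD
print(cached_cost(ctx, q, n))    # ~93 USD
```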

Gemini 1.5's system instruction feature functions similarly to the system prompt in OpenAI and Anthropic APIs, providing persistent role and behavior guidance that precedes the conversation history in every request. For applications that use the same Gemini configuration across thousands of requests, placing the system instructions in a cached context prefix reduces input token costs for the instruction portion. The system instruction's position before any user content ensures it receives high-weight processing in the model's attention, maintaining consistent behavior across long conversations even when the user content fills most of the context window.
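Combining the two features looks like the sketch below: the system instruction is attached to the cached content, so every request reusing the cache inherits it. Cached content must meet a minimum size (32K tokens for the 1.5 models), so in practice the instructions ride along with a large shared prefix; the instruction text and style-guide placeholder here are illustrative:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="your-api-key")

# Cache the persistent instructions together with a large shared prefix
# (the prefix must push the cache past the 32K-token minimum).
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro",
    system_instruction="You are a concise technical writer. Always use code examples.",
    contents=[{"role": "user", "parts": [{"text": "Shared style guide goes here."}]}],
    ttl=datetime.timedelta(hours=2),
    display_name="shared-instruction-cache",
)

# Every request through this model reuses both the prefix and the instruction.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Explain KV caching in two sentences.")
print(response.text)
```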