How LLMs see images — CLIP, LLaVA, GPT-4V, multimodal architectures, and production patterns
Three approaches to combining vision and language: (1) late fusion (encode image and text separately, combine only at the decision stage), (2) cross-attention (the text decoder cross-attends to image features, as in Flamingo), (3) token projection (map image patches into the LLM's token embedding space)
Modern VLMs mostly use token projection: image → vision encoder → linear projection → LLM token sequence
Vision encoder: typically ViT (Vision Transformer) — split image into 14×14 or 16×16 pixel patches, embed each patch, apply transformer encoder
Typical pipeline: image → ViT → 576–2048 image tokens → linear projector → concat with text tokens → LLM → text output
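A shape-level sketch of this pipeline in numpy — random placeholder weights, not trained parameters; 576 patches matches CLIP ViT-L/14 at 336px ((336/14)² = 576), and the 4096-dim LLM embedding space is an assumed size:

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIT, D_LLM, N_PATCH, N_TEXT = 1024, 4096, 576, 32

# Visual features as they come out of the vision encoder: one vector per patch
visual_feats = rng.standard_normal((N_PATCH, D_VIT))
W1 = rng.standard_normal((D_VIT, D_LLM)) * 0.02
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02

def project(x):
    """2-layer MLP projector: visual features -> LLM token space."""
    h = np.maximum(x @ W1, 0.0)  # real projectors use GELU; ReLU keeps the sketch simple
    return h @ W2

image_tokens = project(visual_feats)                     # (576, 4096)
text_tokens = rng.standard_normal((N_TEXT, D_LLM))       # stand-in for embedded text
llm_input = np.concatenate([image_tokens, text_tokens])  # (608, 4096) fed to the LLM
```

The point to notice: the image contributes 576 "tokens" of ordinary LLM-input width, which is why image resolution translates directly into context usage.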
| Component | Example | Role |
|---|---|---|
| Vision encoder | CLIP ViT-L/14, SigLIP | Image → visual features |
| Projector | MLP (2 layers) | Visual features → LLM token space |
| LLM backbone | Llama, Mistral, Qwen | Text + visual tokens → text |
| Training | SFT on image-text pairs | Align visual and text representations |
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021): train image and text encoders jointly with contrastive loss — matching image-text pairs are pulled together, non-matching pairs are pushed apart
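A minimal numpy sketch of the symmetric contrastive (InfoNCE) objective — a toy single-batch version, not OpenAI's implementation:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched pairs.

    Row i of img_emb and row i of txt_emb are a matching pair; every
    other combination in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); diagonal = matching pairs
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Correctly paired embeddings should score a much lower loss than shuffled ones, which is the "pull together / push apart" behavior in action.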
Training data: 400M (image, text caption) pairs from the web
Zero-shot: classify images by computing similarity to text labels ("a photo of a {cat, dog, car}") — no task-specific training needed
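Zero-shot classification reduces to nearest-neighbor search in the shared embedding space. A toy sketch, with hypothetical vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    return labels[int(np.argmax(sims))], sims

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Toy embeddings: each label is a basis vector, and the image embedding
# is constructed to lie closest to the "cat" direction.
label_embs = np.eye(3)
image_emb = np.array([0.9, 0.1, 0.05])
best, sims = zero_shot_classify(image_emb, label_embs, labels)
```

In practice both sides would be encoded by the trained CLIP image and text towers; no classifier head is ever trained.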
LLaVA (Liu et al. 2023): connect a CLIP vision encoder to Vicuna (a LLaMA-based chat model) via a simple linear projector. Fine-tune on image-instruction-response triples generated by GPT-4.
LLaVA-1.5: higher-res 336×336 input, a 2-layer MLP projector in place of the linear layer, and added academic VQA training data → major quality jump over the original LLaVA (tiled "AnyRes" input arrived later, in LLaVA-1.6/NeXT)
| Model | Vision encoder | LLM backbone | Context | Strengths |
|---|---|---|---|---|
| LLaVA-1.6 (34B) | CLIP ViT-L | Nous-Hermes-Yi-34B | 4K | Strong open-source |
| LLaVA-NeXT-Video | CLIP ViT-L | Mistral-7B | 4K | Video understanding |
| InternVL2 | InternViT-6B | InternLM2 | 8K | OCR, charts, documents |
| Qwen2-VL | ViT custom | Qwen2-7B/72B | 32K | Long video, multilingual |
| Phi-3.5-vision | CLIP ViT | Phi-3.5-mini | 128K | Efficient, on-device |
| PaliGemma | SigLIP | Gemma | 8K | Strong on benchmarks |
GPT-4V / GPT-4o: low-detail mode costs a flat ~85 tokens per image; high-detail mode resizes the image and tiles it into 512×512 tiles at ~170 tokens each plus the 85-token base. Best-in-class for complex visual reasoning, OCR, and chart understanding.
Claude 3.5 Sonnet (vision): strong on document understanding, diagrams, screenshots, and code in images. 200K-token context with vision.
Gemini 1.5 Pro: native multimodal — video frames, audio, images, text in one context. 1M token context handles hour-long videos.
| Model | Max image size | Video | OCR | Chart | Diagram | $/image |
|---|---|---|---|---|---|---|
| GPT-4o | 2048px tiled | ✗ | ✓✓ | ✓✓ | ✓✓ | ~$0.002 |
| Claude 3.5 Sonnet | 8K px | ✗ | ✓✓ | ✓✓ | ✓✓ | ~$0.003 |
| Gemini 1.5 Pro | Native | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ~$0.002 |
| Gemini 1.5 Flash | Native | ✓ | ✓ | ✓ | ✓ | ~$0.0001 |
The resolution-cost tradeoff: more pixels → more image tokens → more LLM input tokens → more cost and latency
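A token estimator following OpenAI's published sizing rules for GPT-4o high-detail images (fit within 2048×2048, downscale the shortest side to 768, then 85-token base + 170 per 512px tile). These rules can change between model versions, so verify against the current docs:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost under OpenAI's documented sizing rules.

    low detail: flat 85 tokens regardless of size.
    high detail: fit within 2048x2048, downscale so the shortest side
    is at most 768, then 170 tokens per 512px tile plus an 85-token base.
    """
    if detail == "low":
        return 85
    # Fit within a 2048x2048 square (downscale only)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Downscale so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```

Example: a 1024×1024 image downscales to 768×768, which is 4 tiles → 765 tokens; dropping to low detail cuts that to 85 at the cost of fine-grained reading.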
Tiled encoding: split high-res image into tiles (e.g., 2×2 grid of 336×336 tiles), encode each tile separately, concatenate tokens
Dynamic resolution (InternVL, Qwen2-VL): select a tile grid matching the image's aspect ratio — portrait images get a tall grid (e.g., 1 column × 3 rows), landscape images a wide one (3 columns × 1 row)
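A simplified sketch of the grid-selection idea — a hypothetical version for illustration, not either model's actual algorithm:

```python
def pick_tile_grid(width: int, height: int, max_tiles: int = 6) -> tuple[int, int]:
    """Choose a (cols, rows) tile grid whose aspect ratio best matches
    the image, subject to a tile budget. Portrait images end up with
    tall grids, landscape images with wide ones.
    """
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs((cols / rows) - (width / height))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best
```

Each selected tile is then resized to the encoder's native resolution (e.g., 336×336) and encoded separately, so the tile budget directly bounds the image token count.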
Document intelligence: extract text, tables, form fields from scanned PDFs — VLMs outperform OCR + NLP pipelines for complex layouts
Chart and diagram understanding: read graphs, flowcharts, architecture diagrams — Claude 3.5 and GPT-4o excel
Visual RAG: embed images with CLIP/SigLIP, retrieve by semantic image-text similarity, pass top-k images to VLM for generation
Screenshot analysis: UI testing, automated QA, accessibility auditing
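The retrieval step of visual RAG is plain cosine-similarity search. A toy sketch with placeholder vectors standing in for CLIP/SigLIP embeddings:

```python
import numpy as np

def retrieve_images(query_emb, image_embs, k=2):
    """Return indices of the top-k images by cosine similarity to the query.

    In a real pipeline both sides come from a shared image-text encoder
    (CLIP or SigLIP), and the corpus would live in a vector store.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k]

# Toy corpus: image 2 points almost exactly along the query direction,
# image 0 is second-closest, image 1 is nearly orthogonal.
image_embs = np.array([[0.8, 0.2], [0.1, 0.9], [1.0, 0.1]])
query = np.array([1.0, 0.0])
top = retrieve_images(query, image_embs, k=2)
```

The retrieved images (not their embeddings) are then passed to the VLM as regular image inputs for generation.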
import base64
from pathlib import Path
from pydantic import BaseModel
from openai import OpenAI
client = OpenAI()
def encode_image(path: str) -> str:
"""Encode image file to base64 for API submission."""
return base64.b64encode(Path(path).read_bytes()).decode()
def caption_image(image_path: str) -> str:
"""Generate a detailed caption for an image."""
b64 = encode_image(image_path)
return client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{b64}",
"detail": "high"}}, # "low" = faster, cheaper
{"type": "text",
"text": "Describe this image in detail. Include objects, text, colors, and layout."}
]
}],
max_tokens=512
).choices[0].message.content
class DocumentData(BaseModel):
title: str
date: str | None
key_figures: list[str]
summary: str
def extract_document_data(image_path: str) -> DocumentData:
"""Extract structured data from a document image."""
b64 = encode_image(image_path)
result = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
{"type": "text",
"text": "Extract structured data from this document."}
]
}],
response_format=DocumentData,
temperature=0.0
)
return result.choices[0].message.parsed
# Usage
caption = caption_image("chart.jpg")
print(f"Caption: {caption[:100]}")
doc = extract_document_data("invoice.jpg")
print(f"Title: {doc.title}, Date: {doc.date}")
| Benchmark | What it tests | SOTA model |
|---|---|---|
| MMBench | General visual understanding (MCQ) | GPT-4o |
| OCRBench | Text recognition in images | InternVL2 |
| ChartQA | Chart comprehension | Claude 3.5 |
| DocVQA | Document visual QA | GPT-4o |
| MMMU | College-level multi-discipline | GPT-4o |
| Video-MME | Long video understanding | Gemini 1.5 Pro |
Test on your actual image types: scans vs photos vs screenshots differ significantly
Test at your target resolution: low-res "fast" mode may not work for dense documents
Evaluate OCR accuracy separately from reasoning — a model may fail at text extraction but excel at interpretation
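For the OCR-vs-reasoning split, character error rate (Levenshtein distance divided by reference length) is a simple metric for scoring extraction on its own:

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length.

    0.0 = perfect extraction; values near or above 1.0 mean the model
    effectively failed to read the text, whatever its interpretation says.
    """
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[n] / max(m, 1)
```

Run it on ground-truth transcriptions of your own scans to separate "can't read the page" failures from "read it but reasoned badly" ones.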