How LLMs see images — CLIP, LLaVA, GPT-4V, multimodal architectures, and production patterns
Three approaches to combining vision and language: (1) late fusion (encode image and text separately, combine only at the decision stage), (2) cross-attention (the text decoder cross-attends to image features, as in Flamingo), (3) token projection (map image patches into the LLM's token embedding space)
Modern VLMs mostly use token projection: image → vision encoder → linear projection → LLM token sequence
Vision encoder: typically ViT (Vision Transformer) — split image into 14×14 or 16×16 pixel patches, embed each patch, apply transformer encoder
Typical pipeline: image → ViT → 576–2048 image tokens → linear projector → concat with text tokens → LLM → text output
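A shape-level sketch of this pipeline in numpy — random placeholder weights, not trained parameters; 576 patches matches CLIP ViT-L/14 at 336px ((336/14)² = 576), and the 4096-dim LLM embedding space is an assumed size:

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIT, D_LLM, N_PATCH, N_TEXT = 1024, 4096, 576, 32

# Visual features as they come out of the vision encoder: one vector per patch
visual_feats = rng.standard_normal((N_PATCH, D_VIT))
W1 = rng.standard_normal((D_VIT, D_LLM)) * 0.02
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02

def project(x):
    """2-layer MLP projector: visual features -> LLM token space."""
    h = np.maximum(x @ W1, 0.0)  # real projectors use GELU; ReLU keeps the sketch simple
    return h @ W2

image_tokens = project(visual_feats)                     # (576, 4096)
text_tokens = rng.standard_normal((N_TEXT, D_LLM))       # stand-in for embedded text
llm_input = np.concatenate([image_tokens, text_tokens])  # (608, 4096) fed to the LLM
```

The point to notice: the image contributes 576 "tokens" of ordinary LLM-input width, which is why image resolution translates directly into context usage.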
| Component | Example | Role |
|---|---|---|
| Vision encoder | CLIP ViT-L/14, SigLIP | Image → visual features |
| Projector | MLP (2 layers) | Visual features → LLM token space |
| LLM backbone | Llama, Mistral, Qwen | Text + visual tokens → text |
| Training | SFT on image-text pairs | Align visual and text representations |
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021): train image and text encoders jointly with contrastive loss — matching image-text pairs are pulled together, non-matching pairs are pushed apart
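A minimal numpy sketch of the symmetric contrastive (InfoNCE) objective — a toy single-batch version, not OpenAI's implementation:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched pairs.

    Row i of img_emb and row i of txt_emb are a matching pair; every
    other combination in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); diagonal = matching pairs
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Correctly paired embeddings should score a much lower loss than shuffled ones, which is the "pull together / push apart" behavior in action.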
Training data: 400M (image, text caption) pairs from the web
Zero-shot: classify images by computing similarity to text labels ("a photo of a {cat, dog, car}") — no task-specific training needed
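Zero-shot classification reduces to nearest-neighbor search in the shared embedding space. A toy sketch, with hypothetical vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    return labels[int(np.argmax(sims))], sims

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Toy embeddings: each label is a basis vector, and the image embedding
# is constructed to lie closest to the "cat" direction.
label_embs = np.eye(3)
image_emb = np.array([0.9, 0.1, 0.05])
best, sims = zero_shot_classify(image_emb, label_embs, labels)
```

In practice both sides would be encoded by the trained CLIP image and text towers; no classifier head is ever trained.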
LLaVA (Liu et al. 2023): connect a CLIP vision encoder to Vicuna (a LLaMA-based chat model) via a simple linear projector. Fine-tune on image-instruction-response triples generated by GPT-4.
LLaVA-1.5: higher-res 336×336 input, a 2-layer MLP projector in place of the linear layer, and added academic VQA training data → major quality jump over the original LLaVA (tiled "AnyRes" input arrived later, in LLaVA-1.6/NeXT)
| Model | Vision encoder | LLM backbone | Context | Strengths |
|---|---|---|---|---|
| LLaVA-1.6 (34B) | CLIP ViT-L | Nous-Hermes-Yi-34B | 4K | Strong open-source |
| LLaVA-NeXT-Video | CLIP ViT-L | Mistral-7B | 4K | Video understanding |
| InternVL2 | InternViT-6B | InternLM2 | 8K | OCR, charts, documents |
| Qwen2-VL | ViT custom | Qwen2-7B/72B | 32K | Long video, multilingual |
| Phi-3.5-vision | CLIP ViT | Phi-3.5-mini | 128K | Efficient, on-device |
| PaliGemma | SigLIP | Gemma | 8K | Strong on benchmarks |
GPT-4V / GPT-4o: low-detail mode costs a flat ~85 tokens per image; high-detail mode resizes the image and tiles it into 512×512 tiles at ~170 tokens each plus the 85-token base. Best-in-class for complex visual reasoning, OCR, and chart understanding.
Claude 3.5 Sonnet (vision): strong on document understanding, diagrams, screenshots, and code in images. 200K-token context with vision.
Gemini 1.5 Pro: native multimodal — video frames, audio, images, text in one context. 1M token context handles hour-long videos.
| Model | Max image size | Video | OCR | Chart | Diagram | $/image |
|---|---|---|---|---|---|---|
| GPT-4o | 2048px tiled | ✗ | ✓✓ | ✓✓ | ✓✓ | ~$0.002 |
| Claude 3.5 Sonnet | 8K px | ✗ | ✓✓ | ✓✓ | ✓✓ | ~$0.003 |
| Gemini 1.5 Pro | Native | ✓✓ | ✓✓ | ✓✓ | ✓✓ | ~$0.002 |
| Gemini 1.5 Flash | Native | ✓ | ✓ | ✓ | ✓ | ~$0.0001 |
The resolution-cost tradeoff: more pixels → more image tokens → more LLM input tokens → more cost and latency
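A token estimator following OpenAI's published sizing rules for GPT-4o high-detail images (fit within 2048×2048, downscale the shortest side to 768, then 85-token base + 170 per 512px tile). These rules can change between model versions, so verify against the current docs:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost under OpenAI's documented sizing rules.

    low detail: flat 85 tokens regardless of size.
    high detail: fit within 2048x2048, downscale so the shortest side
    is at most 768, then 170 tokens per 512px tile plus an 85-token base.
    """
    if detail == "low":
        return 85
    # Fit within a 2048x2048 square (downscale only)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Downscale so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```

Example: a 1024×1024 image downscales to 768×768, which is 4 tiles → 765 tokens; dropping to low detail cuts that to 85 at the cost of fine-grained reading.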
Tiled encoding: split high-res image into tiles (e.g., 2×2 grid of 336×336 tiles), encode each tile separately, concatenate tokens
Dynamic resolution (InternVL, Qwen2-VL): select a tile grid matching the image's aspect ratio — portrait images get a tall grid (e.g., 1 column × 3 rows), landscape images a wide one (3 columns × 1 row)
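A simplified sketch of the grid-selection idea — a hypothetical version for illustration, not either model's actual algorithm:

```python
def pick_tile_grid(width: int, height: int, max_tiles: int = 6) -> tuple[int, int]:
    """Choose a (cols, rows) tile grid whose aspect ratio best matches
    the image, subject to a tile budget. Portrait images end up with
    tall grids, landscape images with wide ones.
    """
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs((cols / rows) - (width / height))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best
```

Each selected tile is then resized to the encoder's native resolution (e.g., 336×336) and encoded separately, so the tile budget directly bounds the image token count.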
Document intelligence: extract text, tables, form fields from scanned PDFs — VLMs outperform OCR + NLP pipelines for complex layouts
Chart and diagram understanding: read graphs, flowcharts, architecture diagrams — Claude 3.5 and GPT-4o excel
Visual RAG: embed images with CLIP/SigLIP, retrieve by semantic image-text similarity, pass top-k images to VLM for generation
Screenshot analysis: UI testing, automated QA, accessibility auditing
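The retrieval step of visual RAG is plain cosine-similarity search. A toy sketch with placeholder vectors standing in for CLIP/SigLIP embeddings:

```python
import numpy as np

def retrieve_images(query_emb, image_embs, k=2):
    """Return indices of the top-k images by cosine similarity to the query.

    In a real pipeline both sides come from a shared image-text encoder
    (CLIP or SigLIP), and the corpus would live in a vector store.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k]

# Toy corpus: image 2 points almost exactly along the query direction,
# image 0 is second-closest, image 1 is nearly orthogonal.
image_embs = np.array([[0.8, 0.2], [0.1, 0.9], [1.0, 0.1]])
query = np.array([1.0, 0.0])
top = retrieve_images(query, image_embs, k=2)
```

The retrieved images (not their embeddings) are then passed to the VLM as regular image inputs for generation.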
import base64
from pathlib import Path
from pydantic import BaseModel
from openai import OpenAI
client = OpenAI()
def encode_image(path: str) -> str:
"""Encode image file to base64 for API submission."""
return base64.b64encode(Path(path).read_bytes()).decode()
def caption_image(image_path: str) -> str:
"""Generate a detailed caption for an image."""
b64 = encode_image(image_path)
return client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{b64}",
"detail": "high"}}, # "low" = faster, cheaper
{"type": "text",
"text": "Describe this image in detail. Include objects, text, colors, and layout."}
]
}],
max_tokens=512
).choices[0].message.content
class DocumentData(BaseModel):
title: str
date: str | None
key_figures: list[str]
summary: str
def extract_document_data(image_path: str) -> DocumentData:
"""Extract structured data from a document image."""
b64 = encode_image(image_path)
result = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
{"type": "text",
"text": "Extract structured data from this document."}
]
}],
response_format=DocumentData,
temperature=0.0
)
return result.choices[0].message.parsed
# Usage
caption = caption_image("chart.jpg")
print(f"Caption: {caption[:100]}")
doc = extract_document_data("invoice.jpg")
print(f"Title: {doc.title}, Date: {doc.date}")
| Benchmark | What it tests | SOTA model |
|---|---|---|
| MMBench | General visual understanding (MCQ) | GPT-4o |
| OCRBench | Text recognition in images | InternVL2 |
| ChartQA | Chart comprehension | Claude 3.5 |
| DocVQA | Document visual QA | GPT-4o |
| MMMU | College-level multi-discipline | GPT-4o |
| Video-MME | Long video understanding | Gemini 1.5 Pro |
Test on your actual image types: scans vs photos vs screenshots differ significantly
Test at your target resolution: low-res "fast" mode may not work for dense documents
Evaluate OCR accuracy separately from reasoning — a model may fail at text extraction but excel at interpretation
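For the OCR-vs-reasoning split, character error rate (Levenshtein distance divided by reference length) is a simple metric for scoring extraction on its own:

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length.

    0.0 = perfect extraction; values near or above 1.0 mean the model
    effectively failed to read the text, whatever its interpretation says.
    """
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[n] / max(m, 1)
```

Run it on ground-truth transcriptions of your own scans to separate "can't read the page" failures from "read it but reasoned badly" ones.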