Multimodal AI

Vision-Language Models

How LLMs see images — CLIP, LLaVA, GPT-4V, multimodal architectures, and production patterns

  • The pipeline: image → tokens → LLM
  • The foundational method: CLIP contrastive training
  • Typical cost per image: ~1K–2K image tokens (high detail)
Contents
  1. The multimodal architecture
  2. CLIP: the foundation
  3. LLaVA and open VLMs
  4. Frontier VLMs
  5. Image tokenization and resolution
  6. Use cases and pipeline patterns
  7. Evaluation and benchmarks
01 — How Vision-Language Models Work

The Multimodal Architecture

Three approaches to combining vision and language: (1) late fusion (encode separately, combine at the decision layer), (2) cross-attention (the text decoder attends to image features, as in Flamingo), (3) token projection (map image patches into the LLM token space)

Modern VLMs mostly use token projection: image → vision encoder → linear projection → LLM token sequence

Vision encoder: typically ViT (Vision Transformer) — split image into 14×14 or 16×16 pixel patches, embed each patch, apply transformer encoder

Typical pipeline: image → ViT → 576–2048 image tokens → linear projector → concat with text tokens → LLM → text output
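The pipeline above can be sketched at the shape level. This is a toy NumPy sketch, not any particular model: the dimensions are shrunk for illustration (real models use e.g. 1024-d vision features and 4096-d LLM embeddings), and the 2-layer MLP projector uses ReLU where production projectors typically use GELU.

```python
import numpy as np

# Toy dims: ViT-L/14 on a 336×336 image yields (336/14)^2 = 576 patches;
# vision_dim and llm_dim are shrunk from realistic values for illustration
num_patches, vision_dim, llm_dim = 576, 64, 128
rng = np.random.default_rng(0)
W1 = rng.standard_normal((vision_dim, llm_dim))
W2 = rng.standard_normal((llm_dim, llm_dim))

def project(feats: np.ndarray) -> np.ndarray:
    """2-layer MLP projector: visual features -> LLM embedding space."""
    return np.maximum(feats @ W1, 0) @ W2

visual_feats = rng.standard_normal((num_patches, vision_dim))  # from the ViT
image_tokens = project(visual_feats)                     # (576, 128)
text_tokens = rng.standard_normal((32, llm_dim))         # embedded text prompt
llm_input = np.concatenate([image_tokens, text_tokens])  # (608, 128) -> LLM
```

The LLM then processes the concatenated sequence exactly as it would a text-only prompt; the image tokens are just extra embeddings at the front.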

VLM Architecture Components

| Component      | Example                 | Role                                   |
|----------------|-------------------------|----------------------------------------|
| Vision encoder | CLIP ViT-L/14, SigLIP   | Image → visual features                |
| Projector      | MLP (2 layers)          | Visual features → LLM token space      |
| LLM backbone   | Llama, Mistral, Qwen    | Text + visual tokens → text            |
| Training       | SFT on image-text pairs | Align visual and text representations  |
02 — Zero-Shot Vision

CLIP: The Foundation

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021): train image and text encoders jointly with contrastive loss — matching image-text pairs are pulled together, non-matching pairs are pushed apart

Training data: 400M (image, text caption) pairs from the web
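The contrastive objective can be sketched in a few lines: embed a batch of matched (image, caption) pairs, compute all pairwise similarities, and apply a symmetric cross-entropy whose targets are the diagonal. This is a toy NumPy sketch with random vectors standing in for encoder outputs; real CLIP uses very large batches (~32K) and a learned temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8  # toy sizes for illustration
img = rng.standard_normal((batch, dim))  # stand-ins for image-encoder outputs
txt = rng.standard_normal((batch, dim))  # stand-ins for text-encoder outputs

# L2-normalize, then scaled cosine similarity (0.07 is a typical temperature)
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
logits = img @ txt.T / 0.07  # entry (i, j): similarity of image i and caption j

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())

# Matching pairs sit on the diagonal; the loss is symmetric over rows
# (image -> caption) and columns (caption -> image)
targets = np.arange(batch)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch, and vice versa.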

Zero-shot: classify images by computing similarity to text labels ("a photo of a {cat, dog, car}") — no task-specific training needed

CLIP Zero-Shot Classification

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")
labels = ["a cat", "a dog", "a car", "a landscape"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probabilities over labels
print({label: f"{prob.item():.1%}" for label, prob in zip(labels, probs[0])})
CLIP image embeddings are useful beyond classification — for semantic image search, clustering, and as features for downstream models. SigLIP (Google) is a stronger modern alternative.
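The search use case can be sketched as plain cosine similarity over precomputed embeddings. In practice the vectors would come from `model.get_image_features` and `model.get_text_features` on the CLIP model above; here random stand-in matrices keep the sketch self-contained.

```python
import numpy as np

def search(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 3):
    """Rank precomputed image embeddings against a query by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    M = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = M @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))  # (index, score) pairs

# Example with stand-in embeddings: row 1 is the exact match for the query
embs = np.eye(4)
print(search(np.array([0.0, 1.0, 0.0, 0.0]), embs))
```

Because CLIP places images and text in the same embedding space, the same function works for text-to-image search (text query embedding vs. image embeddings) and image-to-image search.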
03 — Open-Source Models

LLaVA and Open VLMs

LLaVA (Liu et al. 2023): connect a CLIP vision encoder to LLaMA via a single linear projection layer. Fine-tune on image-instruction-response triples generated by GPT-4.

LLaVA-1.5: higher-res input (336×336), a 2-layer MLP projector (vs. linear), and stronger instruction-tuning data (academic VQA tasks) → major quality jump over the original LLaVA

Open VLMs Comparison

| Model            | Vision encoder | LLM backbone       | Context | Strengths                 |
|------------------|----------------|--------------------|---------|---------------------------|
| LLaVA-1.6 (34B)  | CLIP ViT-L     | Nous-Hermes-Yi-34B | 4K      | Strong open-source        |
| LLaVA-NeXT-Video | CLIP ViT-L     | Mistral-7B         | 4K      | Video understanding       |
| InternVL2        | InternViT-6B   | InternLM2          | 8K      | OCR, charts, documents    |
| Qwen2-VL         | ViT custom     | Qwen2-7B/72B       | 32K     | Long video, multilingual  |
| Phi-3.5-vision   | CLIP ViT       | Phi-3.5-mini       | 128K    | Efficient, on-device      |
| PaliGemma        | SigLIP         | Gemma              | 8K      | Strong on benchmarks      |
04 — Closed-Source Frontier

Frontier VLMs: GPT-4V, Claude, Gemini

GPT-4V / GPT-4o: images cost a flat 85 tokens at low detail, or 85 tokens plus 170 per 512-px tile at high detail (typically ~765–1,445 tokens per image). Best-in-class for complex visual reasoning, OCR, chart understanding.

Claude 3.5 Vision: strong document understanding, diagrams, screenshots, code in images. 200K context with vision.

Gemini 1.5 Pro: native multimodal — video frames, audio, images, text in one context. 1M token context handles hour-long videos.

Frontier VLM Capabilities

| Model             | Max image size | Video | OCR | Charts | Diagrams | ~$/image |
|-------------------|----------------|-------|-----|--------|----------|----------|
| GPT-4o            | 2048px tiled   | ✓     | ✓✓  | ✓✓     | ✓        | ~$0.002  |
| Claude 3.5 Sonnet | 8K px          | ✓     | ✓✓  | ✓      | ✓✓       | ~$0.003  |
| Gemini 1.5 Pro    | Native         | ✓✓    | ✓✓  | ✓✓     | ✓✓       | ~$0.002  |
| Gemini 1.5 Flash  | Native         | ✓     | ✓   | ✓      | ✓        | ~$0.0001 |
05 — Cost & Quality

Image Tokenization and Resolution

The resolution-cost tradeoff: more pixels → more image tokens → more LLM input tokens → more cost and latency

Tiled encoding: split high-res image into tiles (e.g., 2×2 grid of 336×336 tiles), encode each tile separately, concatenate tokens

Dynamic resolution (InternVL, Qwen2-VL): select optimal tile grid based on image aspect ratio — portrait images get 1×3 tiles, landscape get 3×1, etc.
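A minimal version of this grid selection can be sketched as a search over candidate grids for the one whose aspect ratio best matches the image. `choose_tile_grid` is a hypothetical helper, assuming a budget of at most six 336-px tiles; real implementations also weigh total pixel coverage.

```python
def choose_tile_grid(width: int, height: int,
                     max_tiles: int = 6, tile: int = 336) -> tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the tile budget
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

print(choose_tile_grid(1080, 1920))  # (1, 2) for a 9:16 portrait image
print(choose_tile_grid(1920, 1080))  # (2, 1) for a 16:9 landscape image
```

The image is then resized to `cols*tile × rows*tile`, each tile is encoded separately, and the resulting token sequences are concatenated in raster order.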

Estimate Image Token Cost

import math

# GPT-4o image token calculation: 85 base + 170 per 512-px tile at high detail
def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # fixed cost regardless of size
    # Rescale to fit within 2048×2048
    scale = min(2048 / max(width, height), 1.0)
    w, h = int(width * scale), int(height * scale)
    # Scale shortest side to 768
    scale2 = 768 / min(w, h)
    w, h = int(w * scale2), int(h * scale2)
    # Count 512-px tiles
    tiles_x = math.ceil(w / 512)
    tiles_y = math.ceil(h / 512)
    return 85 + 170 * (tiles_x * tiles_y)

print(estimate_image_tokens(1024, 768))   # 765 tokens (2×2 tiles)
print(estimate_image_tokens(4000, 3000))  # 765 tokens (downscaled to 1024×768)
⚠️ For document processing, use "high" detail. For quick visual Q&A or alt-text generation, use "low" detail (85 tokens flat) — 10× cheaper.
06 — In Practice

Use Cases and Pipeline Patterns

Document intelligence: extract text, tables, form fields from scanned PDFs — VLMs outperform OCR + NLP pipelines for complex layouts

Chart and diagram understanding: read graphs, flowcharts, architecture diagrams — Claude 3.5 and GPT-4o excel

Visual RAG: embed images with CLIP/SigLIP, retrieve by semantic image-text similarity, pass top-k images to VLM for generation

Screenshot analysis: UI testing, automated QA, accessibility auditing

Production Pipeline Patterns

📄 Document processing

  • Image → VLM (extract structured JSON)
  • Validate with Pydantic
  • Store in database
  • Use high-res detail for dense docs

🔍 Visual RAG

  • Images embedded with CLIP
  • Stored in vector DB
  • Query retrieves relevant images
  • VLM reasons over top-k

🔄 Multi-image reasoning

  • Pass multiple images in one context
  • Product comparison, before/after
  • Sequence understanding
  • 5–20 images per request

🎬 Video frame sampling

  • Extract 1 frame/sec or keyframes
  • Pass as ordered image sequence
  • VLM reasons over progression
  • Gemini handles video natively
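The frame-sampling step in the video pattern above reduces to index arithmetic. `sample_frame_indices` is a hypothetical helper; the actual frame decoding would use e.g. OpenCV or PyAV, and the resulting frames are then passed to the VLM as an ordered image sequence.

```python
def sample_frame_indices(total_frames: int, fps: float,
                         every_sec: float = 1.0,
                         max_frames: int = 32) -> list[int]:
    """Frame indices at ~1 frame per `every_sec` seconds, capped at `max_frames`."""
    step = max(int(fps * every_sec), 1)
    idx = list(range(0, total_frames, step))
    if len(idx) > max_frames:
        # thin uniformly down to the cap, keeping the first and last frames
        idx = [idx[round(i * (len(idx) - 1) / (max_frames - 1))]
               for i in range(max_frames)]
    return idx

# 10-second clip at 30 fps -> 10 indices, one per second
print(sample_frame_indices(300, 30))
```

The cap matters because each frame costs image tokens: 32 high-detail frames can easily exceed 20K input tokens on GPT-4o.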
Python · VLM pipeline: image captioning + structured data extraction
import base64
from pathlib import Path
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Encode image file to base64 for API submission."""
    return base64.b64encode(Path(path).read_bytes()).decode()

def caption_image(image_path: str) -> str:
    """Generate a detailed caption for an image."""
    b64 = encode_image(image_path)
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                               "detail": "high"}},  # "low" = faster, cheaper
                {"type": "text",
                 "text": "Describe this image in detail. Include objects, text, colors, and layout."}
            ]
        }],
        max_tokens=512
    ).choices[0].message.content

class DocumentData(BaseModel):
    title: str
    date: str | None
    key_figures: list[str]
    summary: str

def extract_document_data(image_path: str) -> DocumentData:
    """Extract structured data from a document image."""
    b64 = encode_image(image_path)
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract structured data from this document."}
            ]
        }],
        response_format=DocumentData,
        temperature=0.0
    )
    return result.choices[0].message.parsed

# Usage
caption = caption_image("chart.jpg")
print(f"Caption: {caption[:100]}")
doc = extract_document_data("invoice.jpg")
print(f"Title: {doc.title}, Date: {doc.date}")
07 — Measuring Performance

Evaluation and Benchmarks

VLM Benchmarks

| Benchmark | What it tests                      | SOTA model     |
|-----------|------------------------------------|----------------|
| MMBench   | General visual understanding (MCQ) | GPT-4o         |
| OCRBench  | Text recognition in images         | InternVL2      |
| ChartQA   | Chart comprehension                | Claude 3.5     |
| DocVQA    | Document visual QA                 | GPT-4o         |
| MMMU      | College-level multi-discipline     | GPT-4o         |
| Video-MME | Long video understanding           | Gemini 1.5 Pro |

Evaluation Best Practices

  • Test on your actual image types: scans, photos, and screenshots differ significantly.
  • Test at your target resolution: low-res "fast" mode may not work for dense documents.
  • Evaluate OCR accuracy separately from reasoning: a model may fail at text extraction but excel at interpretation.
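For structured-extraction tasks like the invoice example earlier, a per-field scorer keeps extraction accuracy separable from reasoning quality. `field_accuracy` is a minimal hypothetical helper using exact match; real evals often add normalization or fuzzy matching per field.

```python
def field_accuracy(preds: list[dict], golds: list[dict]) -> dict[str, float]:
    """Per-field exact-match accuracy for structured extraction."""
    fields = golds[0].keys()
    return {f: sum(p.get(f) == g.get(f) for p, g in zip(preds, golds)) / len(golds)
            for f in fields}

# Toy example: two documents, two fields
preds = [{"title": "Invoice 42", "date": "2024-01-03"},
         {"title": "Invoice 43", "date": None}]
golds = [{"title": "Invoice 42", "date": "2024-01-03"},
         {"title": "Invoice 43", "date": "2024-02-10"}]
print(field_accuracy(preds, golds))  # {'title': 1.0, 'date': 0.5}
```

Breaking accuracy out per field makes it obvious whether failures cluster in one field (say, dates in scanned documents) rather than being spread evenly.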

Tools & Frameworks

| Category    | Tool                 | Description                         |
|-------------|----------------------|-------------------------------------|
| Embedding   | CLIP                 | HuggingFace image-text encoder      |
| Embedding   | SigLIP               | Google's modern CLIP alternative    |
| Open-source | LLaVA                | CLIP + LLaMA vision-language model  |
| Open-source | InternVL2            | Strong OCR and document understanding |
| Open-source | Qwen2-VL             | Long video and multilingual support |
| API         | OpenAI Vision API    | GPT-4o image understanding          |
| API         | Anthropic Vision API | Claude 3.5 document analysis        |
| Framework   | transformers         | HuggingFace model hub               |
| Framework   | timm                 | PyTorch Image Models                |