Vision-Language Models

LLaVA

Visual instruction tuning: CLIP visual encoder + Vicuna (Llama-based) text decoder bridged by a trainable projection layer. One of the first widely adopted open VLMs, inspiring LLaVA-1.5, LLaVA-NeXT, and dozens of variants.

CLIP + LLM
Vision-language bridge
Visual instruct
Tuned on GPT-4 captions
Open source
Widely adopted family

Table of Contents

SECTION 01

LLaVA architecture

LLaVA (Visual Instruction Tuning, Liu et al. 2023) combines a frozen CLIP visual encoder with a large language model (originally Vicuna/Llama) connected by a trainable linear projection layer. The architecture is elegantly simple:

  1. Visual encoder: CLIP ViT-L/14 encodes the image into a grid of patch tokens (256 at the original 224×224 input resolution; 576 after LLaVA-1.5 moved to 336×336)
  2. Projection: A learnable linear layer maps CLIP features to the LLM's embedding space
  3. Language model: Visual tokens are prepended to the text tokens and processed by the LLM

The projection layer is trained in both stages and the LLM only in the second; CLIP is frozen throughout. LLaVA-1.5 upgraded the projection to a two-layer MLP, boosting performance significantly.
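The shape bookkeeping of the connector can be sketched in a few lines. The dimensions below are illustrative assumptions (CLIP ViT-L/14 patch features are 1024-dimensional; a 7B Llama-family LLM has hidden size 4096), and the `projector` stands in for LLaVA-1.5's two-layer MLP, not the released weights:

```python
import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 outputs 1024-d patch features at 336px
# (576 patches); Vicuna/Llama-7B's hidden size is 4096.
clip_dim, llm_dim, num_patches = 1024, 4096, 576

# LLaVA-1.5-style connector: a two-layer MLP (the original LLaVA used a
# single linear layer).
projector = nn.Sequential(
    nn.Linear(clip_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(1, num_patches, clip_dim)  # frozen CLIP output
visual_tokens = projector(patch_features)               # now in the LLM's space
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

The projected tokens are then simply concatenated in front of the text-token embeddings before the LLM forward pass.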

SECTION 02

Training pipeline

LLaVA uses a two-stage training process:

Stage 1 — Feature alignment: Train only the projection layer on image-caption pairs (a filtered subset of CC3M). The goal is to map CLIP features into the LLM's embedding space. Both CLIP and the LLM are frozen. This stage takes ~1 hour on 8 A100s.

Stage 2 — Visual instruction tuning: Fine-tune both the projection layer and the LLM (using LoRA or full fine-tuning) on multimodal instruction-following data. LLaVA's original dataset was generated by feeding COCO image captions to GPT-4 and asking it to generate instruction-following conversations — a surprisingly effective synthetic data strategy.
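The freezing schedule of the two stages can be sketched with toy stand-in modules (the `nn.Linear` layers here are placeholders for the real CLIP/projector/LLM, purely to show which parameters learn when):

```python
import torch.nn as nn

# Toy stand-ins for the real modules, just to show the freezing schedule.
clip_encoder = nn.Linear(8, 8)   # stands in for CLIP ViT-L/14 (always frozen)
projector = nn.Linear(8, 8)      # the trainable connector
llm = nn.Linear(8, 8)            # stands in for Vicuna/Llama

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 — feature alignment: only the projector learns.
set_trainable(clip_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2 — visual instruction tuning: projector + LLM learn; CLIP stays frozen.
set_trainable(llm, True)
```

In practice stage 2 often swaps full LLM fine-tuning for LoRA adapters, but the frozen-CLIP / trainable-projector split stays the same.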

SECTION 03

Running LLaVA inference

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import torch, requests

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Load image (local file or URL)
image = Image.open("chart.png")
# or: image = Image.open(requests.get(url, stream=True).raw)

# Build conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does this chart show? Summarise the key trends."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the new tokens
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))

SECTION 04

LLaVA-1.5 and LLaVA-NeXT

LLaVA-1.5: Upgraded the projection from a linear layer to a 2-layer MLP, moved the CLIP encoder to a higher 336×336 input resolution, and used higher-quality instruction data. It achieved state-of-the-art results among open VLMs in 2023 at just 13B total parameters.

LLaVA-NeXT (v1.6): Added dynamic high-resolution processing — images are split into tiles (up to 4× the base resolution), each processed by CLIP separately, then merged. This allows reading fine text and detailed charts that lower-resolution VLMs miss. LLaVA-NeXT with a Mistral-7B or Llama-3-8B backend is the practical open-source choice for most vision tasks today.
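The tiling idea behind dynamic high-resolution processing can be sketched with PIL. This is a simplified illustration, not LLaVA-NeXT's actual implementation: the real "anyres" scheme also selects the best tile grid per aspect ratio, and `split_into_tiles` and its fixed 2×2 grid are assumptions for the sketch:

```python
from PIL import Image

def split_into_tiles(image: Image.Image, tile: int = 336, grid=(2, 2)) -> list:
    """Sketch of LLaVA-NeXT-style tiling: resize to the tile grid, cut into
    tiles, and keep a downscaled overview image for global context."""
    w, h = grid[0] * tile, grid[1] * tile
    resized = image.resize((w, h))
    tiles = [
        resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
        for y in range(grid[1])
        for x in range(grid[0])
    ]
    overview = image.resize((tile, tile))  # low-res global view
    return [overview] + tiles

img = Image.new("RGB", (1000, 800))
views = split_into_tiles(img)
print(len(views), views[1].size)  # 5 (336, 336)
```

Each view is encoded by CLIP independently, so the visual token count grows with the number of tiles — the price paid for reading fine print.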

SECTION 05

Building a VQA pipeline

from pathlib import Path
import json

def analyze_images(image_paths: list[str], question: str) -> list[dict]:
    results = []
    for img_path in image_paths:
        image = Image.open(img_path)
        conversation = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ]}]
        prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        answer = processor.decode(new_tokens, skip_special_tokens=True)
        results.append({"image": img_path, "question": question, "answer": answer})
    return results

# Example: extract structured data from invoice images
invoices = ["invoice_001.png", "invoice_002.png"]
results = analyze_images(invoices, "Extract the total amount, vendor name, and date as JSON.")
print(json.dumps(results, indent=2))

SECTION 06

LLaVA vs other VLMs

SECTION 07

Gotchas

LLaVA model comparison and selection

The LLaVA model family has evolved significantly from the original 7B model to LLaVA-NeXT with support for higher resolutions and stronger reasoning. Selecting the right variant requires balancing capability against hardware requirements. The 7B variants fit on a single consumer GPU (16GB VRAM) and handle most visual understanding tasks adequately. The 13B and 34B variants provide meaningfully better performance on complex visual reasoning, document understanding, and chart interpretation tasks, at the cost of requiring 24–48GB VRAM for comfortable inference.

Model            Parameters   Max resolution   Best for
LLaVA-1.5-7B     7B           336×336          General VQA, image captioning
LLaVA-1.5-13B    13B          336×336          More complex image reasoning
LLaVA-NeXT-7B    7B           672×672+         High-res images, documents
LLaVA-NeXT-34B   34B          1344×1344        Dense documents, fine details
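A tiny helper can encode this selection guidance. The VRAM thresholds follow the prose above, and the repo ids are the `llava-hf` Hugging Face checkpoints — verify they match the hub before relying on them:

```python
def pick_llava_variant(vram_gb: int, needs_high_res: bool) -> str:
    """Rough model selection per the comparison table.

    Thresholds (16/24/48 GB) are the guideline figures from the text,
    not hard requirements; quantization changes the picture.
    """
    if needs_high_res:
        # LLaVA-NeXT for documents, charts, and fine detail.
        if vram_gb >= 48:
            return "llava-hf/llava-v1.6-34b-hf"
        return "llava-hf/llava-v1.6-mistral-7b-hf"
    # LLaVA-1.5 is enough for general VQA and captioning.
    if vram_gb >= 24:
        return "llava-hf/llava-v1.5-13b-hf"
    return "llava-hf/llava-v1.5-7b-hf"

print(pick_llava_variant(16, needs_high_res=True))
```

A 16GB consumer GPU lands on the NeXT-7B checkpoint for document work, matching the table's "single consumer GPU" claim.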

LLaVA fine-tuning on domain-specific image-text pairs requires two stages mirroring the original training: first, training only the projection layer to align domain-specific visual features with the language model's embedding space, then jointly fine-tuning the projection layer and language model on instruction-following examples. Skipping the projection pre-training stage and going directly to instruction fine-tuning produces lower-quality results, particularly when domain images have different statistical properties from the CLIP training distribution — medical imaging, satellite imagery, and microscopy are common cases where the two-stage approach matters.
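The two stages consume differently shaped data. Below is a sketch of the two record types, loosely following LLaVA's JSON layout (a `conversations` list of human/gpt turns with an `<image>` placeholder); the helper names and the fixed "Describe the image." prompt are illustrative assumptions:

```python
def caption_record(image_path: str, caption: str) -> dict:
    # Stage 1: plain image-caption pairs for projector alignment.
    return {"image": image_path, "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image."},
        {"from": "gpt", "value": caption},
    ]}

def instruction_record(image_path: str, question: str, answer: str) -> dict:
    # Stage 2: instruction-following turns for joint fine-tuning.
    return {"image": image_path, "conversations": [
        {"from": "human", "value": f"<image>\n{question}"},
        {"from": "gpt", "value": answer},
    ]}

rec = instruction_record("scan_001.png", "What is the total amount?", "$1,240.00")
print(rec["conversations"][0]["value"])
```

For domain adaptation, stage 1 needs only captions (often cheap to collect), while stage 2 needs the harder-to-produce instruction pairs — which is why skipping stage 1 to save annotation effort is a false economy.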

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
conversation = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is shown in this image?"}
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Strip the prompt tokens so only the model's answer is printed
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))

LLaVA's visual token representation uses CLIP's patch-based encoding: a 336×336 image is divided into a 24×24 grid of 14×14-pixel patches, producing 576 visual tokens, each representing a 14×14 pixel region of the image. This fixed-resolution encoding limits LLaVA-1.5's ability to process fine-grained details in high-resolution images like document scans, small text, and dense charts. LLaVA-NeXT addresses this limitation by processing high-resolution images as a grid of sub-image tiles (plus a downscaled overview), each encoded independently and concatenated, enabling effective processing of images up to 1344×1344 pixels at the cost of producing proportionally more visual tokens per image.
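The token arithmetic is worth working through once, since it drives both context length and inference cost:

```python
# Fixed-resolution encoder: ViT-L/14 at 336px input.
image_side, patch_side = 336, 14
grid = image_side // patch_side      # 24 patches per side
tokens = grid * grid                 # 576 visual tokens per image
print(grid, tokens)  # 24 576

# LLaVA-NeXT at 1344×1344: a 4×4 grid of 336px tiles (overview aside),
# so the visual token count scales with the number of tiles.
tiles = (1344 // image_side) ** 2
print(tiles, tiles * tokens)  # 16 9216
```

At maximum resolution, visual tokens alone can approach 10K — a large share of a 7B model's context window, which is why tiling is applied adaptively rather than always at full resolution.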

Multi-image reasoning with LLaVA enables comparative analysis tasks that require understanding relationships between multiple images. The model can process multiple images in a single context by prepending each image's visual tokens before the corresponding reference in the text. Use cases include change detection between before-and-after image pairs, multi-view 3D understanding from different camera angles, and visual instruction following that references multiple product images. The multi-image capability is particularly valuable for e-commerce applications comparing product images or medical applications analyzing image sequences over time.
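The prompt structure for a two-image comparison looks like the following sketch. Note the caveat: whether a given checkpoint handles multiple images well depends on its training data, so treat this as the shape of the request rather than a guarantee of quality; `before_img`/`after_img` are assumed to be PIL images loaded elsewhere:

```python
# Two "image" entries in one user turn; the processor pairs them, in order,
# with the images passed via the `images=` argument.
conversation = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "image"},
    {"type": "text", "text": "What changed between the first and second image?"},
]}]

# With a loaded processor/model and two PIL images (before_img, after_img):
# prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# inputs = processor(images=[before_img, after_img], text=prompt,
#                    return_tensors="pt").to(model.device)
# output = model.generate(**inputs, max_new_tokens=256)
```

Each extra image adds its full complement of visual tokens, so multi-image prompts consume context quickly at high resolutions.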

LLaVA instruction tuning quality depends heavily on the visual instruction dataset. The original LLaVA training used GPT-4 to generate instruction-following examples from image captions and bounding box annotations, producing a diverse set of question types covering description, reasoning, and detail spotting. Domain-specific LLaVA fine-tuning follows the same methodology: collect domain images, generate instruction-following pairs using a capable LLM with image context, and fine-tune on the resulting dataset. The generated instruction pairs should cover the full range of question types expected in production rather than only the most common or easy query patterns.
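A minimal sketch of the synthetic-data recipe: hand a caption (plus any annotations) to a capable LLM and ask for an instruction-following turn. The prompt wording and `build_generation_prompt` helper are illustrative, not LLaVA's exact generation prompt:

```python
def build_generation_prompt(caption: str) -> str:
    """Build a text prompt asking a strong LLM to synthesize one
    instruction-following Q&A pair from an image caption."""
    return (
        "You are given a description of an image.\n"
        f"Description: {caption}\n"
        "Write one question a user might plausibly ask about this image and a "
        "detailed answer. Vary question types across description, reasoning, "
        "and fine-detail spotting."
    )

print(build_generation_prompt("A bar chart of quarterly revenue by region."))
```

Rotating the requested question type across the dataset is what produces the coverage the paragraph above calls for, rather than a pile of near-identical "describe this image" pairs.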

LLaVA's projection layer design choice — using a simple two-layer MLP rather than a cross-attention mechanism — was validated empirically to produce comparable or better performance on visual instruction-following benchmarks while being computationally simpler. The MLP projection maps 1024-dimensional CLIP ViT-L/14 patch embeddings to the language model's hidden dimension (typically 4096), effectively translating the visual feature space into the textual embedding space. This architectural simplicity makes LLaVA particularly accessible for academic research and custom deployment, as the projection layer can be trained on consumer hardware in hours using a modest dataset of image-caption pairs.