Google's compact VLM combining SigLIP vision encoder with Gemma language model. Optimised for fine-tuning on downstream tasks: captioning, DocVQA, OCR, segmentation, and detection.
PaliGemma (Google DeepMind, May 2024) is a 3B parameter vision-language model combining a SigLIP vision encoder (400M params, trained with sigmoid loss instead of contrastive loss) with a Gemma 2B language model. The vision encoder processes images at 224×224 resolution, producing 256 image tokens. These are linearly projected and prepended to the text tokens before the Gemma language model processes them. Total trainable: 3B parameters.
Unlike GPT-4V, which is designed for general conversation, PaliGemma is designed as a transfer model: pre-trained on a broad mixture of vision-language tasks, then fine-tuned on specific downstream tasks. It excels at tasks with clear input-output structure: captioning, VQA, document reading, and structured prediction.
PaliGemma (v1): 3B params, 224×224 input, strong on captioning and VQA. Widely used for fine-tuning. PaliGemma 2 (December 2024): three sizes (3B, 10B, 28B). Uses the Gemma 2 backbone (improved quality). Supports 224, 448, and 896 px input resolutions, with the higher resolutions aimed at document tasks. Significantly better on DocVQA, TextVQA, and OCR benchmarks. For new projects, prefer PaliGemma 2; for fine-tuning on a tight budget, PaliGemma v1 3B is more accessible.
```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "google/paligemma-3b-mix-224"
# Or PaliGemma 2: "google/paligemma2-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

image = Image.open("chart.png")
# PaliGemma uses task-specific prompt prefixes
prompt = "caption en"  # or "answer en What is shown in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(generated_text)
```
```python
from transformers import (
    PaliGemmaForConditionalGeneration, AutoProcessor,
    TrainingArguments, Trainer,
)
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from PIL import Image
import torch

model_id = "google/paligemma-3b-pt-224"  # use the pretrained (pt) checkpoint for fine-tuning
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# LoRA config for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

def collate_fn(examples):
    images = [Image.open(ex["image_path"]) for ex in examples]
    inputs = processor(
        text=[ex["prompt"] for ex in examples],
        images=images,
        suffix=[ex["label"] for ex in examples],  # target text
        return_tensors="pt",
        padding=True,
    )
    inputs["labels"] = inputs["input_ids"].clone()
    # Compute loss only on the suffix (answer) tokens
    inputs["labels"][inputs["token_type_ids"] == 0] = -100
    return inputs

# Dataset schema: {"image_path": [...], "prompt": ["answer en ..."], "label": [...]}
train_dataset = Dataset.from_dict({
    "image_path": ["invoices/0001.png"],  # illustrative paths
    "prompt": ["answer en What is the invoice total?"],
    "label": ["$1,234.56"],
})

training_args = TrainingArguments(
    output_dir="./paligemma-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,
    dataloader_num_workers=4,
    remove_unused_columns=False,  # keep raw columns for the collator
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,
)
trainer.train()
```
PaliGemma uses structured prompt prefixes that tell it what task to perform:
```python
task_prompts = {
    "captioning": "caption en",
    "vqa": "answer en {question}",
    "ocr": "ocr",
    "detection": "detect {object}",         # returns bounding box tokens
    "segmentation": "segment {object}",     # returns segmentation mask tokens
    "document_qa": "answer en {question}",  # for DocVQA
    "reference_expression": "refer {description}",  # locate described region
}

# Examples:
prompts = [
    ("caption en", "→ A bar chart showing quarterly revenue growth"),
    ("answer en What company is shown?", "→ Acme Corp"),
    ("ocr", "→ Invoice #12345 Total: $1,234.56"),
    ("detect car", "→ "),  # four <loc####> location tokens followed by the label
]
```
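The detect prefix's output can be post-processed mechanically. A minimal parser sketch, assuming the documented format of four `<loc####>` tokens (y_min, x_min, y_max, x_max binned on a 0–1023 grid) followed by the class label, with multiple detections separated by `;` — verify the format against the checkpoint you deploy:

```python
import re

def parse_detection(output: str, width: int, height: int):
    """Parse PaliGemma 'detect' output of the form
    '<loc####><loc####><loc####><loc####> label' into pixel boxes.
    The four tokens encode y_min, x_min, y_max, x_max on a 0-1023 grid."""
    boxes = []
    pattern = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]+)")
    for tokens, label in pattern.findall(output):
        y_min, x_min, y_max, x_max = [int(v) for v in re.findall(r"\d{4}", tokens)]
        boxes.append({
            "label": label.strip(),
            "box": (x_min / 1024 * width, y_min / 1024 * height,
                    x_max / 1024 * width, y_max / 1024 * height),
        })
    return boxes

print(parse_detection("<loc0256><loc0128><loc0768><loc0896> car", 1024, 1024))
# → [{'label': 'car', 'box': (128.0, 256.0, 896.0, 768.0)}]
```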
On benchmarks such as DocVQA, the pretrained PaliGemma 3B checkpoint scores far below its fine-tuned counterpart. That gap highlights PaliGemma's design philosophy: it is a transfer model, not a zero-shot model. For document-specific tasks, fine-tuning with even 100–500 examples yields dramatic improvements.
Google releases both pretrained (paligemma-3b-pt-224) and fine-tuned mix (paligemma-3b-mix-224) checkpoints. For fine-tuning on a new task, start from the pretrained checkpoint; the mix checkpoint has already been fine-tuned and may interfere.

PaliGemma's instruction format follows a specific structure in which the task prefix determines the model's behavior. Prefixes like "caption", "describe", "ocr", "detect", and "answer" trigger distinct behaviors learned during the mix fine-tuning stage. Using the wrong prefix for a task, for instance "describe" when "ocr" is appropriate for text extraction, produces qualitatively different and typically lower-quality output because the model follows the wrong task pathway. Reading the task-prefix documentation before deployment prevents this common misconfiguration.
| Task prefix | Output type | Typical use case |
|---|---|---|
| caption | Short natural language description | Image alt text, search indexing |
| describe | Detailed description | Accessibility, image understanding |
| ocr | Text extracted from image | Document digitization, form reading |
| detect <object> | Bounding box coordinates | Object detection, spatial reasoning |
| answer <question> | Direct answer string | Visual QA, image-based queries |
PaliGemma's compact size makes it the most practical open-weight VLM for on-device and edge deployment scenarios. The 3B variant runs inference at interactive speeds on consumer GPUs (RTX 3060 and above) and can be quantized to 4-bit using GPTQ or AWQ with minimal quality degradation on most vision tasks. For mobile deployment using ONNX Runtime or TensorFlow Lite, the quantized 3B model fits within the memory constraints of high-end mobile devices, enabling on-device image understanding without API calls.
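As one concrete route in the Hugging Face stack, 4-bit loading can be configured with bitsandbytes NF4 (an alternative to prebuilt GPTQ/AWQ checkpoints; treat these settings as a starting point, not a tuned recipe):

```python
from transformers import PaliGemmaForConditionalGeneration, BitsAndBytesConfig
import torch

# NF4 4-bit quantization config (bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",
    quantization_config=bnb_config,
    device_map="auto",
)
```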
PaliGemma's mixture fine-tuning approach trained the model jointly on diverse vision-language tasks including image captioning, visual question answering, referring expression comprehension, document understanding, and image segmentation. This multi-task training produces a model that transfers effectively to new tasks with minimal additional fine-tuning, because the shared representation space has been organized to support a wide range of visual reasoning patterns. Domain-specific fine-tuning starting from PaliGemma consistently outperforms fine-tuning from CLIP + frozen language model baselines, because PaliGemma's pre-training has already aligned visual and language representations for structured task completion.
Vision-language alignment in PaliGemma uses a linear projection layer that maps SigLIP visual embeddings into the Gemma language model's embedding space. Unlike more complex cross-attention mechanisms, this simple linear projection is computationally efficient and produces competitive alignment quality. The projection layer is the most task-sensitive component of the model: fine-tuning it alongside the language model on domain-specific image-text pairs is the minimum intervention needed to adapt PaliGemma to a new visual domain, while keeping the SigLIP backbone frozen preserves the general visual feature extraction capability.
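A toy sketch of that alignment path. The dimensions are assumptions (SigLIP So400m width 1152, Gemma 2B hidden size 2048), and the real model uses a trained projection, not random weights:

```python
import torch
from torch import nn

# Minimal sketch of PaliGemma-style alignment: one linear layer maps
# vision-encoder embeddings into the LM embedding space.
VISION_DIM, LM_DIM = 1152, 2048  # assumed SigLIP / Gemma 2B widths
projector = nn.Linear(VISION_DIM, LM_DIM)

image_embeds = torch.randn(1, 256, VISION_DIM)  # 256 image tokens
projected = projector(image_embeds)             # -> LM embedding space
text_embeds = torch.randn(1, 10, LM_DIM)        # 10 text tokens

# Image tokens are prepended to the text tokens before the LM runs.
lm_input = torch.cat([projected, text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([1, 266, 2048])
```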
PaliGemma's segmentation capability, inherited from PaLI's detection pre-training, enables referring expression segmentation — identifying the region of an image described by a natural language phrase. This capability is exposed through the "segment" task prefix and returns segmentation-mask tokens alongside bounding-box tokens, rather than a box alone. For applications requiring precise object boundary delineation — medical image analysis, quality inspection, satellite image annotation — PaliGemma's built-in segmentation eliminates the need for a separate specialized segmentation model in the pipeline.
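In the released format (an assumption worth verifying per checkpoint), each mask is emitted as four `<loc####>` box tokens followed by sixteen `<seg###>` codewords, which must then be decoded into an actual binary mask by the checkpoint's mask decoder — that decoding step is omitted here. A minimal token-extraction sketch:

```python
import re

def extract_segmentation_tokens(output: str):
    """Split a PaliGemma 'segment' response into its raw parts:
    <loc####> box tokens and <seg###> mask codewords (decoding the
    codewords into a pixel mask requires the model's mask decoder)."""
    locs = [int(v) for v in re.findall(r"<loc(\d{4})>", output)]
    segs = [int(v) for v in re.findall(r"<seg(\d{3})>", output)]
    return locs, segs

out = ("<loc0010><loc0020><loc0500><loc0600>"
       + "".join(f"<seg{i:03d}>" for i in range(16)) + " cat")
locs, segs = extract_segmentation_tokens(out)
print(len(locs), len(segs))  # 4 16
```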
PaliGemma's SigLIP vision encoder was trained using sigmoid loss rather than the contrastive softmax loss used by CLIP, producing a vision encoder with improved calibration for fine-grained visual tasks. The sigmoid loss treats each image-text pair independently rather than contrasting against all other pairs in the batch, which reduces the dependence on large batch sizes for training quality. This training difference produces SigLIP encoders with higher linear probing accuracy on fine-grained recognition benchmarks compared to CLIP encoders of equivalent size, explaining part of PaliGemma's strong performance on visual reasoning tasks.
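The difference is easy to state in code. A pure-Python sketch of the sigmoid pairwise loss (the temperature and bias values are illustrative stand-ins, not SigLIP's learned values):

```python
import math

def siglip_pairwise_loss(sim, temperature=10.0, bias=-10.0):
    """Sketch of SigLIP's sigmoid loss over an n x n similarity matrix:
    each image-text pair is an independent binary problem (+1 on the
    diagonal, -1 off it), so no batch-wide softmax normalization is
    needed -- unlike CLIP's contrastive loss."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            z = temperature * sim[i][j] + bias
            label = 1.0 if i == j else -1.0
            # -log sigmoid(label * z): binary log-loss for this pair
            loss += math.log1p(math.exp(-label * z))
    return loss / n

sim = [[0.9, 0.1], [0.2, 0.8]]  # toy cosine similarities
print(round(siglip_pairwise_loss(sim), 4))
```

Because each pair contributes independently, the loss has no term that couples it to the rest of the batch, which is why training quality depends less on very large batch sizes.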
PaliGemma's 224px and 448px input resolution variants offer a latency-quality tradeoff for deployment. The 224px variant processes images roughly 4x faster than the 448px variant due to the quadratic scaling of visual token count with resolution, making it suitable for real-time applications. The 448px variant captures finer image details and consistently outperforms the 224px variant on tasks requiring text reading, small object recognition, and fine-grained categorization. Selecting the input resolution based on the smallest resolution at which task quality is acceptable minimizes inference cost while meeting accuracy requirements.
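The roughly 4x figure follows directly from the patch arithmetic (patch size 14 assumed, as in the SigLIP So400m/14 encoder):

```python
# Visual token count grows quadratically with input side length:
# doubling the side quadruples the image tokens, and with them the
# per-layer attention/FFN work spent on the image.
def image_tokens(side: int, patch: int = 14) -> int:
    return (side // patch) ** 2

t224, t448 = image_tokens(224), image_tokens(448)
print(t224, t448, t448 // t224)  # 256 1024 4
```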