Google's compact VLM combining SigLIP vision encoder with Gemma language model. Optimised for fine-tuning on downstream tasks: captioning, DocVQA, OCR, segmentation, and detection.
PaliGemma (Google DeepMind, May 2024) is a 3B parameter vision-language model combining a SigLIP vision encoder (400M params, trained with sigmoid loss instead of contrastive loss) with a Gemma 2B language model. The vision encoder processes images at 224×224 resolution, producing 256 image tokens. These are linearly projected and prepended to the text tokens before the Gemma language model processes them. Total trainable: 3B parameters.
Unlike GPT-4V, which is designed for general conversation, PaliGemma is designed as a transfer model: pre-trained on a broad mixture of vision-language tasks, then fine-tuned on specific downstream tasks. It excels at tasks with clear input-output structure: captioning, VQA, document reading, and structured prediction.
PaliGemma (v1): 3B params, 224×224 input, strong on captioning and VQA. Widely used for fine-tuning. PaliGemma 2 (December 2024): three sizes (3B, 10B, 28B). Uses the Gemma 2 backbone (improved quality). Supports 224, 448, and 896 px input resolutions, with the higher resolutions aimed at document tasks. Significantly better on DocVQA, TextVQA, and OCR benchmarks. For new projects, prefer PaliGemma 2; for fine-tuning on a tight budget, PaliGemma v1 3B is more accessible.
```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "google/paligemma-3b-mix-224"
# Or PaliGemma 2: "google/paligemma2-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

image = Image.open("chart.png")
# PaliGemma uses task-specific prompt prefixes
prompt = "caption en"  # or "answer en What is shown in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(generated_text)
```
```python
from transformers import (
    PaliGemmaForConditionalGeneration, AutoProcessor,
    TrainingArguments, Trainer,
)
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from PIL import Image
import torch

model_id = "google/paligemma-3b-pt-224"  # use the pretrained (pt) checkpoint for fine-tuning
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# LoRA config for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

def collate_fn(examples):
    images = [Image.open(ex["image_path"]) for ex in examples]
    inputs = processor(
        text=[ex["prompt"] for ex in examples],
        images=images,
        suffix=[ex["label"] for ex in examples],  # target text
        return_tensors="pt",
        padding=True,
    )
    inputs["labels"] = inputs["input_ids"].clone()
    # Compute loss only on the suffix (answer) tokens
    inputs["labels"][inputs["token_type_ids"] == 0] = -100
    return inputs

# Dataset schema: {"image_path": [...], "prompt": ["answer en ..."], "label": [...]}
train_dataset = Dataset.from_dict({
    "image_path": ["invoices/0001.png"],  # illustrative paths
    "prompt": ["answer en What is the invoice total?"],
    "label": ["$1,234.56"],
})

training_args = TrainingArguments(
    output_dir="./paligemma-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,
    dataloader_num_workers=4,
    remove_unused_columns=False,  # keep raw columns for the collator
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,
)
trainer.train()
```
PaliGemma uses structured prompt prefixes that tell it what task to perform:
```python
task_prompts = {
    "captioning": "caption en",
    "vqa": "answer en {question}",
    "ocr": "ocr",
    "detection": "detect {object}",         # returns bounding box tokens
    "segmentation": "segment {object}",     # returns segmentation mask tokens
    "document_qa": "answer en {question}",  # for DocVQA
    "reference_expression": "refer {description}",  # locate described region
}

# Examples:
prompts = [
    ("caption en", "→ A bar chart showing quarterly revenue growth"),
    ("answer en What company is shown?", "→ Acme Corp"),
    ("ocr", "→ Invoice #12345 Total: $1,234.56"),
    ("detect car", "→ "),  # four <loc####> location tokens followed by the label
]
```
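The detect prefix's output can be post-processed mechanically. A minimal parser sketch, assuming the documented format of four `<loc####>` tokens (y_min, x_min, y_max, x_max binned on a 0–1023 grid) followed by the class label, with multiple detections separated by `;` — verify the format against the checkpoint you deploy:

```python
import re

def parse_detection(output: str, width: int, height: int):
    """Parse PaliGemma 'detect' output of the form
    '<loc####><loc####><loc####><loc####> label' into pixel boxes.
    The four tokens encode y_min, x_min, y_max, x_max on a 0-1023 grid."""
    boxes = []
    pattern = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]+)")
    for tokens, label in pattern.findall(output):
        y_min, x_min, y_max, x_max = [int(v) for v in re.findall(r"\d{4}", tokens)]
        boxes.append({
            "label": label.strip(),
            "box": (x_min / 1024 * width, y_min / 1024 * height,
                    x_max / 1024 * width, y_max / 1024 * height),
        })
    return boxes

print(parse_detection("<loc0256><loc0128><loc0768><loc0896> car", 1024, 1024))
# → [{'label': 'car', 'box': (128.0, 256.0, 896.0, 768.0)}]
```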
On benchmarks such as DocVQA, the pretrained PaliGemma 3B checkpoint scores far below its fine-tuned counterpart. That gap highlights PaliGemma's design philosophy: it is a transfer model, not a zero-shot model. For document-specific tasks, fine-tuning with even 100–500 examples yields dramatic improvements.
Google releases both pretrained (paligemma-3b-pt-224) and fine-tuned mix (paligemma-3b-mix-224) checkpoints. For fine-tuning on a new task, start from the pretrained checkpoint; the mix checkpoint has already been fine-tuned and may interfere.

PaliGemma's instruction format follows a specific structure in which the task prefix determines the model's behavior. Prefixes like "caption", "describe", "ocr", "detect", and "answer" trigger distinct behaviors learned during the mix fine-tuning stage. Using the wrong prefix for a task, for instance "describe" when "ocr" is appropriate for text extraction, produces qualitatively different and typically lower-quality output because the model follows the wrong task pathway. Reading the task-prefix documentation before deployment prevents this common misconfiguration.
| Task prefix | Output type | Typical use case |
|---|---|---|
| caption | Short natural language description | Image alt text, search indexing |
| describe | Detailed description | Accessibility, image understanding |
| ocr | Text extracted from image | Document digitization, form reading |
| detect <object> | Bounding box coordinates | Object detection, spatial reasoning |
| answer <question> | Direct answer string | Visual QA, image-based queries |
PaliGemma's compact size makes it the most practical open-weight VLM for on-device and edge deployment scenarios. The 3B variant runs inference at interactive speeds on consumer GPUs (RTX 3060 and above) and can be quantized to 4-bit using GPTQ or AWQ with minimal quality degradation on most vision tasks. For mobile deployment using ONNX Runtime or TensorFlow Lite, the quantized 3B model fits within the memory constraints of high-end mobile devices, enabling on-device image understanding without API calls.
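As one concrete route in the Hugging Face stack, 4-bit loading can be configured with bitsandbytes NF4 (an alternative to prebuilt GPTQ/AWQ checkpoints; treat these settings as a starting point, not a tuned recipe):

```python
from transformers import PaliGemmaForConditionalGeneration, BitsAndBytesConfig
import torch

# NF4 4-bit quantization config (bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",
    quantization_config=bnb_config,
    device_map="auto",
)
```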
PaliGemma's mixture fine-tuning approach trained the model jointly on diverse vision-language tasks including image captioning, visual question answering, referring expression comprehension, document understanding, and image segmentation. This multi-task training produces a model that transfers effectively to new tasks with minimal additional fine-tuning, because the shared representation space has been organized to support a wide range of visual reasoning patterns. Domain-specific fine-tuning starting from PaliGemma consistently outperforms fine-tuning from CLIP + frozen language model baselines, because PaliGemma's pre-training has already aligned visual and language representations for structured task completion.
Vision-language alignment in PaliGemma uses a linear projection layer that maps SigLIP visual embeddings into the Gemma language model's embedding space. Unlike more complex cross-attention mechanisms, this simple linear projection is computationally efficient and produces competitive alignment quality. The projection layer is the most task-sensitive component of the model: fine-tuning it alongside the language model on domain-specific image-text pairs is the minimum intervention needed to adapt PaliGemma to a new visual domain, while keeping the SigLIP backbone frozen preserves the general visual feature extraction capability.
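A toy sketch of that alignment path. The dimensions are assumptions (SigLIP So400m width 1152, Gemma 2B hidden size 2048), and the real model uses a trained projection, not random weights:

```python
import torch
from torch import nn

# Minimal sketch of PaliGemma-style alignment: one linear layer maps
# vision-encoder embeddings into the LM embedding space.
VISION_DIM, LM_DIM = 1152, 2048  # assumed SigLIP / Gemma 2B widths
projector = nn.Linear(VISION_DIM, LM_DIM)

image_embeds = torch.randn(1, 256, VISION_DIM)  # 256 image tokens
projected = projector(image_embeds)             # -> LM embedding space
text_embeds = torch.randn(1, 10, LM_DIM)        # 10 text tokens

# Image tokens are prepended to the text tokens before the LM runs.
lm_input = torch.cat([projected, text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([1, 266, 2048])
```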
PaliGemma's segmentation capability, inherited from PaLI's detection pre-training, enables referring expression segmentation — identifying the region of an image described by a natural language phrase. This capability is exposed through the "segment" task prefix and returns segmentation-mask tokens alongside bounding-box tokens, rather than a box alone. For applications requiring precise object boundary delineation — medical image analysis, quality inspection, satellite image annotation — PaliGemma's built-in segmentation eliminates the need for a separate specialized segmentation model in the pipeline.
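In the released format (an assumption worth verifying per checkpoint), each mask is emitted as four `<loc####>` box tokens followed by sixteen `<seg###>` codewords, which must then be decoded into an actual binary mask by the checkpoint's mask decoder — that decoding step is omitted here. A minimal token-extraction sketch:

```python
import re

def extract_segmentation_tokens(output: str):
    """Split a PaliGemma 'segment' response into its raw parts:
    <loc####> box tokens and <seg###> mask codewords (decoding the
    codewords into a pixel mask requires the model's mask decoder)."""
    locs = [int(v) for v in re.findall(r"<loc(\d{4})>", output)]
    segs = [int(v) for v in re.findall(r"<seg(\d{3})>", output)]
    return locs, segs

out = ("<loc0010><loc0020><loc0500><loc0600>"
       + "".join(f"<seg{i:03d}>" for i in range(16)) + " cat")
locs, segs = extract_segmentation_tokens(out)
print(len(locs), len(segs))  # 4 16
```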
PaliGemma's SigLIP vision encoder was trained using sigmoid loss rather than the contrastive softmax loss used by CLIP, producing a vision encoder with improved calibration for fine-grained visual tasks. The sigmoid loss treats each image-text pair independently rather than contrasting against all other pairs in the batch, which reduces the dependence on large batch sizes for training quality. This training difference produces SigLIP encoders with higher linear probing accuracy on fine-grained recognition benchmarks compared to CLIP encoders of equivalent size, explaining part of PaliGemma's strong performance on visual reasoning tasks.
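The difference is easy to state in code. A pure-Python sketch of the sigmoid pairwise loss (the temperature and bias values are illustrative stand-ins, not SigLIP's learned values):

```python
import math

def siglip_pairwise_loss(sim, temperature=10.0, bias=-10.0):
    """Sketch of SigLIP's sigmoid loss over an n x n similarity matrix:
    each image-text pair is an independent binary problem (+1 on the
    diagonal, -1 off it), so no batch-wide softmax normalization is
    needed -- unlike CLIP's contrastive loss."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            z = temperature * sim[i][j] + bias
            label = 1.0 if i == j else -1.0
            # -log sigmoid(label * z): binary log-loss for this pair
            loss += math.log1p(math.exp(-label * z))
    return loss / n

sim = [[0.9, 0.1], [0.2, 0.8]]  # toy cosine similarities
print(round(siglip_pairwise_loss(sim), 4))
```

Because each pair contributes independently, the loss has no term that couples it to the rest of the batch, which is why training quality depends less on very large batch sizes.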
PaliGemma's 224px and 448px input resolution variants offer a latency-quality tradeoff for deployment. The 224px variant processes images roughly 4x faster than the 448px variant due to the quadratic scaling of visual token count with resolution, making it suitable for real-time applications. The 448px variant captures finer image details and consistently outperforms the 224px variant on tasks requiring text reading, small object recognition, and fine-grained categorization. Selecting the input resolution based on the smallest resolution at which task quality is acceptable minimizes inference cost while meeting accuracy requirements.
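The roughly 4x figure follows directly from the patch arithmetic (patch size 14 assumed, as in the SigLIP So400m/14 encoder):

```python
# Visual token count grows quadratically with input side length:
# doubling the side quadruples the image tokens, and with them the
# per-layer attention/FFN work spent on the image.
def image_tokens(side: int, patch: int = 14) -> int:
    return (side // patch) ** 2

t224, t448 = image_tokens(224), image_tokens(448)
print(t224, t448, t448 // t224)  # 256 1024 4
```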