Alibaba's multimodal model with strong OCR and chart understanding. Qwen2-VL supports variable-resolution inputs and achieves state-of-the-art on DocVQA, ChartQA, and multilingual OCR.
Qwen-VL (Alibaba, August 2023) was one of the first competitive open-source vision-language models. Qwen2-VL (September 2024) is a major upgrade, available in 2B, 7B, and 72B sizes. Key improvements: (1) Naive Dynamic Resolution — processes images at their native resolution without fixed-size resizing, mapping each image to a dynamic number of visual tokens (the processor's default pixel budget runs from 256×28×28 to 1280×28×28 pixels); (2) Multimodal Rotary Position Embedding (M-RoPE) — encodes both spatial and temporal positions, enabling video understanding; (3) state-of-the-art OCR on Chinese and English text in images.
Qwen2-VL uses a Vision Transformer (ViT) as the visual encoder, splitting images into 14×14 pixel patches and merging each 2×2 group of adjacent patches into a single visual token. The Naive Dynamic Resolution mechanism scales token count with image resolution — a 224×224 image produces 256 patches (64 visual tokens); a 1120×1120 image produces 6,400 patches (1,600 tokens). Temporal compression (2× for video) reduces video frames to manageable token counts. The visual tokens are interleaved with text tokens in the Qwen2 language model's input sequence rather than attended to via a separate cross-attention pathway, with M-RoPE decomposing each position into temporal, height, and width components so that 2D spatial position and frame index are encoded simultaneously.
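The patch-and-token arithmetic can be sketched in a few lines. This is a simplified estimate that assumes image dimensions are already multiples of 28 and applies the 2×2 patch merging Qwen2-VL performs after the ViT; the real processor additionally enforces min/max pixel budgets:

```python
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate Qwen2-VL visual tokens for an image at native resolution."""
    patches = (height // patch) * (width // patch)  # 14x14 ViT patches
    return patches // (merge * merge)               # 2x2 patch merging

# 224x224 -> 256 patches -> 64 tokens; 1120x1120 -> 6400 patches -> 1600 tokens
```

Because the token count grows linearly with pixel area, doubling both image dimensions quadruples the visual token budget.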
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Single image VQA
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/chart.png"},
            {"type": "text", "text": "What is the highest value in this bar chart?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Trim the prompt tokens before decoding
output = processor.batch_decode(
    [out[len(inp):] for out, inp in zip(generated_ids, inputs.input_ids)],
    skip_special_tokens=True,
)[0]
print(output)
```
Qwen2-VL excels at dense document understanding tasks:
```python
def analyze_document(image_path: str, question: str) -> str:
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path,
                    "min_pixels": 224 * 224,
                    "max_pixels": 1280 * 28 * 28,  # high res for documents
                },
                {"type": "text", "text": question},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=1024)
    return processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Examples
print(analyze_document("invoice.png", "Extract vendor name, total amount, and date as JSON"))
print(analyze_document("chart.png", "Summarise the key trend shown in this chart"))
print(analyze_document("contract.pdf.png", "What is the termination clause?"))
```
```python
# Multiple images in one conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "before.png"},
            {"type": "image", "image": "after.png"},
            {"type": "text", "text": "What changed between these two screenshots?"},
        ],
    }
]
```
```python
# Video understanding
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "demo.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,  # sample 1 frame per second
            },
            {"type": "text", "text": "Describe what happens in this video"},
        ],
    }
]
```
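The `max_pixels` and `fps` settings together determine the video's token budget. A back-of-envelope estimate, assuming one visual token per 28×28 pixel block and the 2× temporal compression described earlier (the helper name is hypothetical, not part of any Qwen API):

```python
import math

def video_token_budget(duration_s: float, sample_fps: float, max_pixels: int,
                       temporal_merge: int = 2) -> int:
    """Rough visual-token estimate for a video clip."""
    n_frames = max(1, int(duration_s * sample_fps))  # frames after fps sampling
    tokens_per_frame = max_pixels // (28 * 28)       # one token per 28x28 pixel block
    return math.ceil(n_frames / temporal_merge) * tokens_per_frame

# e.g. a 60 s clip sampled at 1 fps with max_pixels=360*420
budget = video_token_budget(60, 1.0, 360 * 420)
```

Estimates like this help check that a long video at the chosen `fps` and resolution still fits the context window before committing to inference.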
As of late 2024, Qwen2-VL-72B leads open VLMs on document and chart benchmarks such as DocVQA, ChartQA, and multilingual OCR. The 7B model achieves competitive performance on doc/chart tasks at a fraction of the compute of the 72B model — a good choice for production document processing.
The `process_vision_info` function from the `qwen_vl_utils` package handles image/video loading and format conversion. Install it separately: `pip install qwen-vl-utils`. Without it, the examples above cannot prepare image inputs.

Set `min_pixels`/`max_pixels` in the image dict to control resolution. Higher resolution means more tokens and more VRAM. For typical photos 1024×1024 is sufficient; for dense documents allow up to 1280×1280.

Qwen2-VL's dynamic resolution handling processes images at their native resolution rather than forcing a fixed 224×224 or 448×448 crop. The vision encoder divides each image into fixed-size patches and produces a number of visual tokens proportional to its resolution. High-resolution inputs produce more visual tokens and enable finer-grained understanding of details like small text, dense charts, and intricate diagrams. The trade-off is increased inference latency for high-resolution inputs, so applications must balance image resolution against latency requirements based on task demands.
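The resizing the processor performs under these pixel budgets can be approximated as follows — a sketch modeled on the `smart_resize` helper in `qwen_vl_utils`; the exact rounding in the library may differ:

```python
import math

def smart_resize(h: int, w: int, factor: int = 28,
                 min_pixels: int = 256 * 28 * 28,
                 max_pixels: int = 1280 * 28 * 28) -> tuple[int, int]:
    """Round dimensions to multiples of 28 and scale into the pixel budget."""
    h_bar = max(factor, round(h / factor) * factor)
    w_bar = max(factor, round(w / factor) * factor)
    if h_bar * w_bar > max_pixels:       # too large: scale down
        beta = math.sqrt((h * w) / max_pixels)
        h_bar = math.floor(h / beta / factor) * factor
        w_bar = math.floor(w / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:     # too small: scale up
        beta = math.sqrt(min_pixels / (h * w))
        h_bar = math.ceil(h * beta / factor) * factor
        w_bar = math.ceil(w * beta / factor) * factor
    return h_bar, w_bar
```

The factor of 28 is the 14-pixel patch size times the 2×2 patch merge, so each 28×28 pixel block of the resized image maps to one visual token.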
| Model | Parameters | Context window | Video support |
|---|---|---|---|
| Qwen2-VL-2B | 2B | 32K tokens | Yes (limited) |
| Qwen2-VL-7B | 7B | 128K tokens | Yes |
| Qwen2-VL-72B | 72B | 128K tokens | Yes (extended) |
| QVQ-72B-Preview (reasoning) | 72B | 128K tokens | Yes |
Fine-tuning Qwen2-VL on domain-specific image-text pairs follows the same PEFT workflow as text-only Qwen2 models, using LoRA adapters applied to the language model backbone while keeping the vision encoder frozen. The most effective fine-tuning datasets pair image inputs with the exact output format required by the application — if the downstream task requires structured JSON extraction from invoice images, training examples should demonstrate the JSON extraction format directly. Mixed fine-tuning datasets that combine domain-specific examples with general instruction-following examples prevent catastrophic forgetting of the model's baseline capabilities.
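As a sanity check on adapter size, here is the parameter arithmetic for rank-16 LoRA applied to all attention and MLP projections of a Qwen2-7B-scale backbone. The dimensions below (hidden 3584, GQA key/value dim 512, intermediate 18944, 28 layers) match the published Qwen2-7B config but should be treated as illustrative assumptions:

```python
def lora_param_count(rank: int = 16, hidden: int = 3584, kv_dim: int = 512,
                     intermediate: int = 18944, n_layers: int = 28) -> int:
    """LoRA adds rank * (d_in + d_out) parameters per adapted linear layer."""
    attn = (rank * (hidden + hidden)          # q_proj
            + rank * (hidden + kv_dim)        # k_proj (grouped-query attention)
            + rank * (hidden + kv_dim)        # v_proj
            + rank * (hidden + hidden))       # o_proj
    mlp = (rank * (hidden + intermediate)     # gate_proj
           + rank * (hidden + intermediate)   # up_proj
           + rank * (intermediate + hidden))  # down_proj
    return n_layers * (attn + mlp)

params = lora_param_count()
mb = params * 2 / 2**20  # bf16: 2 bytes per parameter, ~77 MB
```

Rank 16 across attention and MLP lands around 40M adapter parameters (~77 MB in bf16), squarely inside the 50–200 MB range quoted below; adapting fewer modules or using rank 8 shrinks it further.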
Qwen2-VL's video understanding capability processes video as a sequence of sampled frames, with the model reasoning across temporal relationships between frames. The frame sampling rate can be configured based on video content type: fast-moving scenes require denser sampling (4–8 fps) while instructional or static-scene videos can be sampled more sparsely (1–2 fps) with equivalent understanding quality. The dynamic token allocation mechanism ensures that frames with more visual complexity receive more visual tokens than simple or static frames, improving overall video representation quality within context window constraints.
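Frame sampling at a target fps reduces to simple index arithmetic; this is a generic sketch (the function name is hypothetical), and `qwen_vl_utils` performs a roughly equivalent selection when you pass `fps` in the video dict:

```python
def sample_frame_indices(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Indices of the frames to keep when downsampling a video to target_fps."""
    step = video_fps / target_fps  # keep every step-th frame
    n = int(total_frames / step)
    return [min(total_frames - 1, round(i * step)) for i in range(n)]

# 10 s of 30 fps video sampled at 1 fps -> 10 frames
indices = sample_frame_indices(300, 30.0, 1.0)
```

Raising `target_fps` for fast-moving content increases `n` linearly, which multiplies directly into the visual token budget.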
Document understanding with Qwen2-VL leverages the model's native high-resolution input capability to process scanned documents, PDFs rendered as images, and screenshots containing dense text. Unlike approaches that first run OCR and then process the extracted text, Qwen2-VL reasons directly from the visual representation, preserving spatial layout information that is lost in text extraction. This layout preservation is valuable for structured document understanding tasks like invoice processing, form extraction, and table interpretation, where the spatial relationship between elements (column alignment, checkbox proximity, label-value pairs) conveys semantic information not captured by sequential text extraction.
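Because the model returns free-form text even when asked for JSON, structured-extraction pipelines usually post-process the reply. A minimal sketch (the helper name is hypothetical) that tolerates markdown code fences around the JSON:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Pull the first JSON object out of a model reply."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))

reply = 'Here is the data:\n```json\n{"vendor": "Acme Corp", "total": "199.00"}\n```'
fields = parse_json_reply(reply)
```

In production, wrap `json.loads` in a retry loop that re-prompts the model on parse failure rather than crashing the pipeline.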
Qwen2-VL's multilingual visual understanding capability extends to text recognition and visual question answering across multiple scripts and languages, reflecting the multilingual training of the underlying Qwen2 language model. This cross-lingual capability is particularly valuable for global document processing applications where documents in different languages must be processed by a single model without language-specific preprocessing pipelines. The model handles mixed-language documents — a Chinese invoice with English product names, or a Japanese form with numerical data — more reliably than models trained primarily on English vision-language data.
Qwen2-VL's M-RoPE position encoding for visual tokens enables consistent spatial reasoning regardless of image resolution or aspect ratio. By encoding each visual token with its (height, width) coordinates in the image grid rather than a flat sequential position, the model retains awareness of where in the image each visual token originated. This position encoding approach supports accurate spatial reasoning about object locations, directional relationships, and layout understanding across images with diverse aspect ratios and content densities, without requiring the input image to be normalized to a fixed canonical format.
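The idea can be illustrated by enumerating a (temporal, height, width) coordinate triple per visual token — a toy sketch of the position decomposition, not the actual rotary-embedding implementation:

```python
def visual_position_ids(n_frames: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    """One (t, h, w) coordinate triple per visual token, in raster order."""
    return [(t, h, w)
            for t in range(n_frames)
            for h in range(grid_h)
            for w in range(grid_w)]

# a single image with a 2x3 token grid: temporal index is constant, spatial indices vary
ids = visual_position_ids(1, 2, 3)
```

For video, the temporal index advances per frame while the spatial grid repeats, which is how one encoding covers both images and frame sequences.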
The most practical adaptation strategy for domain-specific visual tasks is LoRA fine-tuning of the language model backbone with the vision encoder frozen. LoRA at rank 8 or 16 on the backbone's attention layers requires only 50–200MB of adapter weights for a 7B model, whereas full fine-tuning of the 7B model on domain data requires approximately 28GB of VRAM. Freezing the vision encoder is justified when domain images are sufficiently similar to the pre-training distribution that the existing visual features are adequate; specialized domains like medical imaging may benefit from also fine-tuning the vision encoder.