GPT-4V (GPT-4 with Vision, released November 2023) and GPT-4o (May 2024) are OpenAI's multimodal models that accept images as part of the input prompt. The model processes images and text together in a unified transformer, with no separate image-to-text step, unlocking use cases impossible with text-only LLMs: document parsing, visual QA, screenshot-to-code, and more. There are two ways to pass an image:
import openai
import base64
from pathlib import Path

client = openai.OpenAI()

# Method 1: URL (for publicly accessible images)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)

# Method 2: Base64 (for local images, private images)
def encode_image(path: str) -> str:
    return base64.b64encode(Path(path).read_bytes()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text from this document."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image('invoice.jpg')}",
            }},
        ],
    }],
)
import openai, base64, json
from pathlib import Path

client = openai.OpenAI()

def extract_invoice_data(image_path: str) -> dict:
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the following from this invoice as JSON: {vendor_name, invoice_number, invoice_date (YYYY-MM-DD), total_amount, currency, line_items[{description, quantity, unit_price, total}]}. Return only valid JSON, no explanation."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},  # ensures syntactically valid JSON output
    )
    return json.loads(response.choices[0].message.content)

data = extract_invoice_data("invoice.jpg")  # PDFs must be converted to images first
print(f"Total: {data['currency']} {data['total_amount']}")
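Model-extracted numbers are worth validating before they reach downstream systems. A minimal sketch of a consistency check (the field names follow the prompt above; `validate_invoice` and the 0.01 tolerance are assumptions, not part of any SDK):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies found in extracted invoice data."""
    problems = []
    for i, item in enumerate(data.get("line_items", [])):
        # Each line's total should equal quantity x unit_price
        expected = round(item["quantity"] * item["unit_price"], 2)
        if abs(expected - item["total"]) > tolerance:
            problems.append(
                f"line {i}: {item['quantity']} x {item['unit_price']} != {item['total']}"
            )
    # Line items may legitimately differ from the grand total (tax, shipping),
    # but a mismatch is still worth flagging for review
    line_sum = round(sum(item["total"] for item in data.get("line_items", [])), 2)
    if abs(line_sum - data["total_amount"]) > tolerance:
        problems.append(f"line items sum to {line_sum}, invoice total is {data['total_amount']}")
    return problems
```

An empty return value means the extraction is at least arithmetically self-consistent; anything else should trigger a human review or a retry.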
import openai, base64

client = openai.OpenAI()

def screenshot_to_react(screenshot_path: str) -> str:
    with open(screenshot_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Convert this UI screenshot to a React component using Tailwind CSS. Match the layout closely. Use functional TypeScript. Include all visible text. Make it responsive. Return only the component code, no explanation."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
        max_tokens=2000,
    )
    return response.choices[0].message.content

react_code = screenshot_to_react("dashboard_mockup.png")
print(react_code)
GPT-4o vision has three detail levels, `low`, `high`, and `auto` (the default), that control quality and token cost:
# Explicit detail control
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this a cat or a dog?"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low",  # flat 85 tokens, enough for simple classification
            }},
        ],
    }],
)
For high-volume pipelines, use `detail: "low"` for classification tasks and `detail: "high"` only when you need to read text or see fine details.
import openai
from pydantic import BaseModel
from typing import List

client = openai.OpenAI()

class TableRow(BaseModel):
    column1: str
    column2: str
    column3: float

class TableData(BaseModel):
    headers: List[str]
    rows: List[TableRow]

# Use structured output to extract a table
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table from this image into structured data."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
        ],
    }],
    response_format=TableData,  # pydantic model → auto JSON schema
)
table = response.choices[0].message.parsed
for row in table.rows:
    print(f"{row.column1}, {row.column2}, {row.column3}")
Text in images: Use `detail: "high"` for any image containing text you need to read. At low detail, small text becomes illegible to the model.
PDF → images: GPT-4o doesn't accept PDF files directly via the Chat Completions API. Convert PDF pages to images first, e.g. `pdf2image.convert_from_path("doc.pdf")`, then send each page as a separate image.
Multiple images per request: You can send up to 100 images in a single request. Useful for multi-page documents, but watch context window usage — each high-detail image can be 765+ tokens.
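For multi-page documents, it helps to assemble all pages into one message so the model can reason across them. A small helper sketch (`build_multipage_message` is a hypothetical name, not part of the SDK; it assumes pages are already base64-encoded):

```python
def build_multipage_message(question: str, pages_b64: list[str],
                            detail: str = "high", mime: str = "image/png") -> dict:
    """Build one chat message whose content interleaves a question with every page image."""
    content = [{"type": "text", "text": question}]
    for i, b64 in enumerate(pages_b64, start=1):
        # Label each page so the model (and the answer) can reference pages by number
        content.append({"type": "text", "text": f"Page {i}:"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:{mime};base64,{b64}",
            "detail": detail,
        }})
    return {"role": "user", "content": content}
```

The result is passed as `messages=[build_multipage_message("Summarize this contract.", pages)]`; at high detail, budget roughly 765+ tokens per page as noted above.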
Not all images are equal: Very low resolution images (<100px in any dimension), heavily compressed JPEGs, and images with complex overlapping text can still confuse the model. Pre-process with image enhancement if OCR quality is poor.
Privacy: Images sent to the API are processed by OpenAI's servers. For sensitive documents (medical, financial, legal), review OpenAI's data processing agreement before using vision APIs.
GPT-4V (Vision) extends GPT-4 with the ability to accept images as part of the conversation context, enabling applications that combine visual understanding with language reasoning. The model can analyze photographs, diagrams, charts, screenshots, documents, and handwritten content, integrating visual observations with textual reasoning in a single unified response.
| Task Category | Capability Level | Notes |
|---|---|---|
| Object recognition | Excellent | Wide object vocabulary |
| Chart / graph reading | Very good | Occasional numeric errors on dense charts |
| OCR / text extraction | Good | Degrades on low-res or stylized fonts |
| Spatial reasoning | Moderate | Left/right confusions on complex scenes |
| Face identification | Refused | Privacy policy restriction |
| Mathematical diagrams | Good | Geometry better than abstract algebra |
Image resolution handling in GPT-4V uses a tile-based approach for high-resolution inputs. At high detail, the image is first scaled to fit within 2048×2048, then scaled so its shortest side is 768 pixels, and finally cut into 512×512 tiles that are each processed by the vision encoder, with a low-resolution overview image also provided. This tile-based processing allows fine detail analysis of large images but significantly increases token consumption: each tile costs 170 tokens on top of a flat 85 for the overview, so a 2048×2048 image (downscaled to 768×768, four tiles) consumes 765 tokens before the text prompt begins. Resizing images to the minimum resolution needed for the task dramatically reduces cost without quality loss for tasks that don't require fine detail.
Multi-image inputs in GPT-4V enable comparison, before/after analysis, and multi-page document processing within a single conversation turn. Images are interleaved with text in the message content array, allowing precisely positioned references like "compare the chart on the left with the chart on the right." The total number of images per request is limited by the maximum token budget, so long conversations with many images require careful management of which images remain in the context window and which are evicted to stay within limits.
Structured data extraction from document images is one of the highest-ROI applications of GPT-4V. Invoices, receipts, forms, and tables photographed or scanned as images can be passed directly to GPT-4V with a prompt requesting structured JSON output containing the extracted fields. This eliminates the preprocessing pipeline required by traditional OCR + parsing approaches — no need for a separate OCR service, layout analysis, or field-specific regex patterns. For documents with consistent structure, GPT-4V extraction accuracy often matches or exceeds specialized OCR pipelines while requiring far less engineering effort.
Video understanding with GPT-4V requires frame sampling since the API accepts only static images. Common strategies include uniform sampling (one frame per second or per N seconds), keyframe extraction (selecting frames where scene content changes significantly), and transcript-aligned sampling (selecting frames at timestamps corresponding to key moments in the audio transcription). For most video analysis tasks, 5–20 well-chosen frames capture sufficient visual information while keeping the token cost manageable. Dense sampling of every frame is rarely necessary and quickly exhausts the context window for longer videos.
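Uniform sampling reduces to picking evenly spaced frame indices; a minimal sketch of the index arithmetic (decoding the frames themselves, e.g. with OpenCV or ffmpeg, is left out, and `uniform_frame_indices` is an illustrative helper, not a library function):

```python
def uniform_frame_indices(total_frames: int, fps: float, every_seconds: float = 2.0,
                          max_frames: int = 20) -> list[int]:
    """Evenly spaced frame indices: one frame every `every_seconds`, capped at `max_frames`."""
    step = max(1, round(fps * every_seconds))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin uniformly so long videos still fit the frame budget
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

Each selected frame is then encoded and sent as one image in the content array, the same way as the multi-page document case, so the 5–20 frame guideline above maps directly onto the `max_frames` cap.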
GPT-4V accessibility applications use visual description to assist users with visual impairments. The model can describe complex images, read text from photos, navigate user interfaces from screenshots, and explain charts and diagrams in natural language. For these accessibility use cases, response format matters significantly — structured descriptions that proceed from the most important information to supporting details serve users better than comprehensive inventories of all visible elements. Few-shot prompting with examples of high-quality image descriptions helps the model adopt an appropriate level of detail and descriptive style for accessibility contexts.
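The few-shot approach can be implemented by preloading user/assistant turns with example descriptions before the real image. A sketch (the system prompt wording and `build_accessibility_messages` helper are illustrative assumptions, not OpenAI guidance):

```python
def build_accessibility_messages(image_data_url: str,
                                 examples: list[tuple[str, str]]) -> list[dict]:
    """Prepend (example-image, model-description) few-shot pairs before the real image."""
    messages = [{
        "role": "system",
        "content": "Describe images for a blind user: lead with the most important "
                   "information, then supporting detail. Read any visible text verbatim.",
    }]
    for example_url, description in examples:
        # Each example pair shows the desired level of detail and style
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": example_url}},
        ]})
        messages.append({"role": "assistant", "content": description})
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": image_data_url}},
    ]})
    return messages
```

Note that every example image costs tokens on every request, so two or three well-chosen examples are usually the practical limit.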