GPT-4V (GPT-4 with Vision, released November 2023) and GPT-4o (May 2024) are OpenAI's multimodal models that accept images as part of the input prompt. The model processes images and text together in a unified transformer, with no separate image-to-text step, unlocking use cases impossible with text-only LLMs: document parsing, visual QA, screenshot-to-code, and more. There are two ways to pass an image:
import openai
import base64
from pathlib import Path

client = openai.OpenAI()

# Method 1: URL (for publicly accessible images)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)

# Method 2: Base64 (for local images, private images)
def encode_image(path: str) -> str:
    return base64.b64encode(Path(path).read_bytes()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text from this document."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image('invoice.jpg')}",
            }},
        ],
    }],
)
import openai, base64, json
from pathlib import Path

client = openai.OpenAI()

def extract_invoice_data(image_path: str) -> dict:
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the following from this invoice as JSON: {vendor_name, invoice_number, invoice_date (YYYY-MM-DD), total_amount, currency, line_items[{description, quantity, unit_price, total}]}. Return only valid JSON, no explanation."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},  # ensures syntactically valid JSON output
    )
    return json.loads(response.choices[0].message.content)

data = extract_invoice_data("invoice.jpg")  # PDFs must be converted to images first
print(f"Total: {data['currency']} {data['total_amount']}")
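Model-extracted numbers are worth validating before they reach downstream systems. A minimal sketch of a consistency check (the field names follow the prompt above; `validate_invoice` and the 0.01 tolerance are assumptions, not part of any SDK):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies found in extracted invoice data."""
    problems = []
    for i, item in enumerate(data.get("line_items", [])):
        # Each line's total should equal quantity x unit_price
        expected = round(item["quantity"] * item["unit_price"], 2)
        if abs(expected - item["total"]) > tolerance:
            problems.append(
                f"line {i}: {item['quantity']} x {item['unit_price']} != {item['total']}"
            )
    # Line items may legitimately differ from the grand total (tax, shipping),
    # but a mismatch is still worth flagging for review
    line_sum = round(sum(item["total"] for item in data.get("line_items", [])), 2)
    if abs(line_sum - data["total_amount"]) > tolerance:
        problems.append(f"line items sum to {line_sum}, invoice total is {data['total_amount']}")
    return problems
```

An empty return value means the extraction is at least arithmetically self-consistent; anything else should trigger a human review or a retry.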
import openai, base64

client = openai.OpenAI()

def screenshot_to_react(screenshot_path: str) -> str:
    with open(screenshot_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Convert this UI screenshot to a React component using Tailwind CSS. Match the layout closely. Use functional TypeScript. Include all visible text. Make it responsive. Return only the component code, no explanation."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
        max_tokens=2000,
    )
    return response.choices[0].message.content

react_code = screenshot_to_react("dashboard_mockup.png")
print(react_code)
GPT-4o vision has three detail levels, `low`, `high`, and `auto` (the default), that control quality and token cost:
# Explicit detail control
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this a cat or a dog?"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low",  # flat 85 tokens, enough for simple classification
            }},
        ],
    }],
)
For high-volume pipelines, use `detail: "low"` for classification tasks and `detail: "high"` only when you need to read text or see fine details.
import openai
from pydantic import BaseModel
from typing import List

client = openai.OpenAI()

class TableRow(BaseModel):
    column1: str
    column2: str
    column3: float

class TableData(BaseModel):
    headers: List[str]
    rows: List[TableRow]

# Use structured output to extract a table
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table from this image into structured data."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
        ],
    }],
    response_format=TableData,  # pydantic model → auto JSON schema
)
table = response.choices[0].message.parsed
for row in table.rows:
    print(f"{row.column1}, {row.column2}, {row.column3}")
Text in images: Use `detail: "high"` for any image containing text you need to read. At low detail, small text becomes illegible to the model.
PDF → images: GPT-4o doesn't accept PDF files directly via the Chat Completions API. Convert PDF pages to images first, e.g. `pdf2image.convert_from_path("doc.pdf")`, then send each page as a separate image.
Multiple images per request: You can send up to 100 images in a single request. Useful for multi-page documents, but watch context window usage — each high-detail image can be 765+ tokens.
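For multi-page documents, it helps to assemble all pages into one message so the model can reason across them. A small helper sketch (`build_multipage_message` is a hypothetical name, not part of the SDK; it assumes pages are already base64-encoded):

```python
def build_multipage_message(question: str, pages_b64: list[str],
                            detail: str = "high", mime: str = "image/png") -> dict:
    """Build one chat message whose content interleaves a question with every page image."""
    content = [{"type": "text", "text": question}]
    for i, b64 in enumerate(pages_b64, start=1):
        # Label each page so the model (and the answer) can reference pages by number
        content.append({"type": "text", "text": f"Page {i}:"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:{mime};base64,{b64}",
            "detail": detail,
        }})
    return {"role": "user", "content": content}
```

The result is passed as `messages=[build_multipage_message("Summarize this contract.", pages)]`; at high detail, budget roughly 765+ tokens per page as noted above.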
Not all images are equal: Very low resolution images (<100px in any dimension), heavily compressed JPEGs, and images with complex overlapping text can still confuse the model. Pre-process with image enhancement if OCR quality is poor.
Privacy: Images sent to the API are processed by OpenAI's servers. For sensitive documents (medical, financial, legal), review OpenAI's data processing agreement before using vision APIs.
GPT-4V (Vision) extends GPT-4 with the ability to accept images as part of the conversation context, enabling applications that combine visual understanding with language reasoning. The model can analyze photographs, diagrams, charts, screenshots, documents, and handwritten content, integrating visual observations with textual reasoning in a single unified response.
| Task Category | Capability Level | Notes |
|---|---|---|
| Object recognition | Excellent | Wide object vocabulary |
| Chart / graph reading | Very good | Occasional numeric errors on dense charts |
| OCR / text extraction | Good | Degrades on low-res or stylized fonts |
| Spatial reasoning | Moderate | Left/right confusions on complex scenes |
| Face identification | Refused | Privacy policy restriction |
| Mathematical diagrams | Good | Geometry better than abstract algebra |
Image resolution handling in GPT-4V uses a tile-based approach for high-resolution inputs. At high detail, the image is first scaled to fit within 2048×2048, then scaled so its shortest side is 768 pixels, and finally cut into 512×512 tiles that are each processed by the vision encoder, with a low-resolution overview image also provided. This tile-based processing allows fine detail analysis of large images but significantly increases token consumption: each tile costs 170 tokens on top of a flat 85 for the overview, so a 2048×2048 image (downscaled to 768×768, four tiles) consumes 765 tokens before the text prompt begins. Resizing images to the minimum resolution needed for the task dramatically reduces cost without quality loss for tasks that don't require fine detail.
Multi-image inputs in GPT-4V enable comparison, before/after analysis, and multi-page document processing within a single conversation turn. Images are interleaved with text in the message content array, allowing precisely positioned references like "compare the chart on the left with the chart on the right." The total number of images per request is limited by the maximum token budget, so long conversations with many images require careful management of which images remain in the context window and which are evicted to stay within limits.
Structured data extraction from document images is one of the highest-ROI applications of GPT-4V. Invoices, receipts, forms, and tables photographed or scanned as images can be passed directly to GPT-4V with a prompt requesting structured JSON output containing the extracted fields. This eliminates the preprocessing pipeline required by traditional OCR + parsing approaches — no need for a separate OCR service, layout analysis, or field-specific regex patterns. For documents with consistent structure, GPT-4V extraction accuracy often matches or exceeds specialized OCR pipelines while requiring far less engineering effort.
Video understanding with GPT-4V requires frame sampling since the API accepts only static images. Common strategies include uniform sampling (one frame per second or per N seconds), keyframe extraction (selecting frames where scene content changes significantly), and transcript-aligned sampling (selecting frames at timestamps corresponding to key moments in the audio transcription). For most video analysis tasks, 5–20 well-chosen frames capture sufficient visual information while keeping the token cost manageable. Dense sampling of every frame is rarely necessary and quickly exhausts the context window for longer videos.
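Uniform sampling reduces to picking evenly spaced frame indices; a minimal sketch of the index arithmetic (decoding the frames themselves, e.g. with OpenCV or ffmpeg, is left out, and `uniform_frame_indices` is an illustrative helper, not a library function):

```python
def uniform_frame_indices(total_frames: int, fps: float, every_seconds: float = 2.0,
                          max_frames: int = 20) -> list[int]:
    """Evenly spaced frame indices: one frame every `every_seconds`, capped at `max_frames`."""
    step = max(1, round(fps * every_seconds))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin uniformly so long videos still fit the frame budget
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

Each selected frame is then encoded and sent as one image in the content array, the same way as the multi-page document case, so the 5–20 frame guideline above maps directly onto the `max_frames` cap.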
GPT-4V accessibility applications use visual description to assist users with visual impairments. The model can describe complex images, read text from photos, navigate user interfaces from screenshots, and explain charts and diagrams in natural language. For these accessibility use cases, response format matters significantly — structured descriptions that proceed from the most important information to supporting details serve users better than comprehensive inventories of all visible elements. Few-shot prompting with examples of high-quality image descriptions helps the model adopt an appropriate level of detail and descriptive style for accessibility contexts.
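The few-shot approach can be implemented by preloading user/assistant turns with example descriptions before the real image. A sketch (the system prompt wording and `build_accessibility_messages` helper are illustrative assumptions, not OpenAI guidance):

```python
def build_accessibility_messages(image_data_url: str,
                                 examples: list[tuple[str, str]]) -> list[dict]:
    """Prepend (example-image, model-description) few-shot pairs before the real image."""
    messages = [{
        "role": "system",
        "content": "Describe images for a blind user: lead with the most important "
                   "information, then supporting detail. Read any visible text verbatim.",
    }]
    for example_url, description in examples:
        # Each example pair shows the desired level of detail and style
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": example_url}},
        ]})
        messages.append({"role": "assistant", "content": description})
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": image_data_url}},
    ]})
    return messages
```

Note that every example image costs tokens on every request, so two or three well-chosen examples are usually the practical limit.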