01 — Foundation
What Multimodal Means
Multimodal AI processes information from multiple data formats simultaneously: images, text, audio, video, and numerical data. Traditional LLMs understand only text. Multimodal models unify different modalities into a shared embedding space, allowing a single model to reason across vision, language, and sound.
The core innovation is cross-modal alignment: embedding different modalities (image and text, speech and text) into the same vector space so similar concepts have similar representations regardless of modality. This enables image-to-text retrieval, visual question answering, captioning, and more.
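The alignment idea can be sketched with cosine similarity over toy vectors. Real systems use learned encoders like CLIP producing ~512-dim embeddings; the 4-dim vectors below are illustrative stand-ins only:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for encoder outputs.
image_dog = [0.9, 0.1, 0.0, 0.2]   # image encoder output for a dog photo
text_dog  = [0.8, 0.2, 0.1, 0.1]   # text encoder output for "a dog"
text_car  = [0.0, 0.1, 0.9, 0.3]   # text encoder output for "a car"

# Contrastive training pulls matching pairs together across modalities,
# so an image retrieves its matching caption by similarity alone.
print(cosine_similarity(image_dog, text_dog))  # high (same concept)
print(cosine_similarity(image_dog, text_car))  # low (different concept)
```

This is exactly what makes image-to-text retrieval a nearest-neighbor search rather than a generation problem.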
Two Directions: Understanding vs. Generation
| Direction | Input | Output | Models |
| --- | --- | --- | --- |
| Understanding | Image / audio / video | Text / embedding | GPT-4V, LLaVA, CLIP, Whisper |
| Generation | Text (prompt) | Image / audio / video | DALL-E 3, Stable Diffusion, Flux, TTS |
💡 Key distinction: Multimodal means one model handles multiple inputs. Cross-modal means aligning different modalities into the same space. Unified foundation models can understand and generate across all modalities.
02 — Understanding
Vision-Language Models
Vision-language models (VLMs) analyze images and answer questions about them. They combine a vision encoder (e.g., ViT) with an LLM decoder to output natural language descriptions, answers, or reasoning. Modern VLMs achieve remarkable performance on image understanding, visual reasoning, document parsing, and diagram analysis.
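The encoder-decoder fusion can be sketched as a linear projection from vision-feature space into the LLM's token-embedding space, the connector design popularized by LLaVA-style models. All dimensions below are toy values, not real model sizes:

```python
import random

random.seed(0)

def matmul(A, B):
    """Naive matrix multiply: (n×k) @ (k×m) -> (n×m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical sizes: the vision encoder emits 4 patch embeddings of dim 6;
# the LLM's token-embedding dim is 8.
num_patches, vis_dim, llm_dim = 4, 6, 8
patch_embeddings = [[random.gauss(0, 1) for _ in range(vis_dim)]
                    for _ in range(num_patches)]

# A learned linear projection maps vision features into the LLM's space.
projection = [[random.gauss(0, 0.1) for _ in range(llm_dim)]
              for _ in range(vis_dim)]
visual_tokens = matmul(patch_embeddings, projection)  # 4 pseudo-tokens, dim 8

# The LLM then attends over [visual tokens] + [text tokens] as one sequence.
text_tokens = [[random.gauss(0, 1) for _ in range(llm_dim)] for _ in range(3)]
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 tokens of dim 8
```

The projection is the only new trainable piece in the simplest designs; both the vision encoder and the LLM can start frozen.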
Key Models
1. GPT-4V — Most capable
OpenAI's multimodal model handles image input with strong visual reasoning, OCR, and chart interpretation.
- Available through the OpenAI API and in ChatGPT
- Excellent at document analysis, scene understanding, visual math
- Context window up to 128K tokens
- Highest cost but strongest performance
2. LLaVA — Open-source alternative
Large Language and Vision Assistant: an efficient, open-weight model for image understanding, instruction-tuned on 665K visual instruction samples.
- Free, run locally or via Replicate/HuggingFace
- Good for visual Q&A, captioning, scene understanding
- Faster inference than GPT-4V
- Smaller model, weaker on complex reasoning
3. PaliGemma — Google's lightweight option
Efficient image understanding with dense captioning, object localization, and VQA. Fast inference on consumer hardware.
- Based on Gemma 2B architecture
- Strong on dense tasks (OCR, region understanding)
- Low latency, suitable for real-time applications
- Open weights, customizable
4. Qwen-VL — High resolution
Alibaba's vision-language model optimized for high-resolution images and dense text understanding.
- Handles 1280×1280 images natively
- Strong on document understanding and OCR
- Supports multiple languages
- Competitive with GPT-4V on many benchmarks
✓ When to use VLMs: Document extraction, visual Q&A, image captioning, accessibility (alt text generation), scene understanding, diagramming, architectural analysis.
03 — Generation
Image Generation
Diffusion models have become the backbone of modern image generation. Instead of directly generating pixels, they iteratively refine noise into coherent images guided by text prompts. This approach is more stable and controllable than older GAN-based methods.
Diffusion Model Pipeline
Diffusion adds noise to an image step-by-step, then trains a neural network to reverse this process. At generation time, the model starts with pure noise and iteratively denoises guided by a text prompt (via CLIP embeddings or cross-attention). More denoising steps → higher quality but slower; fewer steps → faster but lower quality.
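The forward (noising) half of this process has a closed form. A toy sketch with a single scalar "pixel" and an assumed linear beta schedule shows the surviving signal fraction decaying step by step; the trained network's entire job is learning to reverse this:

```python
import math
import random

random.seed(0)

# Assumed linear beta schedule: noise variance added per step (toy values).
T = 10
betas = [0.02 + 0.08 * t / (T - 1) for t in range(T)]

# alpha_bar_t: fraction of the original signal surviving after t steps.
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= (1 - beta)
    alpha_bars.append(prod)

x0 = 0.7  # a single scalar "pixel" standing in for an image
for t in [0, T // 2, T - 1]:
    eps = random.gauss(0, 1)
    # Closed-form forward process: x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε
    xt = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1 - alpha_bars[t]) * eps
    print(f"step {t}: signal fraction {alpha_bars[t]:.2f}, noisy value {xt:+.2f}")
```

Because ᾱ_t shrinks monotonically, late steps are almost pure noise; running the learned reversal for more steps recovers more detail, which is the quality-vs-speed trade-off described above.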
Key Models
⚠️ Image generation cost: DALL-E 3 and Midjourney are API-based and carry per-image costs. Stable Diffusion and Flux are free to run locally but require VRAM (24GB+ for quality). Choose based on budget and latency requirements.
04 — Speech & Sound
Audio Models
Multimodal AI extends to sound: speech-to-text (ASR), text-to-speech (TTS), and audio understanding. These enable voice interfaces, accessibility, content creation, and speech analysis at scale.
Speech Recognition (ASR)
Whisper (OpenAI) is the gold standard: robust, multilingual, and open-source. Trained on 680K hours of multilingual audio from the web. Fast inference, handles accents and background noise well. Use for transcription, subtitles, voice command interfaces.
Text-to-Speech (TTS)
TTS converts text back into natural-sounding speech. Hosted APIs such as OpenAI's tts-1 (used in the pipeline example below) offer multiple voices; open-weight alternatives exist for local deployment.
✓ Audio workflows: Transcribe user speech → send to LLM → generate response → convert back to speech. This creates natural voice interfaces without needing separate dialogue models.
05 — Implementation
Working Code Examples
1. Vision: Analyze Image with Claude
```python
import anthropic
import base64

client = anthropic.Anthropic()

def analyze_image(image_path: str, question: str) -> str:
    """Analyze an image and answer a question about it."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return resp.content[0].text

# Usage
answer = analyze_image("photo.jpg", "What is in this image?")
print(answer)
```
2. Audio: Transcribe with Whisper
```python
import whisper

def transcribe_audio(audio_path: str) -> str:
    """Transcribe an audio file to text using Whisper."""
    model = whisper.load_model("base")  # load once and reuse in production
    result = model.transcribe(audio_path)
    return result["text"]

# Usage
text = transcribe_audio("speech.mp3")
print(f"Transcription: {text}")
```
3. Image Generation with Stable Diffusion
```python
from diffusers import StableDiffusionPipeline
import torch

def generate_image(prompt: str, output_path: str):
    """Generate an image from a text prompt."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")
    image = pipe(prompt).images[0]
    image.save(output_path)
    print(f"Image saved to {output_path}")

# Usage
generate_image("A futuristic city at sunset", "city.png")
```
4. Complete Multimodal Pipeline
```python
# Full workflow: speech → transcribe → analyze image → generate response → speak
from openai import OpenAI

# Step 1: User speaks
# audio_input = record_audio()

# Step 2: Transcribe speech
user_query = transcribe_audio("user_audio.wav")

# Step 3: Analyze image with query
answer = analyze_image("photo.jpg", user_query)

# Step 4: Generate speech from answer
client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=answer,
)
response.stream_to_file("response.wav")

print(f"User asked: {user_query}")
print(f"Answer: {answer}")
```
💡 Best practice: Start with a single modality (e.g., image analysis), test thoroughly, then layer in others. Multimodal pipelines compound latency — monitor end-to-end performance.
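One way to monitor that compounding latency is to time each stage separately rather than only end-to-end. A minimal sketch with a hypothetical three-stage pipeline (the sleeps stand in for real model calls; stage names are illustrative):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Hypothetical pipeline stages; sleeps simulate model-call latency.
with stage("transcribe"):
    time.sleep(0.02)
with stage("vision_analysis"):
    time.sleep(0.05)
with stage("tts"):
    time.sleep(0.01)

total = sum(timings.values())
for name, secs in timings.items():
    print(f"{name:>16}: {secs * 1000:6.1f} ms ({secs / total:5.1%} of total)")
```

Per-stage percentages make it obvious which modality to optimize first (often the VLM call, per the 2–5× slowdown noted later).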
06 — Child Concepts
What to Explore Next
Multimodal AI is vast. Dive deeper into specific modalities and tasks:
1. Vision-Language Models → concepts/vision-language.html
Deep dive into how VLMs work, model architectures (ViT + LLM fusion), fine-tuning on custom tasks, and building image-understanding applications.
2. Image Generation → concepts/image-gen.html
Diffusion model mechanics, latent-space generation, prompt engineering for image generation, fine-tuning on custom styles, and production deployment.
3. Audio Models → concepts/audio-models.html
ASR systems (Whisper internals), TTS architectures, voice cloning, emotional prosody, and building conversational voice interfaces.
4. Video Models → concepts/video-models.html
Video generation (Sora, Runway), temporal reasoning, frame interpolation, and working with long-context video understanding.
07 — Further Reading
References
Academic Papers
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020.
- Liu, H. et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485.
- Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356.
- Li, C. et al. (2023). Multimodal Foundation Models: From Specialists to General-Purpose Assistants. arXiv:2309.07915.
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). arXiv:2010.11929.

Blog
- OpenAI. (2023). GPT-4V(ision) System Card. openai.com.
08 — Evaluation
Multimodal Evaluation and Benchmarks
Evaluating multimodal models requires benchmarks that test both individual modalities and cross-modal understanding. No single benchmark captures everything — use a suite covering perception, reasoning, and generation quality.
| Benchmark | Tests | 2024 SOTA |
| --- | --- | --- |
| MMMU | College-level VQA across 30 subjects | GPT-4o 69.1%, Gemini 1.5 Pro 65.8% |
| MMBench | Perception, reasoning, knowledge | GPT-4V 75.8% |
| DocVQA | Document understanding, tables, forms | InternVL2 96.4% |
| TextVQA | OCR and text reading in images | GPT-4V 78.0% |
| SEED-Bench | Image + video spatial reasoning | Gemini 1.5 Pro 74.6% |
| FID (images) | Generation quality (lower = better) | FLUX.1 ~2.0 |
For production multimodal systems, move beyond leaderboard scores. Build a domain-specific eval set with images representative of your actual use case. Measure: extraction accuracy (structured outputs from images), hallucination rate (does the model invent content not in the image?), and latency (VLMs are typically 2–5× slower than text-only calls for the same output length).
```python
# Automated multimodal eval: measure structured-extraction accuracy on an image
import base64
import json
from pathlib import Path

import openai

client = openai.OpenAI()

def eval_image_extraction(image_path: str, expected: dict) -> dict:
    """Test whether a VLM accurately extracts structured data from an image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract: {" + ", ".join(expected.keys()) + "} as JSON."},
            ],
        }],
        response_format={"type": "json_object"},
    )
    actual = json.loads(response.choices[0].message.content)
    correct = sum(
        1 for k, v in expected.items()
        if str(actual.get(k, "")).lower() == str(v).lower()
    )
    return {"accuracy": correct / len(expected), "actual": actual, "expected": expected}

result = eval_image_extraction(
    "invoice.jpg",
    {"vendor": "Acme Corp", "total": "1250.00", "date": "2024-03-15"},
)
print(f"Accuracy: {result['accuracy'] * 100:.0f}%")
```
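Hallucination rate, mentioned above, can be probed the same way: ask about objects known to be absent from the image and score whether the model declines. A sketch of the scoring half only, with hand-written probe answers standing in for real VLM responses; the marker list is a naive keyword heuristic, not a robust classifier:

```python
def hallucination_rate(answers: list[str]) -> float:
    """Fraction of probe answers that assert details about an absent object.

    Each answer responds to a question like "What color is the bicycle?"
    asked about an image known to contain no bicycle; a grounded model
    should decline rather than invent an attribute.
    """
    decline_markers = ("no ", "not ", "doesn't", "does not", "don't",
                       "can't see", "cannot see")
    hallucinated = sum(
        1 for a in answers
        if not any(m in a.lower() for m in decline_markers)
    )
    return hallucinated / len(answers)

# Example probe answers (would come from the VLM in practice):
answers = [
    "There is no bicycle in this image.",
    "The bicycle is red.",                            # hallucination
    "I cannot see a clock anywhere in the photo.",
    "I don't see any text on the sign.",
]
print(f"Hallucination rate: {hallucination_rate(answers):.0%}")  # 25%
```

In practice, pair this with an LLM-as-judge or exact entity matching, since keyword heuristics miss hedged hallucinations ("it might be red").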