01 — Foundation
What Multimodal Means
Multimodal AI processes information from multiple data formats simultaneously: images, text, audio, video, and numerical data. Traditional LLMs understand only text. Multimodal models unify different modalities into a shared embedding space, allowing a single model to reason across vision, language, and sound.
The core innovation is cross-modal alignment: embedding different modalities (image and text, speech and text) into the same vector space so similar concepts have similar representations regardless of modality. This enables image-to-text retrieval, visual question answering, captioning, and more.
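The alignment idea can be sketched with cosine similarity over toy vectors. Real systems use learned encoders like CLIP producing ~512-dim embeddings; the 4-dim vectors below are illustrative stand-ins only:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for encoder outputs.
image_dog = [0.9, 0.1, 0.0, 0.2]   # image encoder output for a dog photo
text_dog  = [0.8, 0.2, 0.1, 0.1]   # text encoder output for "a dog"
text_car  = [0.0, 0.1, 0.9, 0.3]   # text encoder output for "a car"

# Contrastive training pulls matching pairs together across modalities,
# so an image retrieves its matching caption by similarity alone.
print(cosine_similarity(image_dog, text_dog))  # high (same concept)
print(cosine_similarity(image_dog, text_car))  # low (different concept)
```

This is exactly what makes image-to-text retrieval a nearest-neighbor search rather than a generation problem.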
Two Directions: Understanding vs. Generation
| Direction | Input | Output | Models |
| --- | --- | --- | --- |
| Understanding | Image / audio / video | Text / embedding | GPT-4V, LLaVA, CLIP, Whisper |
| Generation | Text (prompt) | Image / audio / video | DALL-E 3, Stable Diffusion, Flux, TTS |
💡 Key distinction: Multimodal means one model handles multiple inputs. Cross-modal means aligning different modalities into the same space. Unified foundation models can understand and generate across all modalities.
02 — Understanding
Vision-Language Models
Vision-language models (VLMs) analyze images and answer questions about them. They combine a vision encoder (e.g., ViT) with an LLM decoder to output natural language descriptions, answers, or reasoning. Modern VLMs achieve remarkable performance on image understanding, visual reasoning, document parsing, and diagram analysis.
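The encoder-decoder fusion can be sketched as a linear projection from vision-feature space into the LLM's token-embedding space, the connector design popularized by LLaVA-style models. All dimensions below are toy values, not real model sizes:

```python
import random

random.seed(0)

def matmul(A, B):
    """Naive matrix multiply: (n×k) @ (k×m) -> (n×m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical sizes: the vision encoder emits 4 patch embeddings of dim 6;
# the LLM's token-embedding dim is 8.
num_patches, vis_dim, llm_dim = 4, 6, 8
patch_embeddings = [[random.gauss(0, 1) for _ in range(vis_dim)]
                    for _ in range(num_patches)]

# A learned linear projection maps vision features into the LLM's space.
projection = [[random.gauss(0, 0.1) for _ in range(llm_dim)]
              for _ in range(vis_dim)]
visual_tokens = matmul(patch_embeddings, projection)  # 4 pseudo-tokens, dim 8

# The LLM then attends over [visual tokens] + [text tokens] as one sequence.
text_tokens = [[random.gauss(0, 1) for _ in range(llm_dim)] for _ in range(3)]
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 tokens of dim 8
```

The projection is the only new trainable piece in the simplest designs; both the vision encoder and the LLM can start frozen.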
Key Models
1. GPT-4V — Most capable
OpenAI's multimodal model handles image input with strong visual reasoning, OCR, and chart interpretation.
- Available through the OpenAI API and in ChatGPT
- Excellent at document analysis, scene understanding, visual math
- Context window up to 128K tokens
- Highest cost but strongest performance
2. LLaVA — Open-source alternative
Large Language and Vision Assistant: an efficient, open-weight model for image understanding, instruction-tuned on 665K visual instruction samples.
- Free, run locally or via Replicate/HuggingFace
- Good for visual Q&A, captioning, scene understanding
- Faster inference than GPT-4V
- Smaller model, weaker on complex reasoning
3. PaliGemma — Google's lightweight option
Efficient image understanding with dense captioning, object localization, and VQA. Fast inference on consumer hardware.
- Based on Gemma 2B architecture
- Strong on dense tasks (OCR, region understanding)
- Low latency, suitable for real-time applications
- Open weights, customizable
4. Qwen-VL — High resolution
Alibaba's vision-language model optimized for high-resolution images and dense text understanding.
- Handles 1280×1280 images natively
- Strong on document understanding and OCR
- Supports multiple languages
- Competitive with GPT-4V on many benchmarks
✓ When to use VLMs: Document extraction, visual Q&A, image captioning, accessibility (alt text generation), scene understanding, diagramming, architectural analysis.
03 — Generation
Image Generation
Diffusion models have become the backbone of modern image generation. Instead of directly generating pixels, they iteratively refine noise into coherent images guided by text prompts. This approach is more stable and controllable than older GAN-based methods.
Diffusion Model Pipeline
Diffusion adds noise to an image step-by-step, then trains a neural network to reverse this process. At generation time, the model starts with pure noise and iteratively denoises guided by a text prompt (via CLIP embeddings or cross-attention). More denoising steps → higher quality but slower; fewer steps → faster but lower quality.
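The forward (noising) half of this process has a closed form. A toy sketch with a single scalar "pixel" and an assumed linear beta schedule shows the surviving signal fraction decaying step by step; the trained network's entire job is learning to reverse this:

```python
import math
import random

random.seed(0)

# Assumed linear beta schedule: noise variance added per step (toy values).
T = 10
betas = [0.02 + 0.08 * t / (T - 1) for t in range(T)]

# alpha_bar_t: fraction of the original signal surviving after t steps.
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= (1 - beta)
    alpha_bars.append(prod)

x0 = 0.7  # a single scalar "pixel" standing in for an image
for t in [0, T // 2, T - 1]:
    eps = random.gauss(0, 1)
    # Closed-form forward process: x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε
    xt = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1 - alpha_bars[t]) * eps
    print(f"step {t}: signal fraction {alpha_bars[t]:.2f}, noisy value {xt:+.2f}")
```

Because ᾱ_t shrinks monotonically, late steps are almost pure noise; running the learned reversal for more steps recovers more detail, which is the quality-vs-speed trade-off described above.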
Key Models
⚠️ Image generation cost: DALL-E 3 and Midjourney are API-based and carry per-image costs. Stable Diffusion and Flux are free to run locally but require VRAM (24GB+ for quality). Choose based on budget and latency requirements.
04 — Speech & Sound
Audio Models
Multimodal AI extends to sound: speech-to-text (ASR), text-to-speech (TTS), and audio understanding. These enable voice interfaces, accessibility, content creation, and speech analysis at scale.
Speech Recognition (ASR)
Whisper (OpenAI) is the gold standard: robust, multilingual, and open-source. Trained on 680K hours of multilingual audio from the web. Fast inference, handles accents and background noise well. Use for transcription, subtitles, voice command interfaces.
Text-to-Speech (TTS)
TTS converts text back into natural-sounding speech. Hosted APIs such as OpenAI's tts-1 (used in the pipeline example below) offer multiple voices; open-weight alternatives exist for local deployment.
✓ Audio workflows: Transcribe user speech → send to LLM → generate response → convert back to speech. This creates natural voice interfaces without needing separate dialogue models.
05 — Implementation
Working Code Examples
1. Vision: Analyze Image with Claude
```python
import anthropic
import base64

client = anthropic.Anthropic()

def analyze_image(image_path: str, question: str) -> str:
    """Analyze an image and answer a question about it."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return resp.content[0].text

# Usage
answer = analyze_image("photo.jpg", "What is in this image?")
print(answer)
```
2. Audio: Transcribe with Whisper
```python
import whisper

def transcribe_audio(audio_path: str) -> str:
    """Transcribe an audio file to text using Whisper."""
    model = whisper.load_model("base")  # load once and reuse in production
    result = model.transcribe(audio_path)
    return result["text"]

# Usage
text = transcribe_audio("speech.mp3")
print(f"Transcription: {text}")
```
3. Image Generation with Stable Diffusion
```python
from diffusers import StableDiffusionPipeline
import torch

def generate_image(prompt: str, output_path: str):
    """Generate an image from a text prompt."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")
    image = pipe(prompt).images[0]
    image.save(output_path)
    print(f"Image saved to {output_path}")

# Usage
generate_image("A futuristic city at sunset", "city.png")
```
4. Complete Multimodal Pipeline
```python
# Full workflow: speech → transcribe → analyze image → generate response → speak
from openai import OpenAI

# Step 1: User speaks
# audio_input = record_audio()

# Step 2: Transcribe speech
user_query = transcribe_audio("user_audio.wav")

# Step 3: Analyze image with query
answer = analyze_image("photo.jpg", user_query)

# Step 4: Generate speech from answer
client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=answer,
)
response.stream_to_file("response.wav")

print(f"User asked: {user_query}")
print(f"Answer: {answer}")
```
💡 Best practice: Start with a single modality (e.g., image analysis), test thoroughly, then layer in others. Multimodal pipelines compound latency — monitor end-to-end performance.
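One way to monitor that compounding latency is to time each stage separately rather than only end-to-end. A minimal sketch with a hypothetical three-stage pipeline (the sleeps stand in for real model calls; stage names are illustrative):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Hypothetical pipeline stages; sleeps simulate model-call latency.
with stage("transcribe"):
    time.sleep(0.02)
with stage("vision_analysis"):
    time.sleep(0.05)
with stage("tts"):
    time.sleep(0.01)

total = sum(timings.values())
for name, secs in timings.items():
    print(f"{name:>16}: {secs * 1000:6.1f} ms ({secs / total:5.1%} of total)")
```

Per-stage percentages make it obvious which modality to optimize first (often the VLM call, per the 2–5× slowdown noted later).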
06 — Child Concepts
What to Explore Next
Multimodal AI is vast. Dive deeper into specific modalities and tasks:
1. Vision-Language Models → concepts/vision-language.html
Deep dive into how VLMs work, model architectures (ViT + LLM fusion), fine-tuning on custom tasks, and building image-understanding applications.
2. Image Generation → concepts/image-gen.html
Diffusion model mechanics, latent-space generation, prompt engineering for image generation, fine-tuning on custom styles, and production deployment.
3. Audio Models → concepts/audio-models.html
ASR systems (Whisper internals), TTS architectures, voice cloning, emotional prosody, and building conversational voice interfaces.
4. Video Models → concepts/video-models.html
Video generation (Sora, Runway), temporal reasoning, frame interpolation, and working with long-context video understanding.
07 — Further Reading
References
Academic Papers
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020.
- Liu, H. et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485.
- Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356.
- Li, C. et al. (2023). Multimodal Foundation Models: From Specialists to General-Purpose Assistants. arXiv:2309.07915.
- Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). arXiv:2010.11929.

Blog
- OpenAI. (2023). GPT-4V(ision) System Card. openai.com.
08 — Evaluation
Multimodal Evaluation and Benchmarks
Evaluating multimodal models requires benchmarks that test both individual modalities and cross-modal understanding. No single benchmark captures everything — use a suite covering perception, reasoning, and generation quality.
| Benchmark | Tests | 2024 SOTA |
| --- | --- | --- |
| MMMU | College-level VQA across 30 subjects | GPT-4o 69.1%, Gemini 1.5 Pro 65.8% |
| MMBench | Perception, reasoning, knowledge | GPT-4V 75.8% |
| DocVQA | Document understanding, tables, forms | InternVL2 96.4% |
| TextVQA | OCR and text reading in images | GPT-4V 78.0% |
| SEED-Bench | Image + video spatial reasoning | Gemini 1.5 Pro 74.6% |
| FID (images) | Generation quality (lower = better) | FLUX.1 ~2.0 |
For production multimodal systems, move beyond leaderboard scores. Build a domain-specific eval set with images representative of your actual use case. Measure: extraction accuracy (structured outputs from images), hallucination rate (does the model invent content not in the image?), and latency (VLMs are typically 2–5× slower than text-only calls for the same output length).
```python
# Automated multimodal eval: measure structured-extraction accuracy on an image
import base64
import json
from pathlib import Path

import openai

client = openai.OpenAI()

def eval_image_extraction(image_path: str, expected: dict) -> dict:
    """Test whether a VLM accurately extracts structured data from an image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract: {" + ", ".join(expected.keys()) + "} as JSON."},
            ],
        }],
        response_format={"type": "json_object"},
    )
    actual = json.loads(response.choices[0].message.content)
    correct = sum(
        1 for k, v in expected.items()
        if str(actual.get(k, "")).lower() == str(v).lower()
    )
    return {"accuracy": correct / len(expected), "actual": actual, "expected": expected}

result = eval_image_extraction(
    "invoice.jpg",
    {"vendor": "Acme Corp", "total": "1250.00", "date": "2024-03-15"},
)
print(f"Accuracy: {result['accuracy'] * 100:.0f}%")
```
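Hallucination rate, mentioned above, can be probed the same way: ask about objects known to be absent from the image and score whether the model declines. A sketch of the scoring half only, with hand-written probe answers standing in for real VLM responses; the marker list is a naive keyword heuristic, not a robust classifier:

```python
def hallucination_rate(answers: list[str]) -> float:
    """Fraction of probe answers that assert details about an absent object.

    Each answer responds to a question like "What color is the bicycle?"
    asked about an image known to contain no bicycle; a grounded model
    should decline rather than invent an attribute.
    """
    decline_markers = ("no ", "not ", "doesn't", "does not", "don't",
                       "can't see", "cannot see")
    hallucinated = sum(
        1 for a in answers
        if not any(m in a.lower() for m in decline_markers)
    )
    return hallucinated / len(answers)

# Example probe answers (would come from the VLM in practice):
answers = [
    "There is no bicycle in this image.",
    "The bicycle is red.",                            # hallucination
    "I cannot see a clock anywhere in the photo.",
    "I don't see any text on the sign.",
]
print(f"Hallucination rate: {hallucination_rate(answers):.0%}")  # 25%
```

In practice, pair this with an LLM-as-judge or exact entity matching, since keyword heuristics miss hedged hallucinations ("it might be red").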