
Multimodal AI

AI that sees, hears, and reads — beyond text-only models

The modalities: vision + text + audio
The mechanism: CLIP → alignment
Two directions: generate or understand
Contents
  1. What multimodal means
  2. Vision-language models
  3. Image generation
  4. Audio models
  5. Working code examples
  6. What to explore next
  7. References
01 — Foundation

What Multimodal Means

Multimodal AI processes information from multiple data formats simultaneously: images, text, audio, video, and numerical data. Traditional LLMs understand only text. Multimodal models unify different modalities into a shared embedding space, allowing a single model to reason across vision, language, and sound.

The core innovation is cross-modal alignment: embedding different modalities (image and text, speech and text) into the same vector space so similar concepts have similar representations regardless of modality. This enables image-to-text retrieval, visual question answering, captioning, and more.
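The alignment idea can be sketched with cosine similarity over a shared space. The 4-dimensional vectors below are made up for illustration; a real system would obtain them from CLIP's image and text encoders (typically 512–1024 dimensions), trained so that matching image/text pairs point in similar directions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for CLIP encoder outputs.
image_emb = {
    "dog_photo.jpg": np.array([0.9, 0.1, 0.0, 0.1]),
    "car_photo.jpg": np.array([0.0, 0.2, 0.9, 0.1]),
}
text_emb = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1, 0.0]),
    "a photo of a car": np.array([0.1, 0.1, 0.8, 0.2]),
}

# Text-to-image retrieval: rank images by similarity to the query text.
query = "a photo of a dog"
ranked = sorted(
    image_emb,
    key=lambda img: cosine(text_emb[query], image_emb[img]),
    reverse=True,
)
print(ranked[0])  # the dog photo scores highest for the dog caption
```

Because both modalities live in one space, the same trick runs in reverse (image-to-text retrieval) and underlies zero-shot classification: score an image against a set of candidate captions and pick the best match.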

Two Directions: Understanding vs. Generation

Direction | Input | Output | Models
Understanding | Image / audio / video | Text / embedding | GPT-4V, LLaVA, CLIP, Whisper
Generation | Text (prompt) | Image / audio / video | DALL-E 3, Stable Diffusion, Flux, TTS
💡 Key distinction: Multimodal means one model handles multiple inputs. Cross-modal means aligning different modalities into the same space. Unified foundation models can understand and generate across all modalities.
02 — Understanding

Vision-Language Models

Vision-language models (VLMs) analyze images and answer questions about them. They combine a vision encoder (e.g., ViT) with an LLM decoder to output natural language descriptions, answers, or reasoning. Modern VLMs achieve remarkable performance on image understanding, visual reasoning, document parsing, and diagram analysis.
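The encoder-decoder fusion can be sketched in a few lines of linear algebra. The dimensions below are made up for illustration (real models use e.g. 1024-d ViT features and a 4096-d LLM hidden size), and the random matrix stands in for a learned projection, in the style of LLaVA's vision-to-language adapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only.
n_patches, vit_dim, llm_dim, n_text_tokens = 16, 64, 128, 8

patch_embs = rng.normal(size=(n_patches, vit_dim))     # from the vision encoder (ViT)
text_embs = rng.normal(size=(n_text_tokens, llm_dim))  # from the LLM's token embedder

# A learned linear projection maps vision features into the LLM's
# embedding space; random weights here stand in for trained ones.
W_proj = rng.normal(size=(vit_dim, llm_dim))
image_tokens = patch_embs @ W_proj                     # shape (16, 128)

# The LLM decoder then attends over [image tokens; text tokens] as one sequence.
sequence = np.concatenate([image_tokens, text_embs], axis=0)
print(sequence.shape)  # (24, 128)
```

Once projected, image patches are just extra "tokens" to the decoder, which is why a pretrained LLM can be adapted to vision with relatively little multimodal training data.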

Key Models

1. GPT-4V — Most capable

OpenAI's multimodal model handles image input with strong visual reasoning, OCR, and chart interpretation.

  • Integrates with ChatGPT API
  • Excellent at document analysis, scene understanding, visual math
  • Context window up to 128K tokens
  • Highest cost but strongest performance
2. LLaVA — Open-source alternative

Large Language and Vision Assistant; efficient, open-weight model for image understanding. Trained on 665K image-text pairs.

  • Free, run locally or via Replicate/HuggingFace
  • Good for visual Q&A, captioning, scene understanding
  • Faster inference than GPT-4V
  • Smaller model, weaker on complex reasoning
3. PaliGemma — Google's lightweight VLM

Efficient image understanding with dense captioning, object localization, and VQA. Fast inference on consumer hardware.

  • Based on Gemma 2B architecture
  • Strong on dense tasks (OCR, region understanding)
  • Low latency, suitable for real-time applications
  • Open weights, customizable
4. Qwen-VL — High resolution

Alibaba's vision-language model optimized for high-resolution images and dense text understanding.

  • Handles 1280×1280 images natively
  • Strong on document understanding and OCR
  • Supports multiple languages
  • Competitive with GPT-4V on many benchmarks
When to use VLMs: Document extraction, visual Q&A, image captioning, accessibility (alt text generation), scene understanding, diagramming, architectural analysis.
03 — Generation

Image Generation

Diffusion models have become the backbone of modern image generation. Instead of directly generating pixels, they iteratively refine noise into coherent images guided by text prompts. This approach is more stable and controllable than older GAN-based methods.

Diffusion Model Pipeline

Diffusion adds noise to an image step-by-step, then trains a neural network to reverse this process. At generation time, the model starts with pure noise and iteratively denoises guided by a text prompt (via CLIP embeddings or cross-attention). More denoising steps → higher quality but slower; fewer steps → faster but lower quality.
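The forward (noising) half of the process can be sketched in a few lines. The linear beta schedule below uses common toy values; a real pipeline would also train a network to predict the added noise and run the learned reverse process at generation time.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy linear beta schedule: per-step noise variance added to the image.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def noise_image(x0: np.ndarray, t: int) -> np.ndarray:
    """Jump straight to step t of the forward process (closed form):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.normal(size=(8, 8))          # stand-in for an image
slightly_noisy = noise_image(x0, 10)
pure_noise = noise_image(x0, 999)

# Early steps keep most of the signal; by the final step almost none remains.
print(f"signal kept at t=10: {alphas_bar[10]:.3f}, at t=999: {alphas_bar[999]:.5f}")
```

Generation runs this in reverse: start from pure noise and take T (or, with fast samplers, far fewer) denoising steps, each conditioned on the text prompt, which is exactly where the steps-vs-quality trade-off comes from.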

Key Models

  • DALL-E 3 (Generation): OpenAI's state-of-the-art generative model with strong prompt interpretation and photorealism.
  • Stable Diffusion 3 (Generation): Open-source flow-matching model for high-quality image generation, locally runnable.
  • Flux (Generation): Black Forest Labs' transformer-based diffusion model for superior image quality and speed.
  • Midjourney (Generation): Closed commercial service with Discord integration, known for artistic style control.
  • ControlNet (Generation): Fine-grained control over diffusion models using edge maps, poses, depth, segmentation.
  • Hugging Face Diffusers (Framework): Open-source library for working with diffusion models locally.
⚠️ Image generation cost: DALL-E 3 and Midjourney are API-based and carry per-image costs. Stable Diffusion and Flux are free to run locally but require VRAM (24GB+ for quality). Choose based on budget and latency requirements.
04 — Speech & Sound

Audio Models

Multimodal AI extends to sound: speech-to-text (ASR), text-to-speech (TTS), and audio understanding. These enable voice interfaces, accessibility, content creation, and speech analysis at scale.

Speech Recognition (ASR)

Whisper (OpenAI) is the gold standard: robust, multilingual, and open-source. Trained on 680K hours of multilingual audio from the web. Fast inference, handles accents and background noise well. Use for transcription, subtitles, voice command interfaces.

Text-to-Speech (TTS)

  • OpenAI TTS (TTS): Simple API for natural-sounding speech in 6 voices. Low latency, streaming support.
  • ElevenLabs (TTS): Custom voice cloning, emotional tone control, multilingual support with natural prosody.
  • Google Cloud TTS (TTS): Enterprise-grade, 250+ voices, supports markup for pronunciation control.
  • AudioCraft (Framework): Meta's model family for music generation and audio synthesis from text.
  • Whisper (ASR): OpenAI's robust multilingual speech recognition, open-source, handles accents and noise.
Audio workflows: Transcribe user speech → send to LLM → generate response → convert back to speech. This creates natural voice interfaces without needing separate dialogue models.
05 — Implementation

Working Code Examples

1. Vision: Analyze Image with Claude

import anthropic
import base64

client = anthropic.Anthropic()

def analyze_image(image_path: str, question: str) -> str:
    """Analyze an image and answer a question about it."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return resp.content[0].text

# Usage
answer = analyze_image("photo.jpg", "What is in this image?")
print(answer)

2. Audio: Transcribe with Whisper

import whisper

def transcribe_audio(audio_path: str) -> str:
    """Transcribe an audio file to text using Whisper."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]

# Usage
text = transcribe_audio("speech.mp3")
print(f"Transcription: {text}")

3. Image Generation with Stable Diffusion

from diffusers import StableDiffusionPipeline
import torch

def generate_image(prompt: str, output_path: str):
    """Generate an image from a text prompt."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")
    image = pipe(prompt).images[0]
    image.save(output_path)
    print(f"Image saved to {output_path}")

# Usage
generate_image("A futuristic city at sunset", "city.png")

4. Complete Multimodal Pipeline

# Full workflow: speech → transcribe → analyze image → generate response → speak
from openai import OpenAI

# Step 1: User speaks
# audio_input = record_audio()

# Step 2: Transcribe speech
user_query = transcribe_audio("user_audio.wav")

# Step 3: Analyze the image with the transcribed query
answer = analyze_image("photo.jpg", user_query)

# Step 4: Generate speech from the answer
client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=answer,
)
response.stream_to_file("response.wav")

print(f"User asked: {user_query}")
print(f"Answer: {answer}")
💡 Best practice: Start with a single modality (e.g., image analysis), test thoroughly, then layer in others. Multimodal pipelines compound latency — monitor end-to-end performance.
06 — Child Concepts

What to Explore Next

Multimodal AI is vast. Dive deeper into specific modalities and tasks:

1. Vision-Language Models → concepts/vision-language.html

Deep dive into how VLMs work, model architectures (ViT + LLM fusion), fine-tuning on custom tasks, and building image-understanding applications.

2. Image Generation → concepts/image-gen.html

Diffusion model mechanics, latent-space generation, prompt engineering for image generation, fine-tuning on custom styles, and production deployment.

3. Audio Models → concepts/audio-models.html

ASR systems (Whisper internals), TTS architectures, voice cloning, emotional prosody, and building conversational voice interfaces.

4. Video Models → concepts/video-models.html

Video generation (Sora, Runway), temporal reasoning, frame interpolation, and working with long-context video understanding.

07 — Further Reading

References

Multimodal Evaluation and Benchmarks

Evaluating multimodal models requires benchmarks that test both individual modalities and cross-modal understanding. No single benchmark captures everything — use a suite covering perception, reasoning, and generation quality.

Benchmark | Tests | 2024 SOTA
MMMU | College-level VQA across 30 subjects | GPT-4o 69.1%, Gemini 1.5 Pro 65.8%
MMBench | Perception, reasoning, knowledge | GPT-4V 75.8%
DocVQA | Document understanding, tables, forms | InternVL2 96.4%
TextVQA | OCR and text reading in images | GPT-4V 78.0%
SEED-Bench | Image + video spatial reasoning | Gemini 1.5 Pro 74.6%
FID (images) | Generation quality (lower = better) | FLUX.1 ~2.0

For production multimodal systems, move beyond leaderboard scores. Build a domain-specific eval set with images representative of your actual use case. Measure: extraction accuracy (structured outputs from images), hallucination rate (does the model invent content not in the image?), and latency (VLMs are typically 2–5× slower than text-only calls for the same output length).

# Automated multimodal eval: check if the model hallucinates about images
import base64
import json
from pathlib import Path

import openai

client = openai.OpenAI()

def eval_image_extraction(image_path: str, expected: dict) -> dict:
    """Test whether a VLM accurately extracts structured data from an image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract: {" + ", ".join(expected.keys()) + "} as JSON."},
            ],
        }],
        response_format={"type": "json_object"},
    )
    actual = json.loads(response.choices[0].message.content)
    correct = sum(
        1 for k, v in expected.items()
        if str(actual.get(k, "")).lower() == str(v).lower()
    )
    return {"accuracy": correct / len(expected), "actual": actual, "expected": expected}

result = eval_image_extraction(
    "invoice.jpg",
    {"vendor": "Acme Corp", "total": "1250.00", "date": "2024-03-15"},
)
print(f"Accuracy: {result['accuracy']*100:.0f}%")