System Design

Multimodal Native

Architecting AI systems that treat multiple modalities (text, image, audio, video) as first-class inputs and outputs from the ground up — not as bolted-on extensions.

Modalities
text/image/audio/video
Complexity increase
3–5×
User value
high for rich domains

SECTION 01

Why Multimodal Native?

Text-only AI misses most of human communication: medical imaging, product photos, audio recordings, instructional videos. Multimodal-native systems don't convert everything to text first — they process each modality in its natural form and reason across modalities jointly. This unlocks use cases: visual QA, audio transcription + reasoning, document understanding with layout.

SECTION 02

Input Routing & Preprocessing

Route incoming requests by MIME type to the appropriate preprocessor before sending to the model. Images: resize, convert to base64 or URL reference. Audio: transcribe (Whisper) or pass raw to an audio-capable model. Video: extract keyframes (every N seconds) plus the audio track. Documents: extract text and images with layout preservation.

import mimetypes
from pathlib import Path

# Helpers assumed defined elsewhere: to_data_url, transcribe_audio, extract_pdf_pages.
def preprocess_input(file_path: str) -> dict:
    """Route a file to the right preprocessor based on its MIME type."""
    mime, _ = mimetypes.guess_type(file_path)
    if mime and mime.startswith("image/"):
        return {"type": "image_url", "image_url": {"url": to_data_url(file_path)}}
    if mime and mime.startswith("audio/"):
        transcript = transcribe_audio(file_path)  # e.g. Whisper
        return {"type": "text", "text": f"[Audio transcript]: {transcript}"}
    if mime == "application/pdf":
        pages = extract_pdf_pages(file_path)  # text + images, layout preserved
        return {"type": "multipart", "pages": pages}
    # Fall back to plain text for everything else.
    return {"type": "text", "text": Path(file_path).read_text()}

SECTION 03

Cross-Modal Retrieval

Standard text embeddings can't retrieve by image similarity. Use CLIP or SigLIP for image-text joint embeddings — the same vector space for queries ('a cat on a chair') and images. Store both image embeddings and text description embeddings in your vector DB. At query time, embed the user's text query and retrieve the nearest images or text chunks jointly.
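As a sketch of the retrieval step: a real system would embed queries and images with CLIP or SigLIP; here unit-normalized toy vectors stand in for encoder outputs, but the mechanics — one cosine-similarity search over a mixed index of image and caption embeddings — are the same.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so the dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cross_modal_search(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 2):
    """Return indices of the k nearest items (any modality) to the query."""
    sims = normalize(index_embs) @ normalize(query_emb)
    return np.argsort(-sims)[:k]

# Toy stand-ins for CLIP outputs: image and caption embeddings share one space.
index = np.array([
    [0.9, 0.1, 0.0],   # image: cat on a chair
    [0.8, 0.2, 0.1],   # caption: "a cat sitting on a chair"
    [0.0, 0.1, 0.9],   # image: mountain landscape
])
query = np.array([0.85, 0.15, 0.05])  # embedded text query: "a cat on a chair"
print(cross_modal_search(query, index, k=2))  # both cat items rank above the landscape
```

Because images and their captions live in the same vector space, a single search surfaces both kinds of hits; re-rank or filter by modality downstream if the application needs it.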

SECTION 04

Output Generation Strategy

For multi-modal outputs, chain specialists: text generation (LLM) + image generation (DALL-E, SDXL) + speech synthesis (ElevenLabs, TTS). Coordinate outputs into a coherent response: generate text first, extract image prompts from text, generate images, stitch together. Alternatively, use a native multimodal model that can interleave text and image tokens (Gemini, GPT-4o).
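The chaining strategy can be sketched as an orchestrator with pluggable generators. The [IMAGE: ...] marker convention and the stub generators below are illustrative, not any particular API:

```python
import re
from typing import Callable

def compose_response(prompt: str,
                     gen_text: Callable[[str], str],
                     gen_image: Callable[[str], bytes]) -> dict:
    """Chain specialists: generate text first, then an image per [IMAGE: ...] marker."""
    text = gen_text(prompt)
    image_prompts = re.findall(r"\[IMAGE:\s*(.+?)\]", text)
    images = [gen_image(p) for p in image_prompts]
    # Strip the markers from user-facing text; ship generated assets alongside.
    clean_text = re.sub(r"\s*\[IMAGE:\s*.+?\]", "", text)
    return {"text": clean_text, "images": images}

# Stubs stand in for an LLM and an image model (DALL-E, SDXL, ...).
result = compose_response(
    "Explain photosynthesis with a diagram.",
    gen_text=lambda p: "Photosynthesis converts light to energy. [IMAGE: leaf chloroplast diagram]",
    gen_image=lambda p: f"<png bytes for: {p}>".encode(),
)
print(result["text"], len(result["images"]))
```

Injecting the generators as callables keeps the orchestration testable without network calls and makes swapping providers a one-line change.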

SECTION 05

Evaluation Challenges

Image quality is subjective; use FID, CLIP score, and human eval. Audio accuracy: WER (word error rate) for transcription, MOS (mean opinion score) for synthesis. Cross-modal coherence: does the generated image match the text description? Use a CLIP coherence score. Multi-modal evals require specialised harnesses — start with text accuracy, add modality-specific metrics incrementally.
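WER, the transcription metric mentioned above, is just word-level edit distance normalized by reference length; a minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits (sub/ins/del) to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```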

SECTION 06

Reference Architecture

Ingestion layer: accept any MIME type, route to modality preprocessors. Embedding layer: modality-specific encoders → shared vector space. Retrieval layer: cross-modal vector search + re-ranking. Generation layer: multimodal LLM for reasoning + specialist generators for output modalities. Evaluation layer: per-modality + cross-modal quality checks.
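A minimal sketch of how these five layers might be wired together; all component names and signatures here are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MultimodalPipeline:
    """Each slot is a pluggable component matching one layer of the architecture."""
    preprocess: Callable[[Any], dict]       # ingestion: MIME routing
    embed: Callable[[dict], list]           # modality encoder -> shared vector space
    retrieve: Callable[[list], list]        # cross-modal vector search + re-ranking
    generate: Callable[[dict, list], str]   # multimodal LLM + specialist generators
    evaluate: Callable[[str], dict]         # per-modality + cross-modal checks

    def run(self, raw_input: Any) -> dict:
        item = self.preprocess(raw_input)
        context = self.retrieve(self.embed(item))
        output = self.generate(item, context)
        return {"output": output, "quality": self.evaluate(output)}

# Stub components show the data flow end to end.
pipe = MultimodalPipeline(
    preprocess=lambda x: {"type": "text", "text": x},
    embed=lambda item: [0.1, 0.2, 0.3],
    retrieve=lambda emb: ["retrieved caption"],
    generate=lambda item, ctx: f"answer using {ctx[0]}",
    evaluate=lambda out: {"length_ok": len(out) > 0},
)
print(pipe.run("describe this scene"))
```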

SECTION 07

Vision-Language Integration Patterns

Multimodal-native models (GPT-4V, Claude 3, Gemini) jointly process text, images, and sometimes audio in a single forward pass. Key design patterns: (1) token patching — convert images to tokens via a Vision Transformer and concatenate with text embeddings for joint processing; (2) fusion layers — separate encoders for each modality, fused via cross-attention; (3) late fusion — separate forward passes, combined at the decision layer. Early fusion (at the token level) typically outperforms late fusion because the model can learn fine-grained correlations. Challenges: long-range dependencies (an image with many regions plus long text can run to hundreds of thousands of tokens), alignment (which image regions correspond to which text phrases?), and efficiency (processing 100K tokens per request is expensive).

Modality Combination       Input Tokens   Complexity   Application
Text + Image               1K–10K         Low          VQA, caption generation
Text + Audio               500–5K         Low          Speech recognition, QA
Text + Image + Audio       10K–100K       Very High    Video understanding
Text + 3D/Point Cloud      50K+           Very High    Scene understanding
def multimodal_encode(text: str, image_path: str, model_name: str = "gpt-4-vision"):
    """Encode text and image jointly for multimodal understanding."""
    import base64
    # `client` is an OpenAI-compatible client created elsewhere, e.g. client = OpenAI().
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]}
        ]
    )
    return response.choices[0].message.content

SECTION 08

Evaluation Benchmarks & Challenges

Evaluating multimodal models requires task-specific benchmarks. For vision-language: COCO Captions (caption generation), VQA (question answering on images), RefCOCO (referring expression comprehension). For audio-language: Spoken SQuAD (reading comprehension over spoken passages) and related spoken QA benchmarks. For video: YouCook2 (dense captioning of cooking procedures), TVQA (temporal QA on videos). Standard metrics: BLEU/METEOR (captions), Exact Match (QA), mAP (detection). Hidden challenges: distribution shift (models trained on web images fail on medical or satellite images), robustness to adversarial perturbations, and fairness (do they work equally well for all demographic groups?). Cross-modal retrieval (finding images matching text descriptions) is a separate evaluation challenge with its own metrics (Recall@K, NDCG).
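Of the retrieval metrics, Recall@K is the simplest to compute — the fraction of relevant items that appear in the top-k results:

```python
def recall_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of relevant items appearing in the top-k retrieved list."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Retrieval returned this ranking; two items are actually relevant.
ranked = ["img7", "img2", "img9", "img4"]
relevant = {"img2", "img4"}
print(recall_at_k(ranked, relevant, k=2))  # only img2 made the top-2 -> 0.5
```

Report Recall@K at several cutoffs (1, 5, 10); NDCG additionally rewards ranking the relevant items higher.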

Tokenization for multimodal inputs is non-trivial. For text, tokenizers map words to integers; for images, a Vision Transformer (ViT) extracts patch embeddings. Combining them requires alignment: how many image tokens represent one text token? GPT-4V uses a dynamic ratio (different images contribute different token counts based on resolution). Larger images = more tokens = higher cost. This creates an economic incentive to use lower-resolution images, but quality suffers. Some models allow users to specify resolution (e.g., "low, medium, high"), trading off quality and cost. For audio, spectrograms are converted to embeddings; the alignment problem is similar. Video is the hardest: process every frame (expensive) or sample frames (lose information)? Models use a mix: sample keyframes, extract features from each, then use temporal pooling or attention. The resulting token count can be 10–100x higher than text-only inputs.
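To make the cost concrete, here is an estimator following the tiling arithmetic OpenAI has documented for GPT-4 with vision (85 base tokens plus 170 per 512-px tile at high detail); the constants vary by model and over time, so treat this as illustrative:

```python
import math

def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate image token cost under a GPT-4V-style tiling scheme."""
    if detail == "low":
        return 85  # fixed cost regardless of resolution
    # Scale to fit within 2048x2048, then bring the shortest side to 768.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(vision_tokens(1024, 1024))         # 4 tiles -> 765 tokens
print(vision_tokens(1024, 1024, "low"))  # 85 tokens
```

The gap between the two numbers is exactly the quality/cost trade-off the resolution setting exposes.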

Cross-modal alignment is learned, not hand-engineered. Contrastive learning helps: train the model to align images with their text descriptions and audio with transcripts. The loss function pushes matching pairs together and mismatched pairs apart, teaching the model that an image of a cat lives near the text "cat". Models trained with strong alignment generalize better to novel inputs. Some recent models use masked modeling across modalities (hide some tokens, predict them from other modalities): given an image, predict its caption; given audio, predict the transcription. This bidirectional training improves alignment. For proprietary models (GPT-4V), alignment is trained on internet-scale data; for open models (LLaVA, Flamingo), alignment quality depends on the training data. Smaller alignment datasets lead to misalignment (odd captions, hallucinations).
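The contrastive objective described above can be written down directly. This NumPy sketch implements the symmetric InfoNCE loss that CLIP uses, on toy embeddings (a real trainer would use an autograd framework and learned encoders):

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: the i-th image and i-th text form a positive pair;
    every other pairing in the batch acts as a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # [batch, batch] cosine similarities

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        n = len(l)
        return -logp[np.arange(n), np.arange(n)].mean()  # diagonal = positives

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)         # matching pairs on the diagonal
shuffled = clip_contrastive_loss(emb, emb[::-1])  # mismatched pairs
print(aligned < shuffled)  # True: alignment lowers the contrastive loss
```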

Hallucination in multimodal models is a well-known problem. The model sees an image of a cat and a dog but generates text mentioning a bird that isn't there. This happens when image understanding is weak (a bird-like patch is misinterpreted) or when the language model generates plausible-sounding but false details. Mitigation strategies: (1) training on data with minimal hallucinations (require faithful captions), (2) constrained decoding (force the model to ground outputs in image regions), (3) post-hoc verification (ask a vision model to verify generated text against the image). None are perfect. For safety-critical applications (medical imaging diagnosis), don't rely on the model's confidence; use it as an assistant that highlights regions for human review. For generative tasks (creative captioning), some hallucination is acceptable, even stylistically useful. Understand your tolerance and build accordingly.
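Strategy (3), post-hoc verification, amounts to a generate-verify-retry loop. The generator and verifier below are stubs standing in for real models; the threshold and retry count are illustrative knobs:

```python
from typing import Callable

def caption_with_verification(image: bytes,
                              generate: Callable[[bytes], str],
                              verify: Callable[[bytes, str], float],
                              threshold: float = 0.8,
                              max_retries: int = 2) -> str:
    """Generate a caption, score its grounding with a separate vision model,
    and regenerate if the score falls below the threshold."""
    caption = generate(image)
    for _ in range(max_retries):
        if verify(image, caption) >= threshold:
            return caption
        caption = generate(image)  # retry; real systems would also vary the prompt
    return caption + " [low-confidence: flag for human review]"

# Stubs: a generator that first hallucinates a bird, and a verifier that catches it.
captions = iter(["a cat, a dog, and a bird", "a cat and a dog"])
result = caption_with_verification(
    b"<image bytes>",
    generate=lambda img: next(captions),
    verify=lambda img, cap: 0.3 if "bird" in cap else 0.95,
)
print(result)  # the hallucinated bird is rejected on the first pass
```

The human-review fallback at the end matches the safety-critical guidance above: when verification keeps failing, escalate rather than emit.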

Representation learning for multimodal data seeks a unified embedding space where semantically similar examples (across modalities) are close. Contrastive objectives (like CLIP's) achieve this: "dog" text, dog image, and dog audio should be near each other; dog and cat should be far apart. The loss function pushes matching pairs together and mismatched pairs apart. Scale matters: CLIP was trained on 400 million image-text pairs. Smaller models trained on smaller datasets suffer from weak alignment. Transfer learning helps: start with CLIP embeddings (trained on massive data), fine-tune on your task-specific data (smaller). This is more sample-efficient than training from scratch. Multi-task learning is complementary: jointly train on caption generation, VQA, and retrieval. Shared representations learned for one task benefit others. Some models (e.g., LLaVA) use LoRA (Low-Rank Adaptation) for efficient fine-tuning: add small learnable matrices to the alignment layers, keep base model frozen. This is faster and cheaper than full fine-tuning.
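The LoRA idea mentioned above is compact enough to show directly: keep the base weight W frozen and train only a low-rank update A·B. A NumPy sketch with illustrative dimensions:

```python
import numpy as np

def lora_forward(x: np.ndarray, W: np.ndarray,
                 A: np.ndarray, B: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """LoRA: y = x @ (W + alpha * A @ B), with A (d_in x r) and B (r x d_out).
    Only A and B are trained; the frozen base weight W is untouched."""
    return x @ W + alpha * (x @ A) @ B

d_in, d_out, rank = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))        # frozen pretrained weight
A = rng.normal(size=(d_in, rank)) * 0.01  # small random init
B = np.zeros((rank, d_out))               # B starts at zero: no initial change
x = rng.normal(size=(1, d_in))

# With B = 0 the adapted layer reproduces the base model exactly.
print(np.allclose(lora_forward(x, W, A, B), x @ W))  # True
# Trainable parameters: 2*d*r instead of d*d.
print(d_in * rank + rank * d_out, "vs", d_in * d_out)
```

The zero-initialized B is the standard trick: fine-tuning starts from exactly the pretrained behavior and drifts only as the adapter learns.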

Interpretability in multimodal models is harder than in single-modality models. For text-only, attention visualization reveals which words the model focuses on. For multimodal, you need to visualize alignment: which image regions did the model focus on when generating text? Attention heatmaps (overlay attention weights on the image) help. But attention isn't always interpretable—the model might attend to a region that's not semantically relevant. Saliency maps (gradient-based visualization showing which pixels affect the output) provide complementary views. For safety-critical applications (medical imaging), explainability is essential: show the clinician which regions the model attended to when making a diagnosis. This enables human verification and debugging. For research, interpretability tools reveal whether the model is learning meaningful alignments or exploiting spurious correlations. Invest in these tools early; they're invaluable for development and debugging.
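Gradient saliency is normally computed with autograd; this finite-difference sketch on a toy scorer shows the underlying idea — perturb each pixel and measure how much the output moves:

```python
import numpy as np

def saliency_map(image: np.ndarray, score_fn, eps: float = 1e-3) -> np.ndarray:
    """Approximate |d score / d pixel| by finite differences: pixels whose
    perturbation changes the output most are the most salient."""
    sal = np.zeros_like(image, dtype=float)
    base = score_fn(image)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps
        sal[idx] = abs(score_fn(bumped) - base) / eps
    return sal

# Toy "model": the score depends only on the top-left 2x2 region.
score = lambda img: img[:2, :2].sum()
image = np.ones((4, 4))
sal = saliency_map(image, score)
print(sal.round(1))  # the influential region lights up; the rest stays at zero
```

Real pipelines compute this in one backward pass rather than one forward pass per pixel, but the interpretation — a heatmap of output sensitivity to overlay on the image — is the same.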

Multimodal models excel at vision-language tasks but struggle with some modalities. Audio is underrepresented: most models are trained on text and images, not audio. Combining audio and text requires converting audio to text (speech-to-text, lossy) or embedding audio directly (rare). 3D data (point clouds, meshes) is rarely used; most models don't handle it. Video is computationally expensive (hundreds of frames = millions of tokens). Specialized models for these modalities are emerging but not yet mainstream. Cross-modality transfer is interesting: a model trained on image-text can transfer to video (treating video frames as images) or audio-text (if you have aligned training data). For new modalities, consider whether pretraining makes sense: training from scratch is expensive; starting from pretrained image-text models and fine-tuning on your modality is often better. Modality bridges (e.g., speech-to-text as a preprocessing step) allow you to use existing multimodal models with new input types, though you lose modality-specific information.