Multimodal · Video

Video AI Models

Video generation, understanding, and captioning — Sora, VideoLLaMA, and the temporal modelling challenge

4 task types
8 sections
7 models
Contents
  1. Task map
  2. Temporal modelling
  3. Video generation
  4. Understanding
  5. Efficient processing
  6. Evaluation metrics
  7. Tools & libraries
  8. References
01 — Landscape

Video AI Task Map

Video Generation: Text description → video frames. "A dog running on a beach" → 16 seconds of video. Conditional on text, or unconditional (pure generation).

Video Understanding/Captioning: Video frames → text description. What's happening? Answer questions about the video. Timestamps of actions.

Action Recognition: Video → label of actions occurring. Classification. "Person jumping" vs "person sitting."

Temporal Reasoning: Understanding causality, order, speed. "Before/after," "faster/slower," multi-step reasoning over frames.

💡 Key challenge: Video is massive: 30 fps × 60 seconds = 1800 frames. That's 1800 images to encode/decode. Transformers on frames are prohibitively expensive. Compression (video codecs) and hierarchical processing (patches, tokens) are essential.
02 — Architecture

Temporal Modelling Approaches

How do we represent video as input to a model?

Approach | Computation | Memory | Quality | Best For
3D Convolution (C3D) | O(T·H·W·C²) | High | Good | Action recognition (shorter videos)
Temporal Attention | O(T²·H·W·C) | Very high | Excellent | Video generation, long-range dependencies
Video Tokens | O(T·K²) | Low | Good (with a good tokenizer) | Large-scale generation, efficient processing
Factored Space-Time | O(T·H·W) + O(T²) | Low | Medium-high | Efficient long-range temporal reasoning

3D Conv: Convolve over space (H×W) and time (T). Standard but expensive for long videos. Good for short clips.

Temporal Attention: Transformer processes all frames, attends across time. High quality but quadratic in T. Restricted to short videos (4–16 frames typically).

Video Tokens: Compress video with a learned tokenizer (e.g., a VQ-VAE or MAGVIT-style encoder) → discrete tokens, like text, then process with a language model. Scales to longer videos. This is the video analogue of token-based audio models such as MusicGen and AudioLM (used in, e.g., VideoPoet).

Factored Space-Time: Separate spatial and temporal attention. Space attention per-frame, temporal attention across frames. Good tradeoff: linear in frames but retains some long-range capacity.
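The factored scheme can be sketched as a small PyTorch block. This is a minimal illustration, not code from any particular model; shapes assume B batches, T frames, N spatial tokens per frame, and C channels:

```python
import torch
import torch.nn as nn

class FactoredSpaceTimeBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention
    across frames at each spatial location (factored space-time)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, C = x.shape
        # Spatial pass: fold time into batch -> (B*T, N, C), attend over N tokens.
        s = x.reshape(B * T, N, C)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(B, T, N, C)
        # Temporal pass: fold space into batch -> (B*N, T, C), attend over T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x
```

Each attention call is quadratic only in its own axis (N or T), which is why the factorization stays tractable for longer clips than full space-time attention.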

03 — Generation

Video Generation Models

Sora (OpenAI): Diffusion transformer. Compresses video into a latent space and represents it as spacetime patches, processed by a DiT (Diffusion Transformer) backbone with text conditioning (trained on highly descriptive recaptioned data, per OpenAI's technical report). Generates up to 60 seconds at up to 1080p.

Mochi (Genmo): Latent diffusion. Encodes video to latent space (much smaller), diffuses in latent space, decodes. ~3× faster than pixel-space diffusion.

CogVideoX (THUDM): Transformer-based video generation. Temporal and spatial attention. Open-source, smaller than Sora but reasonable quality.

Model | Resolution | Duration | Speed | Quality
Sora | Up to 1080p | 60 sec | Slow (API) | Excellent, coherent motion
Mochi | 768×768 | ~10 sec | Medium | High quality, fast
CogVideoX | 576×1024 | ~6 sec | Medium | Good, open-source
Wan 2.1 | 1024×1024 | ~10 sec | Medium | Very high quality, realistic

Architecture Insight: All modern video models use latent diffusion + transformer backbone. The differences are in: - Compression codec (how much is latent space reduced) - Temporal attention mechanism (3D conv vs factored vs full) - Conditioning (text embedding quality, classifier-free guidance)

Python: CogVideoX (Open-Source)

from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",           # text-to-video checkpoint
    torch_dtype=torch.float16
).to("cuda")

prompt = "A dog running on a beach"
result = pipe(
    prompt=prompt,
    num_frames=49,                  # ~6 seconds at the model's native 8 fps
    guidance_scale=7.5,
)
frames = result.frames[0]           # list of PIL Images
export_to_video(frames, "dog.mp4", fps=8)
⚠️ Generation cost: Video generation is expensive. A single 1080p 60-second video from Sora likely costs $10+. Fine-tuning requires massive data. Most teams use APIs (Sora, Runway) rather than training.
04 — Understanding

Video Understanding Models

VideoLLaMA 2: Multimodal LLM for video. Encodes uniformly sampled video frames with a vision encoder and feeds them to an LLM. Answers questions about video content, including temporal reasoning.

LLaVA-Video: Similar architecture to image LLaVA but adapted for video. Lower frame sampling overhead, hierarchical temporal attention.

Frame Sampling Strategy: Can't process all frames (too expensive). Sample uniformly: 30 fps video → sample every 2nd frame (15 fps). Tradeoff: fewer frames are faster to process but may miss fast actions.

Python: VideoLLaMA 2

# Illustrative interface -- see the VideoLLaMA 2 repository for the exact
# loading and inference API, which differs between releases.
from videollama2 import VideoLLaMA2Model

model = VideoLLaMA2Model.from_pretrained("DAMO-NLP-SG/videollama2-7b")

query = "What is the person doing?"
answer = model.answer_question(
    video_path="video.mp4",
    question=query,
    num_frames=8,   # sampled frames
)
print(answer)

Long Video Handling

For videos > 1 minute, split into chunks, process each chunk, aggregate answers. Or use memory-augmented approach: keep compressed summary of previous chunks, process current chunk.
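The memory-augmented loop can be written model-agnostically. Here `describe_chunk` is a placeholder for whatever captioner you use (VideoLLaMA 2, a hosted multimodal API, etc.), not a real library call:

```python
from typing import Callable, Sequence

def answer_over_long_video(
    frames: Sequence,
    describe_chunk: Callable[[Sequence, str], str],
    question: str,
    chunk_size: int = 64,
) -> str:
    """Process a long video chunk by chunk, carrying a compressed running
    summary of earlier chunks as context for the next one."""
    summary = ""
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        prompt = f"Context so far: {summary}\nQuestion: {question}"
        summary = describe_chunk(chunk, prompt)
    return summary
```

The last summary doubles as the answer; a variant collects all chunk summaries and runs one final aggregation query over them.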

Benchmarks: the Video-ChatGPT benchmark, TVQA (temporal video QA), ActivityNet Captions. Measure caption quality (BLEU, CIDEr) and question-answering accuracy.

05 — Scaling

Efficient Video Processing

Frame Subsampling: Don't process every frame. Sample every N frames (e.g., every 3rd frame in 30fps video). Loss of fine-grained motion, but huge speedup.

Keyframe Extraction: Use optical flow or scene change detection to identify keyframes. Process only keyframes + interpolate. Preserves important moments.
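A minimal keyframe picker using frame differencing, a cheap stand-in for optical flow or a full scene-change detector. Frames are assumed to be grayscale numpy arrays in [0, 255]:

```python
import numpy as np

def keyframe_indices(frames: list, threshold: float = 20.0) -> list:
    """Keep frame 0, then any frame whose mean absolute difference from
    the previous *kept* frame exceeds the threshold."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[kept[-1]].astype(float)).mean()
        if diff > threshold:
            kept.append(i)
    return kept
```

Comparing against the last *kept* frame (rather than the immediately previous one) makes slow pans accumulate into a keyframe instead of being discarded frame by frame.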

Sliding Window Attention: Limit attention to neighboring frames (e.g., attend to ±4 frames). Linear complexity in T instead of quadratic. Slight quality loss, huge speedup.
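The windowing constraint amounts to a banded attention mask: each row has at most 2·window + 1 allowed entries, so the cost per layer is O(T·window) instead of O(T²):

```python
import numpy as np

def sliding_window_mask(num_frames: int, window: int = 4) -> np.ndarray:
    """Boolean (T, T) mask: frame t may attend to frames within +/- window."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Stacking several such layers still lets information propagate across the whole clip, since each layer widens the effective receptive field by another `window` frames.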

Hierarchical Video Representation: Level 1: keyframes. Level 2: scenes. Coarse-to-fine reasoning. Answers simple questions on keyframes, only expand to full frames if needed.

Memory Budget Constraints

A 60-second 1080p video at 24fps is 1440 frames. At 4 tokens per frame, that's 5760 tokens. If model context is 4k tokens, can't fit. Solution: Use compression (video codec → 0.2 tokens/frame), hierarchical processing, or temporal aggregation (concat consecutive frames).
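The arithmetic above as a small helper for checking whether a clip fits a given context window:

```python
def video_token_budget(seconds: float, fps: float, tokens_per_frame: float) -> int:
    """Total tokens needed to represent a clip at a given tokens-per-frame rate."""
    return int(seconds * fps * tokens_per_frame)

# 60 s at 24 fps, 4 tokens/frame -> 5760 tokens: overflows a 4k context.
# The same clip compressed to 0.2 tokens/frame -> 288 tokens: fits easily.
```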

💡 Practical insight: For on-device video processing, use lightweight codecs (H.265 > H.264 > MPEG-4) and aggressive frame subsampling (1 fps or less). For cloud inference, spend compute on fewer, better frames.
Python · Video understanding with GPT-4o (frame sampling approach)
import base64, cv2
from openai import OpenAI
from pathlib import Path

client = OpenAI()

def extract_frames(video_path: str, max_frames: int = 16) -> list[str]:
    """Sample frames evenly from a video and return as base64 strings."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / max_frames) for i in range(max_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            _, buf = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames

def analyze_video(video_path: str, question: str) -> str:
    """Answer a question about a video by sampling frames."""
    frames = extract_frames(video_path, max_frames=16)
    content = [{"type": "text", "text": f"These are {len(frames)} evenly-sampled frames from a video. {question}"}]
    for b64 in frames:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "low"}})
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=512
    ).choices[0].message.content

# Example
answer = analyze_video("demo.mp4", "What is happening in this video? Describe the main action.")
print(answer)
06 — Quality

Evaluation Metrics for Video

FVD (Fréchet Video Distance): Fréchet distance between I3D (Inflated 3D ConvNet) embeddings of real vs. generated videos. Lower is better. Analogous to FID for images. Gold standard for generation quality.
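The underlying Fréchet distance between two Gaussians, fit to the means and covariances of the two embedding sets, is short to write down. This is a sketch of the formula only; a real FVD run also needs the I3D network and a standard clip-sampling protocol:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """d^2 = |mu1 - mu2|^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})"""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions give a distance of zero; shifting one mean by a unit vector (with identity covariances) gives exactly 1.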

CLIP Similarity: Encode video frames + text with CLIP. Cosine similarity between text embedding and video embeddings. High similarity = good alignment to prompt.
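Given precomputed CLIP embeddings (the encoder calls themselves are omitted here), the score is an averaged cosine similarity between each frame embedding and the text embedding:

```python
import numpy as np

def clip_video_text_score(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between frame embeddings (N, D) and a text embedding (D,)."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((f @ t).mean())
```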

Inception Score (IS): Inception model predictions on frames. Measures diversity and quality. Less reliable than FVD, but faster.

Human Evaluation Dimensions: for generation, raters typically score visual quality, temporal coherence (no flicker or morphing), motion plausibility, and prompt adherence; automated metrics correlate only imperfectly with these, so human evaluation remains standard.

For understanding/captioning: BLEU, CIDEr, METEOR scores (compare generated captions to ground truth). For VQA: accuracy (% correct answers).

07 — Ecosystem

Tools & Libraries

Sora (API): OpenAI's video generation API. SOTA quality, 60-second videos.
CogVideoX (Generation): Open-source video generation. HuggingFace/Diffusers compatible.
Mochi (Generation): Open-weights latent diffusion video model. Fast, high quality.
VideoLLaMA (Understanding): Multimodal LLM for video. Open-source variants available.
LLaVA-Video (Understanding): Vision LLM adapted for video. Efficient frame processing.
decord (Decoding): Fast video reading. GPU-accelerated decoding.
OpenCV (Processing): Frame extraction, resizing, scene detection.
PyAV (Processing): Video I/O with ffmpeg. Flexible, low-level control.
08 — Further Reading

References
