Video generation, understanding, and captioning — Sora, VideoLLaMA, and the temporal modelling challenge
Video Generation: Text description → video frames. "A dog running on a beach" → 16 seconds of video. Conditional on text, or unconditional (pure generation).
Video Understanding/Captioning: Video frames → text description. What's happening? Answer questions about the video. Timestamps of actions.
Action Recognition: Video → label of actions occurring. Classification. "Person jumping" vs "person sitting."
Temporal Reasoning: Understanding causality, order, speed. "Before/after," "faster/slower," multi-step reasoning over frames.
How do we represent video as input to a model?
| Approach | Computation | Memory | Quality | Best For |
|---|---|---|---|---|
| 3D Convolution (C3D) | O(T·H·W·C²) | High | Good | Action recognition (shorter videos) |
| Temporal Attention | O(T²·H·W·C) | Very High | Excellent | Video generation, long-range dependencies |
| Video Tokens | O(T·K²) | Low | Good (with good codec) | Large-scale generation, efficient processing |
| Factored Space-Time | O(T·H·W) + O(T²) | Low | Medium-High | Efficient long-range temporal reasoning |
3D Conv: Convolve over space (H×W) and time (T). Standard but expensive for long videos. Good for short clips.
Temporal Attention: Transformer processes all frames, attends across time. High quality but quadratic in T. Restricted to short videos (4–16 frames typically).
Video Tokens: Compress video with a learned tokenizer (e.g., a VQ-VAE / MAGVIT-style video codec) → discrete tokens, like text. Process with a language model. Scales to longer videos. The video analog of the token-based approach used for audio in MusicGen and AudioLM.
Factored Space-Time: Separate spatial and temporal attention. Space attention per-frame, temporal attention across frames. Good tradeoff: linear in frames but retains some long-range capacity.
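The factored scheme above can be sketched with plain scaled dot-product attention: one pass within each frame, one pass across frames at each spatial location. This is a minimal NumPy illustration of the data movement, not any particular model's implementation; cost is roughly O(T·N²) + O(N·T²) instead of O((T·N)²) for full spatio-temporal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, batched over all leading axes.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d), axis=-1)
    return w @ v

def factored_space_time(x):
    """x: (T, N, C) — T frames, N spatial tokens per frame, C channels.
    Spatial pass attends over N within each frame; temporal pass
    attends over T at each spatial location."""
    x = attention(x, x, x)         # spatial: (T, N, N) weights per frame
    xt = np.swapaxes(x, 0, 1)      # (N, T, C)
    xt = attention(xt, xt, xt)     # temporal: (N, T, T) weights per location
    return np.swapaxes(xt, 0, 1)   # back to (T, N, C)
```

In a real model each pass would use learned Q/K/V projections and residual connections; the factorization itself is the point here.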
Sora (OpenAI): Diffusion transformer (DiT backbone) operating on spacetime patches of compressed latent video. Text conditioning via language embeddings of (re-captioned) prompts; exact details are unpublished. Reportedly generates up to 60 seconds at 1080p.
Mochi (Genmo): Latent diffusion. Encodes video to latent space (much smaller), diffuses in latent space, decodes. ~3× faster than pixel-space diffusion.
CogVideoX (THUDM): Transformer-based video generation. Temporal and spatial attention. Open-source, smaller than Sora but reasonable quality.
| Model | Resolution | Duration | Speed | Quality |
|---|---|---|---|---|
| Sora | Up to 1080p | 60 sec | Slow (API) | Excellent, coherent motion |
| Mochi | 768×768 | ~10 sec | Medium | High quality |
| CogVideoX | 576×1024 | ~6 sec | Medium | Good, open-source |
| Wan 2.1 | 1024×1024 | ~10 sec | Medium | Very high quality, realistic |
Architecture Insight: Most modern video generation models use latent diffusion with a transformer backbone. The differences lie in:
- Compression codec (how much the latent space is reduced)
- Temporal attention mechanism (3D conv vs. factored vs. full)
- Conditioning (text embedding quality, classifier-free guidance)
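The shared recipe — diffuse in a compressed latent space, then decode — can be sketched as a toy DDPM-style sampling loop. This is a schematic only: `denoise_fn` stands in for the transformer backbone, the linear beta schedule is illustrative, and a real system would decode the result with a VAE decoder.

```python
import numpy as np

def sample_latent_video(denoise_fn, shape=(8, 4, 32, 32), steps=50, seed=0):
    """Toy DDPM-style sampler over a video latent of shape (T, C, h, w).
    denoise_fn(x, t) -> predicted noise (any callable; stands in for
    the learned transformer). Linear beta schedule for illustration."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # start from pure noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t)              # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # re-inject noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x  # in a real pipeline: VAE-decode this latent to pixels
```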
VideoLLaMA 2: Multimodal LLM for video. Encodes video frames (sampled uniformly) with vision encoder, processes with LLM. Answer questions about video content, temporal reasoning.
LLaVA-Video: Similar architecture to image LLaVA but adapted for video. Lower frame sampling overhead, hierarchical temporal attention.
Frame Sampling Strategy: Can't process all frames (too expensive). Sample uniformly: 30 fps video → sample every 2nd frame (15 fps). Tradeoff: fewer frames is faster but risks missing fast actions.
For videos > 1 minute, split into chunks, process each chunk, aggregate answers. Or use memory-augmented approach: keep compressed summary of previous chunks, process current chunk.
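The chunk-and-summarize strategy can be expressed as a small skeleton. Here `summarize` and `answer` are hypothetical callables wrapping a video-LLM call — the skeleton only shows the control flow: carry a rolling compressed summary of earlier chunks, then answer from the final chunk plus that summary.

```python
def answer_long_video(frames, question, chunk_size, summarize, answer):
    """Memory-augmented long-video QA skeleton.
    summarize(chunk, prior_summary) -> new summary string
    answer(chunk, context, question) -> answer string
    Both are placeholders for model calls."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    summary = ""
    for chunk in chunks[:-1]:
        summary = summarize(chunk, summary)   # compress earlier content
    return answer(chunks[-1], summary, question)
```

The same shape works for per-chunk answering followed by aggregation; only the two callables change.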
Benchmarks: VideoChatGPT, TVQA (temporal video QA), ActivityNet captions. Measure caption quality (BLEU, CIDEr), question answering accuracy.
Frame Subsampling: Don't process every frame. Sample every N frames (e.g., every 3rd frame in 30fps video). Loss of fine-grained motion, but huge speedup.
Keyframe Extraction: Use optical flow or scene change detection to identify keyframes. Process only keyframes + interpolate. Preserves important moments.
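A cheap stand-in for optical-flow or scene-change detection is mean absolute difference between frames — a sketch of the idea, not a production detector (real systems use optical flow, histogram comparison, or learned detectors). The threshold here is an arbitrary illustrative value.

```python
import numpy as np

def keyframe_indices(frames, threshold=30.0):
    """Select keyframes by mean absolute pixel difference against the
    last kept keyframe. frames: list of (H, W) or (H, W, 3) uint8 arrays."""
    keys = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32)
                      - frames[keys[-1]].astype(np.float32))
        if diff.mean() > threshold:   # scene changed enough: new keyframe
            keys.append(i)
    return keys
```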
Sliding Window Attention: Limit attention to neighboring frames (e.g., attend to ±4 frames). Linear complexity in T instead of quadratic. Slight quality loss, huge speedup.
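The ±k-frame restriction amounts to a banded attention mask. A minimal construction:

```python
import numpy as np

def sliding_window_mask(T, window=4):
    """Boolean (T, T) mask: frame i may attend to frames within ±window.
    Each row has at most 2*window + 1 True entries, so attention cost
    drops from O(T²) to O(T·window)."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

In practice the mask is applied by setting disallowed attention logits to -inf before the softmax.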
Hierarchical Video Representation: coarse level = keyframes grouped into scenes; fine level = full frames. Coarse-to-fine reasoning: answer simple questions from keyframes alone, and expand to full frames only when needed.
A 60-second 1080p video at 24 fps is 1440 frames. At 4 tokens per frame, that's 5760 tokens — if the model's context is 4k tokens, it doesn't fit. Solutions: heavier compression (video codec → ~0.2 tokens/frame), hierarchical processing, or temporal aggregation (merging consecutive frames into one representation).
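The budget arithmetic above is worth making explicit:

```python
def video_token_budget(seconds, fps, tokens_per_frame):
    """Total frames and token count for a clip at a given tokenization rate."""
    frames = seconds * fps
    return frames, frames * tokens_per_frame

# 60 s at 24 fps, 4 tokens/frame -> (1440, 5760): over a 4k context.
# The same clip at 0.2 tokens/frame -> 288 tokens: fits comfortably.
```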
```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()

def extract_frames(video_path: str, max_frames: int = 16) -> list[str]:
    """Sample frames evenly from a video and return them as base64 JPEG strings."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / max_frames) for i in range(max_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames

def analyze_video(video_path: str, question: str) -> str:
    """Answer a question about a video by sending sampled frames to a vision LLM."""
    frames = extract_frames(video_path, max_frames=16)
    content = [{"type": "text",
                "text": f"These are {len(frames)} evenly-sampled frames from a video. {question}"}]
    for b64 in frames:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                                      "detail": "low"}})
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    ).choices[0].message.content

# Example
answer = analyze_video("demo.mp4", "What is happening in this video? Describe the main action.")
print(answer)
```
FVD (Fréchet Video Distance): I3D (Inflated 3D ConvNet) embeddings of real vs. generated videos. Lower is better. Analogous to FID for images. Gold standard for generation quality.
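The underlying Fréchet distance between two Gaussians is easy to compute in the diagonal-covariance case, shown below to keep the sketch dependency-free; FVD itself uses full covariances of I3D embedding statistics (the full formula adds a matrix square root of the covariance product).

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    Diagonal simplification of the statistic used by FID/FVD."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)
```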
CLIP Similarity: Encode video frames + text with CLIP. Cosine similarity between text embedding and video embeddings. High similarity = good alignment to prompt.
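Scoring prompt alignment reduces to cosine similarity between embeddings. A minimal sketch, assuming the frame and text embeddings already come from the same CLIP model (obtaining them is omitted here):

```python
import numpy as np

def clip_alignment(frame_embs, text_emb):
    """Mean cosine similarity between a text embedding (D,) and
    per-frame embeddings (T, D). Higher = better prompt alignment."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((f @ t).mean())
```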
Inception Score (IS): Inception model predictions on frames. Measures diversity and quality. Less reliable than FVD, but faster.
Human Evaluation Dimensions: visual quality, motion smoothness and coherence, prompt fidelity, temporal consistency. Still necessary because automated metrics miss many artifacts humans notice.
For understanding/captioning: BLEU, CIDEr, METEOR scores (compare generated captions to ground truth). For VQA: accuracy (% of correct answers).