Video generation, understanding, and captioning — Sora, VideoLLaMA, and the temporal modelling challenge
Video Generation: Text description → video frames. "A dog running on a beach" → 16 seconds of video. Conditional on text, or unconditional (pure generation).
Video Understanding/Captioning: Video frames → text description. What's happening? Answer questions about the video. Timestamps of actions.
Action Recognition: Video → label of actions occurring. Classification. "Person jumping" vs "person sitting."
Temporal Reasoning: Understanding causality, order, speed. "Before/after," "faster/slower," multi-step reasoning over frames.
How do we represent video as input to a model?
| Approach | Computation | Memory | Quality | Best For |
|---|---|---|---|---|
| 3D Convolution (C3D) | O(T·H·W·C²) | High | Good | Action recognition (shorter videos) |
| Temporal Attention | O(T²·H·W·C) | Very High | Excellent | Video generation, long-range dependencies |
| Video Tokens | O(T·K²) | Low | Good (with good codec) | Large-scale generation, efficient processing |
| Factored Space-Time | O(T·H·W) + O(T²) | Low | Medium-High | Efficient long-range temporal reasoning |
3D Conv: Convolve over space (H×W) and time (T). Standard but expensive for long videos. Good for short clips.
Temporal Attention: Transformer processes all frames, attends across time. High quality but quadratic in T. Restricted to short videos (4–16 frames typically).
Video Tokens: Compress video with a learned tokenizer (e.g., a VQ-VAE / MAGVIT-style video codec) → discrete tokens, like text. Process with a language model. Scales to longer videos. The video analog of the token-based approach used for audio in MusicGen and AudioLM.
Factored Space-Time: Separate spatial and temporal attention. Space attention per-frame, temporal attention across frames. Good tradeoff: linear in frames but retains some long-range capacity.
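The factored scheme above can be sketched with plain scaled dot-product attention: one pass within each frame, one pass across frames at each spatial location. This is a minimal NumPy illustration of the data movement, not any particular model's implementation; cost is roughly O(T·N²) + O(N·T²) instead of O((T·N)²) for full spatio-temporal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, batched over all leading axes.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d), axis=-1)
    return w @ v

def factored_space_time(x):
    """x: (T, N, C) — T frames, N spatial tokens per frame, C channels.
    Spatial pass attends over N within each frame; temporal pass
    attends over T at each spatial location."""
    x = attention(x, x, x)         # spatial: (T, N, N) weights per frame
    xt = np.swapaxes(x, 0, 1)      # (N, T, C)
    xt = attention(xt, xt, xt)     # temporal: (N, T, T) weights per location
    return np.swapaxes(xt, 0, 1)   # back to (T, N, C)
```

In a real model each pass would use learned Q/K/V projections and residual connections; the factorization itself is the point here.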
Sora (OpenAI): Diffusion transformer (DiT backbone) operating on spacetime patches of compressed latent video. Text conditioning via language embeddings of (re-captioned) prompts; exact details are unpublished. Reportedly generates up to 60 seconds at 1080p.
Mochi (Genmo): Latent diffusion. Encodes video to latent space (much smaller), diffuses in latent space, decodes. ~3× faster than pixel-space diffusion.
CogVideoX (THUDM): Transformer-based video generation. Temporal and spatial attention. Open-source, smaller than Sora but reasonable quality.
| Model | Resolution | Duration | Speed | Quality |
|---|---|---|---|---|
| Sora | Up to 1080p | 60 sec | Slow (API) | Excellent, coherent motion |
| Mochi | 768×768 | ~10 sec | Medium | High quality |
| CogVideoX | 576×1024 | ~6 sec | Medium | Good, open-source |
| Wan 2.1 | 1024×1024 | ~10 sec | Medium | Very high quality, realistic |
Architecture Insight: Most modern video generation models use latent diffusion with a transformer backbone. The differences lie in:
- Compression codec (how much the latent space is reduced)
- Temporal attention mechanism (3D conv vs. factored vs. full)
- Conditioning (text embedding quality, classifier-free guidance)
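The shared recipe — diffuse in a compressed latent space, then decode — can be sketched as a toy DDPM-style sampling loop. This is a schematic only: `denoise_fn` stands in for the transformer backbone, the linear beta schedule is illustrative, and a real system would decode the result with a VAE decoder.

```python
import numpy as np

def sample_latent_video(denoise_fn, shape=(8, 4, 32, 32), steps=50, seed=0):
    """Toy DDPM-style sampler over a video latent of shape (T, C, h, w).
    denoise_fn(x, t) -> predicted noise (any callable; stands in for
    the learned transformer). Linear beta schedule for illustration."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # start from pure noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t)              # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # re-inject noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x  # in a real pipeline: VAE-decode this latent to pixels
```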
VideoLLaMA 2: Multimodal LLM for video. Encodes video frames (sampled uniformly) with vision encoder, processes with LLM. Answer questions about video content, temporal reasoning.
LLaVA-Video: Similar architecture to image LLaVA but adapted for video. Lower frame sampling overhead, hierarchical temporal attention.
Frame Sampling Strategy: Can't process all frames (too expensive). Sample uniformly: 30 fps video → sample every 2nd frame (15 fps). Tradeoff: fewer frames is faster but risks missing fast actions.
For videos > 1 minute, split into chunks, process each chunk, aggregate answers. Or use memory-augmented approach: keep compressed summary of previous chunks, process current chunk.
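The chunk-and-summarize strategy can be expressed as a small skeleton. Here `summarize` and `answer` are hypothetical callables wrapping a video-LLM call — the skeleton only shows the control flow: carry a rolling compressed summary of earlier chunks, then answer from the final chunk plus that summary.

```python
def answer_long_video(frames, question, chunk_size, summarize, answer):
    """Memory-augmented long-video QA skeleton.
    summarize(chunk, prior_summary) -> new summary string
    answer(chunk, context, question) -> answer string
    Both are placeholders for model calls."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    summary = ""
    for chunk in chunks[:-1]:
        summary = summarize(chunk, summary)   # compress earlier content
    return answer(chunks[-1], summary, question)
```

The same shape works for per-chunk answering followed by aggregation; only the two callables change.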
Benchmarks: VideoChatGPT, TVQA (temporal video QA), ActivityNet captions. Measure caption quality (BLEU, CIDEr), question answering accuracy.
Frame Subsampling: Don't process every frame. Sample every N frames (e.g., every 3rd frame in 30fps video). Loss of fine-grained motion, but huge speedup.
Keyframe Extraction: Use optical flow or scene change detection to identify keyframes. Process only keyframes + interpolate. Preserves important moments.
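A cheap stand-in for optical-flow or scene-change detection is mean absolute difference between frames — a sketch of the idea, not a production detector (real systems use optical flow, histogram comparison, or learned detectors). The threshold here is an arbitrary illustrative value.

```python
import numpy as np

def keyframe_indices(frames, threshold=30.0):
    """Select keyframes by mean absolute pixel difference against the
    last kept keyframe. frames: list of (H, W) or (H, W, 3) uint8 arrays."""
    keys = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32)
                      - frames[keys[-1]].astype(np.float32))
        if diff.mean() > threshold:   # scene changed enough: new keyframe
            keys.append(i)
    return keys
```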
Sliding Window Attention: Limit attention to neighboring frames (e.g., attend to ±4 frames). Linear complexity in T instead of quadratic. Slight quality loss, huge speedup.
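The ±k-frame restriction amounts to a banded attention mask. A minimal construction:

```python
import numpy as np

def sliding_window_mask(T, window=4):
    """Boolean (T, T) mask: frame i may attend to frames within ±window.
    Each row has at most 2*window + 1 True entries, so attention cost
    drops from O(T²) to O(T·window)."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

In practice the mask is applied by setting disallowed attention logits to -inf before the softmax.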
Hierarchical Video Representation: coarse level = keyframes grouped into scenes; fine level = full frames. Coarse-to-fine reasoning: answer simple questions from keyframes alone, and expand to full frames only when needed.
A 60-second 1080p video at 24 fps is 1440 frames. At 4 tokens per frame, that's 5760 tokens — if the model's context is 4k tokens, it doesn't fit. Solutions: heavier compression (video codec → ~0.2 tokens/frame), hierarchical processing, or temporal aggregation (merging consecutive frames into one representation).
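The budget arithmetic above is worth making explicit:

```python
def video_token_budget(seconds, fps, tokens_per_frame):
    """Total frames and token count for a clip at a given tokenization rate."""
    frames = seconds * fps
    return frames, frames * tokens_per_frame

# 60 s at 24 fps, 4 tokens/frame -> (1440, 5760): over a 4k context.
# The same clip at 0.2 tokens/frame -> 288 tokens: fits comfortably.
```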
```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()

def extract_frames(video_path: str, max_frames: int = 16) -> list[str]:
    """Sample frames evenly from a video and return them as base64 JPEG strings."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / max_frames) for i in range(max_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames

def analyze_video(video_path: str, question: str) -> str:
    """Answer a question about a video by sending sampled frames to a vision LLM."""
    frames = extract_frames(video_path, max_frames=16)
    content = [{"type": "text",
                "text": f"These are {len(frames)} evenly-sampled frames from a video. {question}"}]
    for b64 in frames:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                                      "detail": "low"}})
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    ).choices[0].message.content

# Example
answer = analyze_video("demo.mp4", "What is happening in this video? Describe the main action.")
print(answer)
```
FVD (Fréchet Video Distance): I3D (Inflated 3D ConvNet) embeddings of real vs. generated videos. Lower is better. Analogous to FID for images. Gold standard for generation quality.
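The underlying Fréchet distance between two Gaussians is easy to compute in the diagonal-covariance case, shown below to keep the sketch dependency-free; FVD itself uses full covariances of I3D embedding statistics (the full formula adds a matrix square root of the covariance product).

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    Diagonal simplification of the statistic used by FID/FVD."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)
```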
CLIP Similarity: Encode video frames + text with CLIP. Cosine similarity between text embedding and video embeddings. High similarity = good alignment to prompt.
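Scoring prompt alignment reduces to cosine similarity between embeddings. A minimal sketch, assuming the frame and text embeddings already come from the same CLIP model (obtaining them is omitted here):

```python
import numpy as np

def clip_alignment(frame_embs, text_emb):
    """Mean cosine similarity between a text embedding (D,) and
    per-frame embeddings (T, D). Higher = better prompt alignment."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((f @ t).mean())
```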
Inception Score (IS): Inception model predictions on frames. Measures diversity and quality. Less reliable than FVD, but faster.
Human Evaluation Dimensions: visual quality, motion smoothness and coherence, prompt fidelity, temporal consistency. Still necessary because automated metrics miss many artifacts humans notice.
For understanding/captioning: BLEU, CIDEr, METEOR scores (compare generated captions to ground truth). For VQA: accuracy (% of correct answers).