All terms
Modalities
Text-to-video
Also known as: T2V, video generation, AI video
Generative AI that turns a text prompt (or an image) into a short video clip. The leading systems in 2026 are Sora, Runway, Pika, Veo, and Kling.
What it means
Text-to-video does for moving images what text-to-image did for stills: you describe a scene, and the model synthesizes a clip. As of 2026 the main players are OpenAI's Sora 2, Google's Veo 3, Runway Gen-4, Pika, and Kuaishou's Kling. Clip lengths have crept up — Sora 2 and Veo 3 can produce coherent 60-second shots at 1080p, with reasonably stable characters and camera moves. A few years ago two seconds was a triumph.
Architecturally these are mostly diffusion transformers (DiTs) that operate on space-time latents: they compress the video into a 3D latent grid and denoise it as a single tensor rather than generating frames independently, which is how they get temporal coherence. Audio is handled by a separate but tightly coupled model (Veo 3, for example, generates dialogue and sound effects synced to the picture).
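The space-time-latent idea can be sketched with array shapes alone. The downsampling factors, channel count, and patch sizes below are illustrative assumptions, not any real model's configuration:

```python
import numpy as np

# Illustrative numbers only: the 4x temporal / 8x spatial downsampling,
# 16 latent channels, and 2x3x3 patch size are assumptions for the sketch.
frames, height, width = 240, 1080, 1920  # a 10 s, 24 fps, 1080p clip

# A video VAE compresses pixels into a 3D space-time latent grid.
latent = np.zeros((frames // 4, height // 8, width // 8, 16), dtype=np.float32)

# A diffusion transformer then cuts the grid into space-time patches and
# flattens them into ONE token sequence covering the whole clip; denoising
# that sequence jointly (attention across space and time at once) is what
# keeps frames coherent, unlike generating each frame independently.
t, h, w, c = latent.shape
pt, ph, pw = 2, 3, 3  # assumed patch size (time, height, width)
tokens = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
tokens = tokens.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, pt * ph * pw * c)

print(tokens.shape)  # one token sequence for the entire clip
```

Note that the clip never exists as independent frames inside the transformer: every denoising step updates the full (time, height, width) grid at once.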
Text-to-video lags text-to-image for three reasons. First, video data is orders of magnitude larger and harder to label; there is no LAION-5B equivalent for video. Second, the compute cost is brutal: a 10-second clip is roughly 240 frames, all of which the model must denoise while keeping them mutually coherent. Third, the failure modes are more visible: a slightly weird hand in a still image is forgivable, but fingers that morph at 24 frames per second are not. Expect cost per clip to keep falling fast (a 5-second Sora clip is about $0.50 in 2026, down from $5 a year ago) and clip lengths to keep climbing.
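The compute gap can be made concrete with back-of-envelope arithmetic. The frame count comes from the section's own numbers; the token count per frame and the quadratic-attention scaling are assumptions for the sketch, not measured figures:

```python
# Back-of-envelope scale of a video clip vs a single still image.
fps, seconds = 24, 10
frames = fps * seconds  # 240 frames in a 10-second clip

# Even if each frame cost the same as one still, the clip is ~240x
# a single text-to-image generation.
naive_clip_cost = frames  # in units of "one still image"

# But joint denoising attends across all frames at once, and attention
# cost grows roughly with the square of the token count (assumed here),
# so the clip-to-still attention ratio is frames squared, not frames.
tokens_per_frame = 450  # assumed, purely illustrative
image_attn = tokens_per_frame ** 2
video_attn = (tokens_per_frame * frames) ** 2

print(frames, video_attn // image_attn)
```

This is why long coherent clips arrived years after high-quality stills: the cost of joint space-time attention grows much faster than the clip length itself.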
Example
A marketer prompts Runway with "drone shot pulling back from a couple hiking on a misty mountain ridge at sunrise, cinematic" and gets a 10-second 1080p clip in about 90 seconds for $1.20.
Why it matters
Video is the format with the highest commercial value (advertising, film, social) and the steepest remaining quality gap. Tracking text-to-video progress is the best leading indicator of where generative media is going overall. In a single year, Sora-quality 60-second clips changed what's possible for indie filmmakers and ad agencies.