All terms
Modalities
Text-to-video
Also known as: T2V, video generation, AI video
Generative AI that turns a text prompt (or an image) into a short video clip. The leading systems in 2026 are Sora, Runway, Pika, Veo, and Kling.
What it means
Text-to-video does for moving images what text-to-image did for stills: you describe a scene, and the model synthesizes a clip. As of 2026 the main players are OpenAI's Sora 2, Google's Veo 3, Runway Gen-4, Pika, and Kuaishou's Kling. Clip lengths have crept up — Sora 2 and Veo 3 can produce coherent 60-second shots at 1080p, with reasonably stable characters and camera moves. A few years ago two seconds was a triumph.
Architecturally these are mostly diffusion transformers (DiTs) that operate on space-time latents: they compress the video into a 3D latent grid and denoise it as a single tensor rather than generating frames independently, which is how they get temporal coherence. Audio is handled by a separate but tightly coupled model (Veo 3, for example, generates dialogue and sound effects synced to the picture).
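The space-time-latent idea can be sketched with array shapes alone. The downsampling factors, channel count, and patch sizes below are illustrative assumptions, not any real model's configuration:

```python
import numpy as np

# Illustrative numbers only: the 4x temporal / 8x spatial downsampling,
# 16 latent channels, and 2x3x3 patch size are assumptions for the sketch.
frames, height, width = 240, 1080, 1920  # a 10 s, 24 fps, 1080p clip

# A video VAE compresses pixels into a 3D space-time latent grid.
latent = np.zeros((frames // 4, height // 8, width // 8, 16), dtype=np.float32)

# A diffusion transformer then cuts the grid into space-time patches and
# flattens them into ONE token sequence covering the whole clip; denoising
# that sequence jointly (attention across space and time at once) is what
# keeps frames coherent, unlike generating each frame independently.
t, h, w, c = latent.shape
pt, ph, pw = 2, 3, 3  # assumed patch size (time, height, width)
tokens = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
tokens = tokens.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, pt * ph * pw * c)

print(tokens.shape)  # one token sequence for the entire clip
```

Note that the clip never exists as independent frames inside the transformer: every denoising step updates the full (time, height, width) grid at once.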
Text-to-video lags text-to-image for three reasons. First, video data is orders of magnitude larger and harder to label; there is no LAION-5B equivalent for video. Second, the compute cost is brutal: a 10-second clip is roughly 240 frames, all of which the model must denoise while keeping them mutually coherent. Third, the failure modes are more visible: a slightly weird hand in a still image is forgivable, but fingers that morph at 24 frames per second are not. Expect cost per clip to keep falling fast (a 5-second Sora clip is about $0.50 in 2026, down from $5 a year ago) and clip lengths to keep climbing.
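The compute gap can be made concrete with back-of-envelope arithmetic. The frame count comes from the section's own numbers; the token count per frame and the quadratic-attention scaling are assumptions for the sketch, not measured figures:

```python
# Back-of-envelope scale of a video clip vs a single still image.
fps, seconds = 24, 10
frames = fps * seconds  # 240 frames in a 10-second clip

# Even if each frame cost the same as one still, the clip is ~240x
# a single text-to-image generation.
naive_clip_cost = frames  # in units of "one still image"

# But joint denoising attends across all frames at once, and attention
# cost grows roughly with the square of the token count (assumed here),
# so the clip-to-still attention ratio is frames squared, not frames.
tokens_per_frame = 450  # assumed, purely illustrative
image_attn = tokens_per_frame ** 2
video_attn = (tokens_per_frame * frames) ** 2

print(frames, video_attn // image_attn)
```

This is why long coherent clips arrived years after high-quality stills: the cost of joint space-time attention grows much faster than the clip length itself.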
Example
A marketer prompts Runway with "drone shot pulling back from a couple hiking on a misty mountain ridge at sunrise, cinematic" and gets a 10-second 1080p clip in about 90 seconds for $1.20.
Why it matters
Video is the format with the highest commercial value (advertising, film, social) and the steepest remaining quality gap. Tracking text-to-video progress is the best leading indicator of where generative media is going overall. In a single year, Sora-quality 60-second clips changed what's possible for indie filmmakers and ad agencies.