Diffusion model

Also known as: denoising diffusion model, DDPM, latent diffusion model

A generative model that learns to reverse a noising process. Start with pure noise, denoise it step by step, end up with an image. The mechanism behind Stable Diffusion, Flux, Midjourney, and most modern image generators.

What it means

Diffusion models generate by denoising. Training works in two phases: in the forward process, you take a real image and add Gaussian noise to it in many small steps until it's pure static. In the reverse process, you train a neural network (usually a U-Net or a diffusion Transformer) to predict the noise that was added at each step. To generate, you start from pure random noise and run the network in reverse, removing predicted noise step by step, until a coherent image emerges.

This is fundamentally different from how LLMs work. LLMs are autoregressive: they generate one token at a time, left to right, and each token is conditioned on the previous ones. Diffusion models generate the whole image at once but iteratively refine it over many denoising steps (originally 1000; modern samplers do it in 4-50).

Text conditioning gets injected at each step via cross-attention to a text encoder like CLIP or T5, which is why prompts work. Stable Diffusion, Flux, Midjourney, DALL-E 3, Imagen, and Ideogram are all diffusion models (or close variants). Sora and Veo are video diffusion models: same idea, but operating on noisy video volumes.

Modern image diffusion mostly happens in a "latent space" produced by a VAE, which is much cheaper than denoising raw pixels. That's the trick that made Stable Diffusion run on consumer GPUs.

There's recent crossover: diffusion language models (Mercury, LLaDA) apply this paradigm to text generation and claim faster sampling than autoregressive LLMs. They're interesting but haven't displaced standard LLMs yet. And new "flow matching" approaches (used in Flux and SD3) generalize diffusion with cleaner math. The space is moving fast.
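To make the two training phases concrete, here is a minimal sketch of one noise-prediction training step in PyTorch. The linear schedule, the shapes, and the `model` call are illustrative assumptions (a stand-in for any U-Net or diffusion Transformer), not a specific library's API:

```python
import torch
import torch.nn.functional as F

T = 1000                                # step count from the original DDPM setup
betas = torch.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0):
    """One training step. x0: a batch of clean images, shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # a random timestep per image
    noise = torch.randn_like(x0)                  # the Gaussian noise to add
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    # Forward process in closed form: jump straight to the noised image at step t
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The network learns to predict exactly the noise that was added
    return F.mse_loss(model(x_t, t), noise)
```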
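Generation runs that process backwards. Here is a sketch of plain DDPM ancestral sampling under the same assumed schedule; `model` is again a stand-in, and modern samplers (DDIM, DPM-Solver) reach the 4-50 step range with smarter update rules rather than this full 1000-step loop:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # same assumed schedule as the training sketch
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    """Reverse process: pure noise in, image out."""
    x = torch.randn(shape)                            # start from pure static
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))    # predicted noise at step t
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Remove the predicted noise contribution (the DDPM posterior mean)
        x = (x - (1.0 - a) / (1.0 - a_bar).sqrt() * eps) / a.sqrt()
        if t > 0:                                     # re-inject a little noise,
            x = x + betas[t].sqrt() * torch.randn_like(x)  # except at the final step
    return x
```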

Example

When you prompt Flux for 'a cyberpunk cat in neon rain,' it starts from random noise in the latent space, runs ~25 denoising steps guided by your text embedding, then decodes the final latent into pixels. The whole image emerges gradually, all at once — not left to right.
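In code, that whole pipeline is a few lines. A sketch using Hugging Face's diffusers library, assuming its FluxPipeline and the FLUX.1-dev checkpoint; the model ID, step count, and guidance defaults may differ across versions:

```python
import torch
from diffusers import FluxPipeline

# Assumed checkpoint; needs a diffusers release with Flux support and a large GPU
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a cyberpunk cat in neon rain",
    num_inference_steps=25,    # the ~25 denoising steps, run in latent space
    guidance_scale=3.5,
).images[0]                    # the final latent is decoded to pixels by the VAE
image.save("cyberpunk_cat.png")
```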

Why it matters

Diffusion is why image and video generation works at all. The mechanics are different enough from LLMs that the same intuitions don't transfer — prompt engineering for Midjourney is a different skill from prompt engineering for ChatGPT, and capabilities like editing, inpainting, and ControlNet only make sense once you understand the iterative denoising process.
