Latent space

Also known as: embedding space, latent representation

The compressed numerical representation that diffusion models actually work in: not pixels, but a smaller learned embedding. It's the "latent" in latent diffusion, the architecture behind Stable Diffusion.

What it means

A latent space is a compressed, learned representation of an image (or any signal). Instead of working with a 1024x1024x3 RGB tensor (over 3 million numbers), a latent diffusion model encodes the image into something like 128x128x4 (~65,000 numbers) using a variational autoencoder (VAE). Diffusion happens in that compressed space, and only the final result is decoded back to pixels. The same idea applies to video, where a 3D VAE compresses across time as well.

This is the entire reason modern image and video generation is fast and affordable. Pixel-space diffusion (denoising directly on the full-resolution image) is brutally expensive: you're running 30-50 forward passes of a giant U-Net over millions of values. Latent diffusion does the same denoising on a representation that's 48x smaller, with most of the perceptually important structure preserved. A 1024x1024 image takes 2-3 seconds on a consumer GPU instead of minutes. Stable Diffusion is built directly on this idea: it implements the latent diffusion architecture from the 2022 paper "High-Resolution Image Synthesis with Latent Diffusion Models".

The latent space also has useful properties beyond speed. You can interpolate between two images by interpolating their latents and decoding, which is how morph videos work. Img2img works by encoding your reference image to a latent, partially noising it, and denoising back; the noise level controls the trade-off between creativity and faithfulness (both steps are sketched in the code below). Inpainting masks operate in latent space too. Most of the "creative tricks" in the diffusion world are really tricks for navigating this latent space, not pixel space.
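To make the shapes concrete, here is a minimal encode/decode round trip using the AutoencoderKL class from Hugging Face's diffusers library. The checkpoint choice and the random stand-in image are assumptions for illustration; any SD-family VAE behaves the same way:

```python
import torch
from diffusers import AutoencoderKL

# Load a Stable Diffusion VAE (checkpoint is illustrative; SD-family
# VAEs all do 8x spatial downsampling into 4 latent channels).
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae.eval()

# Stand-in for a real 1024x1024 RGB image, scaled to [-1, 1] as the VAE expects.
pixels = torch.rand(1, 3, 1024, 1024) * 2 - 1

with torch.no_grad():
    # Encode: 1x3x1024x1024 -> 1x4x128x128, a 48x reduction.
    latents = vae.encode(pixels).latent_dist.sample()
    # Decode back to pixel space.
    reconstruction = vae.decode(latents).sample

print(pixels.shape, pixels.numel())    # torch.Size([1, 3, 1024, 1024]) 3145728
print(latents.shape, latents.numel())  # torch.Size([1, 4, 128, 128]) 65536
print(reconstruction.shape)            # torch.Size([1, 3, 1024, 1024])
```

Real pipelines also multiply the latent by the VAE's scaling factor (vae.config.scaling_factor) before handing it to the U-Net, so the values match the scale the denoiser was trained on.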

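The img2img trick described above is, at its core, a single scheduler call: push the encoded latent partway into the noise schedule, then denoise from there. A minimal sketch using diffusers' DDPMScheduler.add_noise; the strength value and the latent tensor are illustrative (in a real pipeline the latent comes from the VAE encoder, as in the sketch above):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

# Pretend this came from vae.encode(reference_image).
latents = torch.randn(1, 4, 128, 128)
noise = torch.randn_like(latents)

# Low strength stays faithful to the reference; high strength mostly
# replaces it. Strength picks how deep into the noise schedule we start.
strength = 0.6
t = torch.tensor([int(scheduler.config.num_train_timesteps * strength)])

# Partially noise the latent; denoising then runs from step t back to 0.
noisy_latents = scheduler.add_noise(latents, noise, t)
```
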
Example

When Stable Diffusion generates a 1024x1024 image, the diffusion U-Net is actually denoising a 128x128x4 latent tensor for ~25 steps; only the final tensor is decoded to a full-resolution RGB image by the VAE.
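Here is what that loop looks like mechanically, using a diffusers scheduler. The small conv layer is an explicit stand-in for the real text-conditioned U-Net, so the output is meaningless, but the tensor shapes and step count match a real pipeline:

```python
import torch
import torch.nn as nn
from diffusers import DDIMScheduler

# Stand-in denoiser: the real model is a large U-Net conditioned on the prompt.
fake_unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(25)  # the "~25 steps" from the example above

# Start from pure Gaussian noise in latent space: 1x4x128x128, never 1x3x1024x1024.
latents = torch.randn(1, 4, 128, 128)

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = fake_unet(latents)  # real: unet(latents, t, prompt_embeddings)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

print(latents.shape)  # torch.Size([1, 4, 128, 128]); vae.decode() turns this into pixels
```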

Why it matters

Understanding latent space is what separates copy-paste prompt users from people who actually understand why img2img, inpainting, ControlNet, and LoRAs work the way they do. It's also why video generation became viable at all — pixel-space video diffusion would still be a research curiosity in 2026.
