ControlNet

Also known as: CN, structural conditioning

A Stable Diffusion technique that lets you condition image generation on a structural input — a depth map, pose skeleton, edge map, or scribble — alongside the text prompt.

What it means

ControlNet, introduced in 2023 by Lvmin Zhang and collaborators, is the reason serious image work happens in the Stable Diffusion / Flux ecosystem rather than in Midjourney. It bolts an extra conditioning channel onto a diffusion model: instead of only telling the model "a knight on a horse," you also feed it a pose skeleton, a Canny edge map of a reference photo, or a depth map from a 3D render, and the generated image will respect that structure.

The trick is architectural. ControlNet clones the encoder half of the base U-Net, freezes the original weights, and trains the clone on (image, structural-input) pairs. At inference, the clone's outputs are added back into the frozen base through "zero convolutions": layers initialized to zero, so the combined model starts out behaving exactly like the base and learns to deviate only where the structural signal demands it (see the sketch below). Because the base stays frozen, a single base model can gain dozens of independently trained ControlNets (pose, depth, normal map, segmentation, scribble, line art, MLSD for architecture, and so on) without retraining the base.

In practice, this is what lets you generate the *same character in the same pose* across 20 frames, composite a generated subject into a real photograph's geometry, or turn a rough doodle into a polished illustration. ControlNet only exists in the open ecosystem (Stable Diffusion, SDXL, Flux). Midjourney is closed: you can use image references and style refs, but you can't bolt on depth-map conditioning the way you can in ComfyUI. That's the single biggest reason VFX studios, game artists, and product designers use SD/Flux pipelines instead.
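To make the zero-convolution idea concrete, here is a minimal PyTorch sketch of a single encoder stage. It is an illustration under simplifying assumptions, not the paper's code: `ZeroConv` and `ControlNetBlock` are hypothetical names, the real model repeats this pattern across every encoder stage of the U-Net, and the control hint first passes through a small convolutional encoder of its own.

```python
import copy
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """1x1 convolution whose weights and bias start at zero, so the
    trainable branch contributes nothing until training moves it."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

class ControlNetBlock(nn.Module):
    """One simplified encoder stage: a frozen base block plus a trainable
    clone whose output re-enters the main path through a zero conv."""
    def __init__(self, base_block: nn.Module, channels: int):
        super().__init__()
        self.clone = copy.deepcopy(base_block)    # trainable copy
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze original weights
        self.zero_in = ZeroConv(channels)         # injects the control hint
        self.zero_out = ZeroConv(channels)        # gates the clone's output

    def forward(self, x: torch.Tensor, hint: torch.Tensor) -> torch.Tensor:
        h = self.base(x)                          # frozen base path
        c = self.clone(x + self.zero_in(hint))    # clone sees x + hint
        return h + self.zero_out(c)               # equals h exactly at init

# At initialization the block reproduces the base model bit-for-bit:
block = ControlNetBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
x, hint = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
assert torch.allclose(block(x, hint), block.base(x))
```

The zero initialization is the point of the design: at step zero the ControlNet cannot degrade the base model's output at all, and training can only learn useful deviations from it.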

Example

An illustrator imports a posed 3D mannequin into ComfyUI, runs depth + OpenPose ControlNets on it, and generates 30 variations of "warrior queen, ornate armor", all locked to the exact pose of the mannequin. A sketch of the same workflow in code follows.
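The same workflow can be scripted outside ComfyUI. Below is a hedged sketch using the Hugging Face diffusers library, which supports stacking multiple ControlNets on one pipeline. The checkpoint ids shown are the classic lllyasviel SD 1.5 ControlNets, the input paths are placeholders for the mannequin renders, and exact ids and availability may vary.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder inputs: renders of the posed 3D mannequin.
pose_image = Image.open("mannequin_openpose.png")   # pose skeleton render
depth_image = Image.open("mannequin_depth.png")     # depth map render

# Stack two ControlNets: OpenPose locks the pose, depth locks the geometry.
controlnets = [
    ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# 30 variations, each from a different seed, all in the mannequin's pose.
for seed in range(30):
    image = pipe(
        "warrior queen, ornate armor",
        image=[pose_image, depth_image],
        controlnet_conditioning_scale=[1.0, 0.8],  # weight each ControlNet
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    image.save(f"warrior_queen_{seed:02d}.png")
```

The per-ControlNet conditioning scales are the practical dial here: dropping the depth weight below the pose weight keeps the silhouette locked while giving the model more freedom in surface detail.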

Why it matters

ControlNet is what separates 'rolling the dice on a prompt' from actually directing an image model. If you need consistency, composition control, or to integrate AI generation into an existing pipeline (3D, photo, video), ControlNet is non-negotiable — and it's the strongest argument for working in the SD/Flux ecosystem instead of Midjourney.
