Mixture of Experts (MoE)
Also known as: MoE, sparse mixture of experts, sparse model
A sparse Transformer variant that routes each token to only a few "expert" sub-networks instead of running the whole model. This lets total parameter count scale far beyond what any single token actually costs to compute.
What it means
In a dense Transformer, every parameter fires for every token. MoE replaces the feed-forward layer in each block with a set of "experts" — typically 8 to 256 — and a small router that decides which 1-2 experts handle each token. Only the activated experts do work. So a model can have hundreds of billions of total parameters but use only a fraction of them per token.
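The routing idea is small in code. Below is a minimal sketch assuming a PyTorch setting; the class name, sizes, and the per-expert loop are illustrative (real implementations batch tokens per expert and run them in parallel), not how any particular model does it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Replaces a Transformer block's feed-forward layer with routed experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router is tiny: one linear score per expert.
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model), a flat batch of token representations
        scores = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # each token keeps top_k experts
        weights = F.softmax(weights, dim=-1)            # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                hit = idx[:, k] == e                    # tokens whose slot k chose expert e
                if hit.any():                           # only chosen experts do any work
                    out[hit] += weights[hit, k].unsqueeze(-1) * expert(x[hit])
        return out
```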
DeepSeek-V3 is the canonical 2025 example: 671B total parameters, but only 37B activate per token. Mixtral 8x7B has 47B total but routes to 2 of 8 experts, so ~13B are active. This is why MoE feels like a free lunch — you get the knowledge capacity of a huge model at the inference cost of a much smaller one. GPT-4, Gemini, and most frontier models in 2025-2026 are believed to be MoE, though most labs don't confirm details.
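The arithmetic behind those "active" numbers is simple: shared parameters (attention, embeddings, the router) run for every token, while only top_k of the expert blocks do. A rough sketch, with the shared/expert split back-solved to land near Mixtral's published figures rather than taken from the actual architecture:

```python
def moe_param_counts(shared: float, expert: float, n_experts: int, top_k: int):
    """Total vs. per-token active parameter counts for a simple MoE.

    shared -- parameters every token uses (attention, embeddings, router)
    expert -- parameters in one expert's feed-forward stack
    """
    total = shared + n_experts * expert
    active = shared + top_k * expert
    return total, active

# Illustrative split, chosen to roughly match Mixtral 8x7B's published totals.
total, active = moe_param_counts(shared=1.6e9, expert=5.6e9, n_experts=8, top_k=2)
print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.0f}B")  # ~46B, ~13B
```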
The catch is real, though. MoE models are harder to train (load balancing across experts is tricky, and you can end up with "dead" experts the router never picks), harder to serve efficiently (every expert has to stay in GPU memory even though most sit idle for any given token), and benchmarks suggest that at the same active parameter count, a dense model is still slightly stronger. MoE wins on cost, not necessarily on raw quality at fixed compute.
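The standard mitigation for dead experts is an auxiliary load-balancing loss added during training. The sketch below follows the form popularized by the Switch Transformer paper; it's representative of the technique, not a claim about what any specific model uses:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Switch-style auxiliary loss; router_logits: (n_tokens, n_experts)."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)       # router's soft preferences
    _, idx = router_logits.topk(top_k, dim=-1)     # hard top-k assignments
    # Fraction of routing slots actually dispatched to each expert.
    dispatch = F.one_hot(idx, n_experts).float().sum(dim=(0, 1)) / idx.numel()
    # Mean router probability assigned to each expert.
    importance = probs.mean(dim=0)
    # Minimized (value 1.0) when both are uniform, i.e. no expert starves.
    return n_experts * torch.dot(dispatch, importance)
```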
The other tradeoff: VRAM. A 671B MoE doesn't fit on a single GPU even though it's "only" computing 37B per token. That's why DeepSeek-V3 needs a multi-GPU setup or aggressive quantization to run locally — the full expert pool has to be resident somewhere.
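The memory math makes the point concrete. A quick sketch, assuming 80 GB cards and counting weights only (no KV cache or activations):

```python
import math

TOTAL_PARAMS = 671e9   # every expert must be resident, not just the 37B active
GPU_GB = 80            # an A100/H100-class card

for fmt, bytes_per_param in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{fmt}: ~{weights_gb:,.0f} GB of weights -> at least "
          f"{math.ceil(weights_gb / GPU_GB)} x {GPU_GB} GB GPUs")
```

Even at 4-bit quantization, the weights alone span several 80 GB GPUs, which is exactly the "full expert pool has to be resident somewhere" problem.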
Example
DeepSeek-V3 has 671B total parameters but only 37B active per token, which is why it competes with GPT-4-class models while being dramatically cheaper to serve.
Why it matters
MoE is why open-weight models suddenly got competitive with frontier closed models in late 2024. It's also why API prices keep falling — providers can scale capability with sparse activation instead of pure dense scaling.