
Quantization

Also known as: weight quantization, INT8 / INT4 quantization, low-bit inference

Reducing the numerical precision of a model's weights — for example FP16 down to INT8 or INT4 — so it runs on smaller, cheaper hardware.

What it means

Modern model weights are usually stored as 16-bit floating point numbers (FP16 or BF16). Quantization replaces those with smaller integer representations: 8-bit, 4-bit, sometimes even 2-bit. The memory math is simple: weight bytes ≈ parameter count × bits per weight ÷ 8. A 70B model in FP16 needs 140GB of VRAM; quantized to INT4 it needs about 35GB, small enough to fit on a single 48GB GPU.

The model's behavior shifts slightly because every weight gets rounded, but well-designed schemes (GPTQ, AWQ, GGUF's Q4_K_M, FP8) lose surprisingly little quality. The tradeoff is real but rarely dramatic. INT8 quantization typically loses 1-3% on benchmarks. INT4 loses more, sometimes 3-7%, and quality drops faster on smaller models: a quantized 7B feels noticeably dumber, while a quantized 70B is barely distinguishable from the original. Newer techniques (FP8 training, calibration-aware quantization, QAT) keep narrowing the gap.

This is the technology that makes open-weights models actually usable. DeepSeek-V3 at 671B parameters is theoretically too big for almost anyone, but quantized to 4 bits it runs on a $10K Mac Studio. Llama 3.1 405B in FP16 needs two full 8-GPU H100 nodes; at INT4 the weights fit on three H100s. Quantization is also baked into how frontier APIs serve cheaper tiers: offerings like GPT-5 mini and Gemini Flash are partly small models, and likely partly quantized large models running on cheaper hardware.
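
To make the mechanics concrete, here is a minimal sketch of naive symmetric per-tensor INT8 quantization, plus the back-of-envelope memory arithmetic behind the 140GB/35GB figures above. Production schemes like GPTQ and AWQ use per-channel or per-group scales and calibration data, so treat this as the toy version.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Naive symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Inference kernels use q directly; dequantizing here just exposes the rounding error.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # one typical weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"max rounding error: {np.abs(w - w_hat).max():.6f}")
print(f"stored in FP16: {w.size * 2 / 1e6:.0f} MB, in INT8: {w.size / 1e6:.0f} MB")

# Weight-only memory for a 70B-parameter model (ignores KV cache and activations):
for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {fmt}: {70e9 * bits / 8 / 1e9:.0f} GB")  # prints 140, 70, 35
```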

Example

A developer downloads DeepSeek-V3 from Hugging Face, picks the Q4_K_M GGUF files (about 380GB instead of the roughly 1.4TB of the BF16 original), and runs it locally on a Mac Studio with 512GB of unified memory at around 20 tokens per second.
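
The same workflow at laptop scale, sketched with huggingface_hub and llama-cpp-python. The repo and file names below point at a commonly mirrored 7B GGUF rather than the multi-part DeepSeek-V3 files; they illustrate the pattern, not the exact artifacts from the example.

```python
# pip install huggingface_hub llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch one Q4_K_M quantization of a 7B model (illustrative repo/filename).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
)

# n_gpu_layers=-1 offloads every layer to the GPU if VRAM allows.
llm = Llama(model_path=model_path, n_gpu_layers=-1, n_ctx=4096)
out = llm("Quantization means", max_tokens=32)
print(out["choices"][0]["text"])
```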

Why it matters

Quantization is the single biggest reason 'I can run a frontier-class model on my own hardware' went from fantasy to reality between 2023 and 2026. If you care about running open models locally, choosing a serving stack, or estimating GPU costs, quantization tradeoffs are unavoidable. It's also why benchmark numbers vary between hosted providers — different quantizations, different scores.
