
Self-hosting

Also known as: running open-weight models, on-prem inference

Running open-weight models on your own infrastructure instead of calling a hosted API.

What it means

Self-hosting means downloading model weights (Llama 4, DeepSeek V3, Qwen 3, Mistral Large 3, etc.) and serving them yourself on GPUs you rent or own. The runtime is usually vLLM, SGLang, TensorRT-LLM, or Hugging Face TGI; the hardware is usually H100s, H200s, or MI300Xs, or, for smaller models, consumer GPUs and Apple Silicon.

The pitch is real but narrower than people think. Self-hosting wins on three axes: cost at high steady volume (a saturated H100 cluster running Llama 4 70B beats per-token API pricing once you cross roughly 10M tokens/day), data residency (regulated industries that can't send data to third-party APIs), and customization (LoRA fine-tunes, custom sampling, and inference-time tricks the API doesn't expose). Privacy matters too, but it is increasingly covered by enterprise API agreements with no-training, no-retention clauses.

For most teams, the pitch breaks down. Self-hosting a 70B-class model at production quality and reliability costs more than it looks: H100 hourly rates ($2-4 spot, $4-8 on-demand), plus ops headcount, plus a capacity buffer for traffic variance, plus eval pipelines, plus the salary of someone who knows vLLM tuning. Meanwhile, hosted API pricing has dropped fast; DeepSeek V3 on Together at roughly $0.30/$0.30 per MTok undercuts anything you can self-host below about 5M tokens/day. For most teams, self-hosting is the wrong default: start with an API and switch only when unit economics or compliance force the move.
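The cost comparison above can be sketched as a back-of-the-envelope calculation. All prices, the ops overhead, and the blended per-token rate below are illustrative assumptions, not quotes; the point is the shape of the math, not the specific numbers.

```python
# Back-of-the-envelope breakeven sketch: self-hosted GPU cluster vs. hosted API.
# Every number here is an illustrative assumption, not a quoted price.

HOURS_PER_MONTH = 730  # average hours in a month


def selfhost_monthly_cost(num_gpus: int, gpu_hourly_usd: float,
                          ops_monthly_usd: float) -> float:
    """Fully-loaded monthly cost: GPU rental plus ops/headcount overhead."""
    return num_gpus * gpu_hourly_usd * HOURS_PER_MONTH + ops_monthly_usd


def api_monthly_cost(tokens_per_day: float, usd_per_mtok: float) -> float:
    """Hosted-API cost at a blended input+output price per million tokens."""
    return tokens_per_day / 1e6 * usd_per_mtok * 30


def breakeven_tokens_per_day(cluster_monthly_usd: float,
                             usd_per_mtok: float) -> float:
    """Daily token volume at which the cluster and the API cost the same."""
    return cluster_monthly_usd / (usd_per_mtok * 30) * 1e6


# Hypothetical 8x H100 on-demand cluster with a lump-sum ops overhead.
cluster = selfhost_monthly_cost(num_gpus=8, gpu_hourly_usd=4.0,
                                ops_monthly_usd=15_000)
print(f"cluster: ${cluster:,.0f}/month")
print(f"breakeven vs a $5/MTok blended API rate: "
      f"{breakeven_tokens_per_day(cluster, 5.0):,.0f} tokens/day")
```

Note how sensitive the breakeven volume is to the API's per-token price: as hosted prices fall, the daily volume needed to justify a cluster rises proportionally, which is why cheap open-weight API hosts shift the calculus so sharply.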

Example

A healthcare company can't send PHI to OpenAI under HIPAA constraints. They self-host Llama 4 70B on 8x H100s in their own VPC, achieving ~5k requests/min at a fully-loaded cost of ~$25k/month. At their volume that's breakeven with API pricing, but the compliance constraint was non-negotiable.
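A deployment like the one above might be launched with vLLM's OpenAI-compatible server. This is a minimal sketch; the model ID and flag values are placeholders to adapt to your model and vLLM version, not a vetted production config.

```shell
# Illustrative vLLM launch for a 70B-class model sharded across 8 GPUs.
# Model ID and flag values are placeholders; check the vLLM docs for your version.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

Tensor parallelism splits each layer's weights across the 8 GPUs so a model too large for one card can serve from a single node; production setups layer a load balancer, autoscaling, and monitoring on top.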

Why it matters

Self-hosting is the answer to "what do you do when API providers change pricing, deprecate models, or can't legally serve your data?" For most companies it's strategic insurance more than cost optimization. The skill matters even if you don't exercise it today, because open-weight quality keeps closing the gap with frontier closed models.
