Inference provider
Also known as: serverless inference, open-weight API, hosted open models
A third party that hosts open-weight models behind an API so you don't have to run GPUs yourself. Often cheaper than self-hosting at small or medium scale.
What it means
Inference providers are the missing middle between calling a closed API and standing up your own GPU cluster. They host open-weight models (Llama 4, DeepSeek V3, Qwen 3, Mistral, plus dozens of fine-tunes) on shared infrastructure and expose them behind OpenAI-compatible APIs. You get the price/openness benefits of open-weight models without the ops burden.
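Because the endpoints are OpenAI-compatible, calling one usually means pointing the standard openai client at a different base URL. A minimal sketch, assuming Together AI's endpoint; the model id is illustrative, so check the provider's model catalog for exact names:

    # Minimal sketch: the same openai client, pointed at a provider's
    # OpenAI-compatible endpoint. Base URL and model id are illustrative;
    # check the provider's docs and model catalog for exact values.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",   # provider endpoint, not OpenAI's
        api_key="YOUR_PROVIDER_API_KEY",
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # provider-specific model id
        messages=[{"role": "user", "content": "Summarize: ..."}],
    )
    print(response.choices[0].message.content)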
Major players in 2026: Together AI (broad model selection, fast deploys), Fireworks AI (low latency, custom-deployed models), Replicate (also handles image/video/audio models), Anyscale (Ray-native, enterprise focus), Groq (LPU hardware, ridiculously fast for supported models), Cerebras (wafer-scale, also fast), Lambda, RunPod, plus DeepInfra and OctoAI. Aggregators like OpenRouter, Portkey, and LiteLLM proxy across many of these so you can swap providers without code changes.
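A sketch of what that swap looks like through LiteLLM: the call shape stays fixed and only the model string (provider prefix plus model id) changes. Model ids here are illustrative, and provider API keys are assumed to be set as environment variables.

    from litellm import completion

    def summarize(text: str, model: str) -> str:
        # Same OpenAI-style call regardless of which provider serves it.
        response = completion(
            model=model,
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
        )
        return response.choices[0].message.content

    doc = "..."  # your document text
    summarize(doc, model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo")
    summarize(doc, model="groq/llama-3.3-70b-versatile")  # swap hosts, same code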
Trade-offs vs other options: cheaper than closed APIs (DeepSeek V3 on Together at ~$0.30/$0.30 vs Claude Sonnet at $3/$15), but quality ceilings are open-weight ceilings — top open models lag the frontier closed models by 6-12 months on hard reasoning tasks. Cheaper than self-hosting for most teams (no idle GPU cost, no ops), but rate limits and tail latency vary by provider since you share capacity. The sweet spot is high-volume, latency-tolerant workloads where the cost gap matters and Llama-tier quality is enough — RAG over docs, classification, content generation at scale, evaluation grading.
Example
A doc-summary product runs 50M tokens/day, about 1.5B tokens a month. On Claude Sonnet at $3/$15 per MTok that's roughly $13.5k/month assuming an even input/output split. On Llama 3.3 70B via Together AI at ~$0.80/$0.80 per MTok it's roughly $1.2k/month, an 11x saving, with a quality drop that is measurable but acceptable per their eval suite. They keep Sonnet for the 5% of hardest queries via a router.
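The arithmetic behind those figures, as a quick sketch (assuming an even input/output split and a 30-day month; prices in dollars per million tokens):

    tokens_per_month = 50e6 * 30              # 50M tokens/day -> 1.5B tokens/month
    mtok = tokens_per_month / 1e6             # 1,500 MTok

    def monthly_cost(price_in, price_out, input_share=0.5):
        return mtok * (input_share * price_in + (1 - input_share) * price_out)

    sonnet = monthly_cost(3.00, 15.00)        # ~$13,500/month
    llama = monthly_cost(0.80, 0.80)          # ~$1,200/month
    print(f"${sonnet:,.0f} vs ${llama:,.0f} per month ({sonnet / llama:.0f}x)")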
Why it matters
Inference providers turn open-weight models from "interesting research project" into a real production option. If you only know about closed APIs, you're leaving 5-20x cost savings on the table for workloads that don't need frontier-tier quality. Building a router that mixes a closed frontier model and an open-weight inference provider is the standard cost-optimization pattern in 2026.
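A minimal sketch of that router, assuming both backends are reachable through OpenAI-compatible endpoints (the frontier model here via an aggregator) and a hypothetical is_hard() triage function; endpoints and model ids are illustrative, not prescribed.

    from openai import OpenAI

    CHEAP = {"base_url": "https://api.together.xyz/v1", "key": "TOGETHER_KEY",
             "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo"}
    FRONTIER = {"base_url": "https://openrouter.ai/api/v1", "key": "OPENROUTER_KEY",
                "model": "anthropic/claude-sonnet-4"}

    def is_hard(prompt: str) -> bool:
        # Hypothetical triage: real routers use a small classifier, a heuristic,
        # or the cheap model's own confidence on a first pass.
        return len(prompt) > 8000 or "prove" in prompt.lower()

    def route(prompt: str) -> str:
        target = FRONTIER if is_hard(prompt) else CHEAP
        client = OpenAI(base_url=target["base_url"], api_key=target["key"])
        resp = client.chat.completions.create(
            model=target["model"],
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content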