Token cost

Also known as: per-token pricing, API pricing, $/MTok

How API providers price model usage — dollars per million tokens, with input tokens cheaper than output tokens.

What it means

Every API provider prices in tokens, almost always quoted per million tokens (per-MTok). Input tokens (your prompt + context) are cheaper than output tokens (the model's response), usually by 3-5x. The asymmetry exists because output is the bottleneck: a model can ingest a million-token prompt in seconds, but it generates output one token at a time on expensive accelerator memory.

Mid-2026 ballpark pricing: Anthropic Claude Opus 4.7 = ~$15/$75 per M input/output, Sonnet 4.6 = ~$3/$15, Haiku 4.x = ~$0.80/$4. OpenAI GPT-5 = ~$10/$30, GPT-5 mini = ~$0.40/$1.60. Google Gemini 3 Pro = ~$1.25/$5, Gemini 3 Flash = ~$0.10/$0.40. xAI Grok 4 = ~$5/$15. Open-weight models via inference providers (DeepSeek V3 on Together, Llama 4 on Fireworks) often run $0.20-$1 per M tokens combined, which is why cost-sensitive apps route to them.

The hidden costs to watch: prompt caching can drop input cost by 90% on cache hits — Anthropic and OpenAI both support it, but you have to architect prompts to be cache-friendly. Reasoning models bill thinking tokens as output tokens, so an o-series or extended-thinking call that outputs 500 visible tokens may have generated 5,000 billable thinking tokens. Batch APIs are 50% off but async only. For agentic apps, the dominant cost is usually the long input from re-sending tool-call history every turn — RAG with a vector DB beats stuffing everything into context for cost reasons too.
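
To make the billing rules concrete, here is a minimal cost model in Python. The prices mirror the ballpark figures above; the 90% cache discount and 50% batch discount are the figures quoted above. All names are illustrative, not any provider's SDK, and real billing has details this ignores (e.g. cache-write surcharges).

```python
# Minimal sketch of per-MTok billing. Names and model keys are
# illustrative; prices are the ballpark figures from this entry.

PRICES = {  # model -> (input $/MTok, output $/MTok)
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.x": (0.80, 4.00),
    "gpt-5-mini": (0.40, 1.60),
}

def call_cost(model, input_tokens, output_tokens,
              cached_input_tokens=0, thinking_tokens=0,
              cache_discount=0.90, batch=False):
    """Dollar cost of one API call.

    Cache hits bill at 10% of the input rate (90% discount);
    thinking tokens bill at the output rate; batch halves the total.
    Cache *writes* are not modeled here.
    """
    in_price, out_price = PRICES[model]
    uncached_input = input_tokens - cached_input_tokens
    cost = (
        uncached_input * in_price
        + cached_input_tokens * in_price * (1 - cache_discount)
        + (output_tokens + thinking_tokens) * out_price
    ) / 1_000_000
    return cost * (0.5 if batch else 1.0)

# A reasoning call: 500 visible output tokens, 5,000 billable thinking tokens.
print(call_cost("sonnet-4.6", 8_000, 500, thinking_tokens=5_000))  # ~$0.107
```

Note how the thinking tokens dominate: the 500 visible tokens cost $0.0075 while the 5,000 hidden ones cost ten times that.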

Example

A coding agent makes 20 turns averaging 8k input + 1k output per turn on Claude Sonnet 4.6: 20 × (8 × $3 + 1 × $15) / 1000 = ~$0.78 per session. The same workload on Haiku is ~$0.21. Over 100k sessions/month, that gap (roughly $78k vs. $21k a month) picks the model.
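
The same arithmetic as a runnable sketch (function name illustrative; prices are the Sonnet 4.6 and Haiku figures above):

```python
def session_cost(turns, in_tokens, out_tokens, in_price, out_price):
    """Cost of an agent session at per-MTok prices, no caching."""
    per_turn = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return turns * per_turn

sonnet = session_cost(20, 8_000, 1_000, 3.00, 15.00)
haiku = session_cost(20, 8_000, 1_000, 0.80, 4.00)
print(sonnet)            # 0.78
print(haiku)             # ~0.21
print(100_000 * sonnet)  # 78,000.0 per month at 100k sessions
```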

Why it matters

Token cost drives the unit economics of any LLM product. A feature that costs $0.10 per use is fine at $20/mo SaaS; at $5/use it isn't. Most teams underestimate output-side costs (long responses, thinking tokens) and overestimate input-side costs (cacheable system prompts). Model the actual call patterns before scaling.
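
A back-of-envelope margin check makes the $0.10-vs-$5 point explicit. The 100 uses per seat per month is an illustrative assumption, not a figure from this entry:

```python
def monthly_margin(price_per_seat, uses_per_seat, cost_per_use):
    """Per-seat gross margin after inference costs (assumed usage rate)."""
    return price_per_seat - uses_per_seat * cost_per_use

print(monthly_margin(20.00, 100, 0.10))  #  10.0  -> viable at $20/mo
print(monthly_margin(20.00, 100, 5.00))  # -480.0 -> underwater
```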
