Rate limit
Also known as: RPM/TPM limits, quota, 429 errors
A provider-imposed cap on how many requests or tokens you can send per minute or per day. The thing that breaks production at scale.
What it means
Rate limits come in two flavors: RPM (requests per minute) and TPM (tokens per minute), often with a daily token cap on top. Every major provider tiers them by spend and account age. A new OpenAI account on Tier 1 might get 500 RPM / 200k TPM on GPT-5; a Tier 5 enterprise account gets 10k RPM / 30M TPM. Anthropic, Google, and xAI all have similar tier ladders.
Rate limits matter most at three points: cold-start launch (your free trial goes viral, you 429 on day one), bursty workloads (overnight batch jobs hit the daily cap), and agent fan-out (one user request becomes 50 model calls via tool use, multiplying load). The 429 Too Many Requests error in your logs is the canonical symptom. Frontier providers all return rate-limit headers (x-ratelimit-remaining-tokens, etc.) so you can self-throttle, but most SDKs don't surface them by default.
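Those headers make self-throttling possible even when the SDK hides them. A minimal sketch, assuming you can reach the raw HTTP response headers; the `x-ratelimit-remaining-tokens` name is OpenAI's convention, other providers use similar but differently named headers, and the `floor` and `pause` values here are illustrative, not recommendations:

```python
def recommended_pause(headers, floor=1000, pause=2.0):
    """Return how long to sleep before the next call, based on the
    provider's remaining-token budget header.

    headers: dict of response headers (case-sensitive lookup assumed;
             normalize keys if your HTTP client preserves case).
    floor:   remaining-token threshold below which we start pausing.
    pause:   seconds to wait when under the floor.
    """
    remaining = int(headers.get("x-ratelimit-remaining-tokens", floor + 1))
    return pause if remaining <= floor else 0.0
```

Calling `time.sleep(recommended_pause(resp.headers))` after each response turns a hard 429 into a soft slowdown.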
Standard mitigations: exponential backoff with jitter on 429, request distribution across multiple API keys or providers (OpenRouter, LiteLLM, and Portkey help here), batch APIs for non-urgent work (separate quota, 50% cheaper), and a pre-flight token estimate so oversized requests never leave your service. For real production scale, talk to your provider's sales team: Tier 5+ limits are negotiable, and Anthropic and OpenAI both offer dedicated-capacity contracts. Self-hosting eliminates rate limits but creates a different scaling problem (your own GPU capacity).
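The first mitigation, exponential backoff with full jitter, fits in a few lines. A sketch, where `RateLimitError` is a stand-in for whatever your SDK raises on a 429 (e.g. `openai.RateLimitError`) and the retry count, base, and cap are illustrative:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception."""

def call_with_backoff(fn, max_retries=5, base=1.0, cap=30.0):
    """Call fn(), retrying on rate-limit errors with exponentially
    growing, fully jittered delays: sleep uniform(0, min(cap, base*2^n))."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    return fn()  # final attempt; let the 429 propagate to the caller
```

Full jitter (rather than sleeping exactly `base * 2^n`) matters because synchronized clients retrying on the same schedule re-create the very burst that triggered the 429.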
Example
A startup builds an email summarizer that fans out 10 Claude calls per inbox sync. At 1,000 active users syncing every 5 minutes, that's 10,000 calls per window, a sustained 2,000 RPM (and higher at peak if syncs cluster). They hit the Tier 2 limit (1,000 RPM) within a week of launch, get 429s, and spend a sprint moving to a multi-provider router and the batch API for non-realtime calls.
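The fan-out arithmetic is worth making explicit, because it's easy to underestimate how fast multiplied calls eat a tier. A one-liner (function name is hypothetical):

```python
def sustained_rpm(users, calls_per_sync, sync_interval_min):
    """Average requests per minute implied by a fan-out workload:
    every user triggers calls_per_sync model calls once per interval."""
    return users * calls_per_sync / sync_interval_min

# 1,000 users x 10 calls, syncing every 5 minutes -> 2,000 RPM sustained
```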
Why it matters
Rate limits are among the most common causes of production incidents for AI apps. Plan for them: instrument 429s as a first-class metric, design for graceful degradation (queue, retry, fall back to a cheaper model), and pre-negotiate higher tiers before launch traffic arrives, not after.
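That degradation path can be sketched as a thin wrapper. `with_fallback` and the `on_throttle` hook are hypothetical names, and `RateLimitError` again stands in for your SDK's 429 exception; in practice `on_throttle` would increment your 429 metric:

```python
class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception."""

def with_fallback(primary, fallback, on_throttle=None):
    """Call primary(); if it's rate-limited, record the event via the
    on_throttle hook and serve the request from the cheaper fallback()."""
    try:
        return primary()
    except RateLimitError:
        if on_throttle:
            on_throttle()  # e.g. bump a "429s" counter in your metrics
        return fallback()
```

A third tier (queue the job for the batch API when the fallback also throttles) follows the same shape.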