Latency

What it means

Latency in LLM apps has two numbers that matter. TTFT (time to first token) is how long until the model starts streaming output, which is what the user perceives as "responsiveness." Total time is TTFT plus output generation time, where generation time is the number of output tokens divided by the per-token throughput of the deployment (typically 50-200 tokens/sec for frontier models in 2026).

For a chat UI, TTFT under 1s feels instant; 1-3s is noticeable but acceptable; 3s+ feels slow. Streaming is mandatory because a 500-token response at 80 tok/s takes ~6 seconds total: fine if the user is reading along, painful if they have to wait for all of it. For agent loops with no human in the loop, total time is what matters, since the next step blocks on the previous one.

Reasoning models invert the trade-off. Claude Opus 4.x with extended thinking, OpenAI o-series, DeepSeek R1, and Gemini Thinking can spend 30 seconds to several minutes on internal reasoning before producing visible output. That's by design: they trade latency for accuracy. For interactive UIs, this is a UX problem (show a "thinking..." state, stream the reasoning if available). For batch tasks, the added latency is usually an acceptable price for higher quality. Choose a model tier (Haiku-class fast vs Opus-class smart vs reasoning-tier deep) based on which axis your app needs.
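The arithmetic above can be written down as a small back-of-the-envelope model. This is only a sketch: the function names are made up for illustration, the formula assumes constant throughput once streaming starts, and the TTFT buckets simply encode the rule-of-thumb thresholds quoted above.

```python
def estimate_total_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total time = TTFT + output tokens / throughput (assumes steady throughput)."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


def classify_chat_ttft(ttft_s: float) -> str:
    """Bucket a TTFT using the chat-UI rule of thumb: <1s, 1-3s, 3s+."""
    if ttft_s < 1.0:
        return "instant"
    if ttft_s <= 3.0:
        return "acceptable"
    return "slow"


if __name__ == "__main__":
    # 500 tokens at 80 tok/s is 6.25s of generation; add a 0.5s TTFT.
    print(estimate_total_seconds(0.5, 500, 80.0))  # 6.75
    print(classify_chat_ttft(0.5))                 # instant
```

With streaming, the user starts reading after the TTFT portion; without it, they wait for the full estimate.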

Example

Claude Haiku 4.x typically hits TTFT around 200-400ms with throughput near 150 tok/s. Claude Opus 4.7 with extended thinking can take 20-90s before any visible output. Same API, two completely different UX patterns required.
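To see which of these two patterns a given deployment falls into, you can time a streamed response yourself. The sketch below is provider-agnostic: `measure_stream` is a hypothetical helper that accepts any lazy iterable of text chunks, so it makes no assumptions about a particular SDK. The `fake_stream` generator is a stand-in for a real streaming response, and its delays are invented for the demo.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a lazy stream of chunks and record TTFT and total time."""
    start = clock()
    first_chunk_at: Optional[float] = None
    count = 0
    for _ in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # first visible output arrived
        count += 1
    end = clock()
    ttft = None if first_chunk_at is None else first_chunk_at - start
    return {"ttft_s": ttft, "total_s": end - start, "chunks": count}


def fake_stream(first_delay_s: float, gap_s: float, n: int) -> Iterator[str]:
    """Stand-in for a model stream: waits, then yields n chunks."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i > 0:
            time.sleep(gap_s)
        yield f"chunk-{i}"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(first_delay_s=0.2, gap_s=0.01, n=20))
    print(stats)
```

The stream must be lazy (a generator or a streaming response object). If the chunks are already collected in a list, the measured TTFT will be close to zero and says nothing about the model.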

Why it matters

Latency is what users feel. A model that's 5% smarter but 3x slower will lose to the faster one in most consumer products. For agents and batch pipelines, latency compounds: a 10-step sequential agent at 5s/step takes 50 seconds. Know your tail latency (p95/p99), not just the average; LLM APIs have heavy tails.
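Both points can be checked with a few lines of arithmetic. The sketch below uses a nearest-rank percentile, which is one common definition among several, and the latency samples are invented purely to illustrate a heavy tail; the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sequential_agent_seconds(steps: int, per_step_s: float) -> float:
    """Each step blocks on the previous one, so step latencies add up."""
    return steps * per_step_s


if __name__ == "__main__":
    # Made-up request latencies in seconds: mostly fast, with two slow outliers.
    latencies = [1.0] * 18 + [4.0, 12.0]
    print("mean:", statistics.mean(latencies))
    print("p50: ", percentile(latencies, 50))   # 1.0
    print("p95: ", percentile(latencies, 95))   # 4.0
    print("p99: ", percentile(latencies, 99))   # 12.0
    print("10-step agent:", sequential_agent_seconds(10, 5.0))  # 50.0
```

In this toy data the median is 1s while p99 is 12s, which is the gap an average alone would hide.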
