Context windows explained: what they limit, what they do not
How much text a model can consider at once, what counts against the window, why quality degrades long before you hit the limit (the "lost in the middle" effect), and how to budget context in production.
Context windows explained: what they limit and what they don't
A context window is the maximum amount of text an LLM can consider in a single request — measured in tokens. Everything the model needs to "see" to answer your question has to fit: your system prompt, conversation history, any documents you've passed in, your current message, and the space reserved for the model's response.
If you exceed the window, something has to be dropped. Knowing what gets dropped (and what doesn't), and what happens to quality long before you hit the limit, is most of what makes "long context" usable in practice.
What sizes are common in 2026
| Model class | Typical window |
|---|---|
| Smaller / older models | 4K–32K tokens |
| Current commercial (mid-tier) | 128K–200K tokens |
| Current frontier | 200K–1M tokens |
| Specialized long-context | 1M–10M tokens (early-stage, limited availability) |
A 200K token window holds roughly 150,000 words — a long novel. A 1M token window holds about 750,000 words — a small library. These numbers grew about 100x in three years, and they're still growing.
What counts against the window
Everything the model sees. In a typical chat API call:
- System prompt — the instructions you set on top of every conversation
- Conversation history — all prior turns, every time
- Current user message
- Any tool definitions or function schemas you exposed
- Any retrieved documents (in a RAG setup)
- Reserved output budget — most APIs require you to specify a max output length, and that's reserved space within the window
If your system prompt is 5K tokens, history is 20K, retrieved docs are 30K, and you've reserved 4K for output, you've already used 59K of a 128K window before the model has done anything.
What happens when you exceed the window
Two failure modes, depending on the provider:
- Hard error: the request fails with "context length exceeded." You see it and can handle it.
- Silent truncation: the provider quietly drops content (usually oldest first) to fit. The request succeeds but the model didn't see everything you sent.
Silent truncation is the worse failure because the model still produces an answer — just one that's missing crucial context. It looks fine until it doesn't. Always check what your provider does at the limit, and prefer providers that error loudly.
Why quality degrades before you hit the limit
A 200K context window doesn't mean the model uses all 200K equally well. Models pay more attention to:
- The beginning of the context (especially the system prompt)
- The end of the context (especially the current question)
Material in the middle of a long context — what researchers call the "lost in the middle" effect — is often retrieved less reliably. A fact in the middle of 100K tokens of input is more likely to be missed than the same fact at the start or end.
Practical implication: for long-context queries, put your question at the end, after the documents. Don't sandwich documents between your question and a "and please summarize the above." The standard format that works:
[system prompt: instructions]
[long documents]
[your specific question, last]
Studies have measured 30%+ quality improvement just from putting the question after the context, for very long inputs.
When the conversation gets long: what gets dropped
Most chat interfaces handle history by sending the most recent N messages and silently dropping older ones. This is why a long ChatGPT conversation eventually "forgets" what you discussed at the start.
If you're building on the API, you have options:
- Sliding window: keep the last N messages, drop the oldest
- Summarization: periodically summarize old turns into a condensed system message
- Selective retention: keep messages tagged "important" (decisions, facts to remember) and drop the small talk
- Move to RAG: dump the conversation into a searchable store and retrieve relevant past turns per query
Each has tradeoffs. Sliding window is simplest. Summarization preserves more but loses fidelity. RAG scales infinitely but adds latency.
Why "fits in context" doesn't equal "uses well"
Even with a 1M token window, you'll hit practical issues before you hit the technical limit:
- Latency: bigger contexts take longer to process. A 200K context request can take 30+ seconds to first token; a 1M request can take minutes.
- Cost: input tokens are billed. Filling a 1M window per query is expensive.
- Attention degradation: as discussed, the middle of long contexts gets less reliable retrieval.
- Reasoning depth: models can struggle to reason across a very long context as fluidly as a short one.
The right tool for a 500K-token document isn't always "stuff it all in context." Often it's RAG (retrieve the relevant chunks per query) or progressive summarization (compress the document into a working summary first, then ask questions of the summary).
See How RAG works, and when to use it for the alternative.
When a long context window earns its keep
- Single-document Q&A on a long document (legal contract, research paper, codebase file)
- Conversation that needs to retain a lot of state (long support thread, complex multi-turn agent task)
- Few-shot prompts with many examples (10+ examples, each non-trivial)
- Agentic loops where each step adds tool output to context
When you don't:
- Short Q&A with no document context
- Tasks where RAG would surface the relevant content faster
- Pure code generation tasks under 5K tokens
How to budget your context window
A practical workflow when designing a prompt or pipeline:
- Measure the static parts: system prompt + any tool definitions + reserved output budget
- Estimate the dynamic parts: typical user input + typical retrieved content + history
- Sum and leave 20% headroom: unexpected long inputs happen; don't run flush with the limit
- Decide what gets dropped first when you exceed: oldest history, or least-relevant RAG chunk, or both
- Log token counts in production: you'll find surprises (long pastes, edge cases) you didn't anticipate
If your typical usage is using more than 50% of your context window, you're at risk of edge cases blowing through. Either upgrade to a larger window or restructure.
What to read next
- What is a token in AI? — the unit being counted
- How RAG works, and when to use it — the alternative when context isn't enough
- LLM Cost + Quality Tuner — practical cost reduction including context-window discipline
Next in this pillar
Embeddings explained: how AI represents meaning as numbersGet the next guide when it lands
One email on Sunday with new /learn guides, tool updates, and a couple of links worth reading.