What is a token in AI? The unit that controls cost and output
Tokens are how models read, generate, and bill. The mental model, why output costs more than input, why your AI bill is bigger than expected, and the 7 levers to cut cost without breaking quality.
What is a token in AI? The unit that controls cost and output
A token is the unit of text an LLM processes — usually a word, a part of a word, or a punctuation mark. Models read in tokens, generate in tokens, and bill in tokens. Understanding tokens is the difference between guessing why your AI bill spiked and knowing.
How text becomes tokens
Before a model sees your prompt, the text is broken up by a tokenizer. The tokenizer is a fixed lookup table that maps strings to numeric IDs. Common chunks get their own ID; uncommon ones get split into pieces.
A rough mental model for English:
- 1 token ≈ 4 characters
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words ≈ 1 short paragraph
- 1,000 tokens ≈ 750 words ≈ 1 page
- 10,000 tokens ≈ 7,500 words ≈ 1 short article
Languages other than English are usually less efficient — German, Japanese, and Arabic often need 2–3x more tokens for the same content. Code is somewhere in between, depending on the language.
You can see exactly how a given model splits text using a tokenizer playground (most providers offer one). Worth doing once for any text shape you'll be processing a lot — what looks like 200 words to you might be 500 tokens.
Why tokens are billed, not characters or words
Cost scales with compute, and compute scales with tokens, not characters. A token is one "step" through the model. A 5-letter token and a 1-letter token cost the same to process.
This is why pricing pages quote dollars per million tokens, not per page or per request. A typical commercial API in 2026:
- Input tokens: $1–$15 per million depending on model tier
- Output tokens: $5–$60 per million (output is more expensive than input)
Output costs more because the model is "deciding" each output token; input is just being read.
Why output tokens cost more
When a model generates output, it's running a full inference pass for every token it produces. One token at a time, with the full prompt + everything generated so far loaded in. This is more compute than a single pass to read your input.
For the same total token count, output cost can be 3–5x input cost. This has practical implications:
- A summarization task (long input, short output) is much cheaper than a generation task (short input, long output)
- Asking the model for terse responses isn't just about readability — it's billing optimization
- A chatbot that includes its full history with every turn has growing input costs, but if it also writes long responses, the output costs grow faster
Why your AI bill is bigger than you expected
The most common surprises:
1. You're paying for the whole conversation, every turn
When you send a 5th message in a chat, the model sees messages 1–4 again, then your new message. If each turn is ~500 tokens, by message 10 you're paying for ~5,000 input tokens per call. By message 30 it's 15,000.
This is why chat apps that retain history get expensive. There's no memory — every turn re-uploads the past.
2. System prompts are tokens too
A long system prompt (the instructions you set on top of every conversation) is added to every call. A 2,000-token system prompt × 10,000 calls = 20 million input tokens just from instructions.
3. RAG context can be massive
Retrieval-augmented generation pipes documents into the prompt. If you retrieve 5 chunks of 1,000 tokens each per query, you're paying for 5,000 extra input tokens per request before the model even sees the question.
4. Tool calls multiply token count
When a model uses tools (search, calculator, function calls), each tool invocation is a separate model call, with its own input and output tokens. A multi-step agent can spend 10–50x more tokens than a single chat response.
5. Retries on errors
If the model returns malformed output and your code retries, you pay for both attempts.
How tokens limit what fits in context
Every model has a context window — a maximum total token count for input plus output. Common sizes:
- Older / small models: 4K–32K tokens
- Current frontier (2026): 200K–1M tokens
A 200K context window sounds like a lot. It is — about 150,000 words, or a long novel. But it fills up fast:
- A 50-page PDF: ~30K tokens
- A modest codebase: 100K–500K tokens
- A year of chat history: easily 100K+
Once you exceed the window, the oldest content gets dropped or the request errors out. See Context windows explained for the deeper dive.
How to cut token cost without breaking output
In rough order of impact:
- Truncate conversation history to the last N messages instead of sending all of them
- Strip system-prompt fat — every word you don't need is paid for in every call
- Use the cheapest model that meets the quality bar — frontier models can be 10x the cost of competent smaller ones
- Cache repeated content — many providers offer prompt caching for content that doesn't change across calls (system prompt, RAG chunks)
- Cap output length —
max_tokensexists for a reason - Stream when possible — doesn't reduce token count, but lets you stop early if the output is going wrong
- Batch async work — providers offer batch APIs at ~50% the price for non-real-time tasks
See the LLM Cost + Quality Tuner skill for the structured exercise.
Why counting tokens is harder than counting words
The same text can produce different token counts in different models. OpenAI's tokenizer is different from Anthropic's, which is different from Google's. A prompt that's 1,000 tokens for GPT might be 1,200 for Claude.
For accurate counts, use the official tokenizer for your model:
- OpenAI:
tiktokenlibrary - Anthropic: their token-counting endpoint
- Google: their Vertex AI token counting
Don't estimate by character count when precision matters (budgeting, prompt design near a context limit). The rough rule "1 token = 4 chars" is good for ballparking, not for engineering.
What to read next
- Context windows explained: what they limit and what they don't — the related constraint on what fits
- How large language models work — what tokens are flowing through
- LLM Cost + Quality Tuner — practical cost reduction
Next in this pillar
Context windows explained: what they limit, what they do notGet the next guide when it lands
One email on Sunday with new /learn guides, tool updates, and a couple of links worth reading.