Tokenizer

Also known as: BPE, byte pair encoding, subword tokenizer

The component that splits raw text into tokens before the model sees it. Different models use different tokenizers, so the same word can cost more on one model than another.

What it means

A tokenizer is the preprocessing step that converts a string of text into a sequence of integer token IDs the model can actually consume. Most modern LLMs use Byte Pair Encoding (BPE) or a variant: start with raw bytes, then iteratively merge the most common adjacent pairs into single tokens until you have a vocabulary of around 50k–200k entries.

The tokenizer is trained on a corpus before the model. Whatever languages, code patterns, and domain vocabulary appeared in that corpus get efficient single-token encodings; everything else gets shredded into byte-level pieces. This is why GPT models historically tokenize English very efficiently but burn 3x more tokens on Korean. Newer tokenizers (Claude's, Gemini's, GPT-5's) are noticeably better at multilingual and code tokenization than older ones.

The same word can cost different amounts across providers because each has its own tokenizer. A 1,000-word English document might be 1,300 tokens for GPT-5, 1,250 for Claude, and 1,400 for an older Llama. For pricing comparisons, this 5–15% difference is real, though usually swamped by per-token rate differences. For context windows, it matters more: a "1M token" context isn't the same amount of actual text across two different tokenizers.
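The merge loop itself is simple enough to sketch. Below is a minimal, character-level illustration of BPE training, not any provider's actual tokenizer: the corpus, function name, and merge count are made up for the example, and real tokenizers operate on raw bytes and learn tens of thousands of merges.

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent pair.
# Character-level for readability; real tokenizers start from raw bytes.
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    # Begin with each word as a sequence of single characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Replace every occurrence of the winning pair with a merged token.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return merges

print(learn_bpe_merges("low lower lowest slow slowly", 5))
# Frequent pairs like ('l', 'o') and ('lo', 'w') get merged first, so "low"
# ends up as a single token while rarer character runs stay split apart.
```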

Example

OpenAI's tiktoken library lets you count tokens locally for GPT models. Anthropic exposes a similar count_tokens endpoint for Claude. Running the same paragraph through both will give you slightly different counts.
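A quick sketch of the local counting side with tiktoken; the sample text and model name are placeholders, and the model-to-encoding mapping depends on your tiktoken version:

```python
# Count tokens locally with tiktoken. Model name and sample text are
# placeholders for whatever you actually target.
import tiktoken

text = "Tokenization is why the same paragraph costs different amounts per model."

# encoding_for_model looks up the tokenizer used by a given OpenAI model;
# get_encoding("cl100k_base") works if you already know the encoding name.
enc = tiktoken.encoding_for_model("gpt-4o")
token_ids = enc.encode(text)

print(len(token_ids))   # how many tokens you'd be billed for
print(token_ids[:8])    # the integer IDs the model actually consumes
```

The equivalent check for Claude goes through Anthropic's count_tokens endpoint rather than a local library, so it requires an API call.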

Why it matters

If you're building anything that estimates cost or fits content into a context window, you need to count tokens with the actual tokenizer of the target model — not approximate from word count. Off-by-tokenizer errors are a common cause of mysterious truncation and budget overruns.
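For the fit-to-window case, a minimal sketch that trims a document to a token budget using the target model's tokenizer, reusing tiktoken as the counting tool; the budget, model name, and helper name are illustrative, not a standard API:

```python
# Trim a document to a token budget using the target model's tokenizer,
# instead of estimating from word count. Budget and model are illustrative.
import tiktoken

def fit_to_budget(text: str, budget: int, model: str = "gpt-4o") -> str:
    enc = tiktoken.encoding_for_model(model)
    ids = enc.encode(text)
    if len(ids) <= budget:
        return text
    # Decode only the first `budget` tokens; this may cut mid-word, which is
    # exactly the kind of truncation you want to control yourself.
    return enc.decode(ids[:budget])

document = "some very long document " * 2_000   # stand-in for real content
print(len(fit_to_budget(document, budget=500)))
```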

Related terms