Tokenizer

Also known as: BPE, byte pair encoding, subword tokenizer

The component that splits raw text into tokens before the model sees it. Different models use different tokenizers, so the same word can cost more on one model than another.

What it means

A tokenizer is the preprocessing step that converts a string of text into a sequence of integer token IDs the model can actually consume. Most modern LLMs use Byte Pair Encoding (BPE) or a variant: start with raw bytes, then iteratively merge the most common adjacent pairs into single tokens until you have a vocabulary of around 50k–200k entries.

The tokenizer is trained on a corpus before the model. Whatever languages, code patterns, and domain vocabulary appeared in that corpus get efficient single-token encodings; everything else gets shredded into byte-level pieces. This is why GPT models historically tokenize English very efficiently but burn 3x more tokens on Korean. Newer tokenizers (Claude's, Gemini's, GPT-5's) are noticeably better at multilingual and code tokenization than older ones.

The same word can cost different amounts across providers because each has its own tokenizer. A 1,000-word English document might be 1,300 tokens for GPT-5, 1,250 for Claude, and 1,400 for an older Llama. For pricing comparisons, this 5–15% difference is real, though usually swamped by per-token rate differences. For context windows, it matters more: a "1M token" context isn't the same amount of actual text across two different tokenizers.
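The merge loop itself is simple enough to sketch. Below is a minimal, character-level illustration of BPE training, not any provider's actual tokenizer: the corpus, function name, and merge count are made up for the example, and real tokenizers operate on raw bytes and learn tens of thousands of merges.

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent pair.
# Character-level for readability; real tokenizers start from raw bytes.
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    # Begin with each word as a sequence of single characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Replace every occurrence of the winning pair with a merged token.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return merges

print(learn_bpe_merges("low lower lowest slow slowly", 5))
# Frequent pairs like ('l', 'o') and ('lo', 'w') get merged first, so "low"
# ends up as a single token while rarer character runs stay split apart.
```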

Example

OpenAI's tiktoken library lets you count tokens locally for GPT models. Anthropic exposes a similar count_tokens endpoint for Claude. Running the same paragraph through both will give you slightly different counts.
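A quick sketch of the local counting side with tiktoken; the sample text and model name are placeholders, and the model-to-encoding mapping depends on your tiktoken version:

```python
# Count tokens locally with tiktoken. Model name and sample text are
# placeholders for whatever you actually target.
import tiktoken

text = "Tokenization is why the same paragraph costs different amounts per model."

# encoding_for_model looks up the tokenizer used by a given OpenAI model;
# get_encoding("cl100k_base") works if you already know the encoding name.
enc = tiktoken.encoding_for_model("gpt-4o")
token_ids = enc.encode(text)

print(len(token_ids))   # how many tokens you'd be billed for
print(token_ids[:8])    # the integer IDs the model actually consumes
```

The equivalent check for Claude goes through Anthropic's count_tokens endpoint rather than a local library, so it requires an API call.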

Why it matters

If you're building anything that estimates cost or fits content into a context window, you need to count tokens with the actual tokenizer of the target model — not approximate from word count. Off-by-tokenizer errors are a common cause of mysterious truncation and budget overruns.
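For the fit-to-window case, a minimal sketch that trims a document to a token budget using the target model's tokenizer, reusing tiktoken as the counting tool; the budget, model name, and helper name are illustrative, not a standard API:

```python
# Trim a document to a token budget using the target model's tokenizer,
# instead of estimating from word count. Budget and model are illustrative.
import tiktoken

def fit_to_budget(text: str, budget: int, model: str = "gpt-4o") -> str:
    enc = tiktoken.encoding_for_model(model)
    ids = enc.encode(text)
    if len(ids) <= budget:
        return text
    # Decode only the first `budget` tokens; this may cut mid-word, which is
    # exactly the kind of truncation you want to control yourself.
    return enc.decode(ids[:budget])

document = "some very long document " * 2_000   # stand-in for real content
print(len(fit_to_budget(document, budget=500)))
```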

Related terms