
Chunking

Also known as: document splitting, text segmentation

Splitting source documents into smaller pieces before embedding so each chunk is small enough to retrieve precisely and fit in the prompt.

What it means

Embedding models have token limits (typically 512 to 8,192 tokens), and even when a document fits under the limit, embedding a whole 200-page PDF as one vector loses all locality: the vector becomes an average of everything and matches nothing well. So you chunk: split the document into smaller units, embed each one, and index them separately. This unsexy step quietly decides whether your RAG system works.

Common strategies, roughly in order of sophistication:

- Fixed-size chunking (every 500 tokens, often with a 50-token overlap to avoid cutting mid-thought) is the default and surprisingly hard to beat (see the first sketch below).
- Sentence- or paragraph-boundary chunking respects natural breaks and keeps ideas intact.
- Recursive character splitting (LangChain's classic) tries paragraph breaks, then sentences, then words, falling back as needed.
- Semantic chunking uses an embedding model to find topic shifts and splits there (see the second sketch below).
- Document-aware chunking respects structure: headings, tables, code blocks, list items.

The hard problem is context loss. A chunk from the middle of a contract that says "the party shall pay $50,000 within 30 days" has no idea who "the party" is or which contract it came from. Solutions in 2026: prepend section headers and document titles to each chunk; store metadata (doc_id, section, page) for filtering and citation; generate "contextual chunks," where an LLM rewrites each chunk with a one-sentence summary of where it came from; or use late chunking, where you embed the whole document first and pool slices of the token embeddings.

There's no universally right chunk size. ~200 tokens is good for precise factoid retrieval (Q&A over a wiki), ~1,000 tokens is better for summarization or contextual reasoning, and code and legal text often want larger units that respect structure. Test on your eval set, not on vibes.
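
A minimal sketch of the fixed-size strategy with overlap. The whitespace split is a stand-in for a real tokenizer, and `chunk_fixed` and its defaults are illustrative, not a library API:

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` tokens, repeating `overlap`
    tokens between consecutive chunks so no thought is cut cleanly in half.
    Assumes chunk_size > overlap."""
    tokens = text.split()  # naive whitespace "tokenizer"; swap in a real one
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window reached the end; avoid a tiny trailing duplicate
    return chunks
```

In practice you'd count tokens with the embedding model's own tokenizer (tiktoken for OpenAI models, for instance) so chunk sizes line up with the model's actual limit.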

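And a sketch of the semantic-chunking idea: embed each sentence, then start a new chunk wherever similarity between neighbors drops. It assumes the sentence-transformers package; the model name, the regex sentence splitter, and the 0.6 threshold are all arbitrary choices, and production versions typically compare sliding windows of sentences rather than single neighbors:

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def chunk_semantic(text: str, threshold: float = 0.6) -> list[str]:
    """Group consecutive sentences, splitting wherever cosine similarity
    between adjacent sentences drops below `threshold` (a crude proxy
    for a topic shift)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # normalize_embeddings=True makes the dot product equal cosine similarity
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(embs, embs[1:], sentences[1:]):
        if float(np.dot(prev, cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```
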
Example

A 50-page employee handbook (roughly 30,000 tokens) is split into ~65 chunks of 500 tokens each with 50-token overlap. Each chunk is prepended with the document title and section heading before embedding, so a retrieved chunk knows it came from "Section 7: PTO Policy."
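
A sketch of that prepend step. The function name, field names, and sample values are all made up for illustration; the metadata dict is the kind of thing you'd store alongside the vector for filtering and citation:

```python
def contextualize(chunk: str, doc_title: str, section: str,
                  page: int, doc_id: str) -> tuple[str, dict]:
    """Prepend document context to a chunk before embedding, and build
    the metadata stored alongside the vector for filtering and citation."""
    text = f"[{doc_title} / {section}]\n\n{chunk}"
    metadata = {"doc_id": doc_id, "section": section, "page": page}
    return text, metadata

text, meta = contextualize(
    chunk="Employees accrue 1.5 days of PTO per month of service...",
    doc_title="Employee Handbook",
    section="Section 7: PTO Policy",
    page=23,
    doc_id="employee-handbook",
)
# `text` is what gets embedded; `meta` travels with the vector so results
# can be filtered by document/section and cited by page.
```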

Why it matters

Chunking decides what your retriever can even see. Bad chunks hide the answer (split across two pieces, neither complete) or drown it (one chunk per chapter, so retrieval is too coarse). Most RAG systems that "don't work" have a chunking problem, not a model problem.
