
RAG (Retrieval-Augmented Generation)

Also known as: retrieval-augmented generation, grounded generation

A pattern that grounds an LLM in your data: retrieve the most relevant chunks of text at query time, paste them into the prompt, then let the model answer.

What it means

RAG is the dominant pattern in 2026 for getting an LLM to answer questions about content it wasn't trained on: your company wiki, a 5,000-page legal corpus, last week's Slack threads. The pipeline is simple in theory: index your documents into a vector database, embed the user's question, pull the top-K nearest chunks, stuff them into the prompt as context, and let the model generate the answer.

The reason RAG beats fine-tuning for most knowledge problems is that the data layer is editable. Fine-tuning bakes information into model weights; when a policy changes, you re-train. With RAG, you re-index. RAG also gives you citations (you know which chunk fed the answer) and works on day-zero data the model has never seen. Fine-tuning is for behavior and style; RAG is for knowledge.

Naive RAG fails in predictable ways. Chunks lose context: a paragraph in the middle of a contract has no idea it belongs to Section 4. Embedding-only retrieval misses exact-match queries (product SKUs, legal citations, error codes). The retriever pulls 20 plausible-but-wrong chunks and the model confidently confabulates from them. Production systems in 2026 layer in hybrid search, reranking, query rewriting, and structured metadata filters to make RAG actually work.

A typical modern RAG stack: chunk documents into ~500-token pieces with overlap, embed with a model like Voyage or OpenAI text-embedding-3, store in Pinecone or Qdrant, retrieve top-20 with hybrid search (BM25 + vectors), rerank to top-5 with a cross-encoder, then send to Claude or GPT-4-class models with a system prompt that says "answer only from these documents."
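To make the hybrid-search step concrete, here is a minimal, self-contained sketch. Everything in it is illustrative: `embed` is a hashed bag-of-words stand-in for a real embedding model, the chunker counts words rather than tokens, reciprocal rank fusion stands in for whatever score fusion your vector store provides, and the cross-encoder rerank stage is omitted. A production system would call an embedding API and a search engine instead of scoring in memory.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunks with overlap. Word-based here; token-based in practice."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Classic BM25, scored in memory; a real stack uses the search engine's BM25."""
    doc_toks = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in doc_toks) / len(doc_toks)
    df = Counter(term for toks in doc_toks for term in set(toks))
    n = len(docs)
    out = []
    for toks in doc_toks:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
                score += idf * norm
        out.append(score)
    return out

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for a real embedding model: hashed bag of words."""
    v = [0.0] * dim
    for tok in tokenize(text):
        v[hash(tok) % dim] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query: str, chunks: list[str], top_k: int = 20, rrf_k: int = 60) -> list[int]:
    """Fuse BM25 and vector rankings with reciprocal rank fusion; return chunk indices."""
    bm25 = bm25_scores(query, chunks)
    qv = embed(query)
    vec = [cosine(qv, embed(c)) for c in chunks]

    def ranks(scores: list[float]) -> dict[int, int]:
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return {i: r for r, i in enumerate(order)}

    br, vr = ranks(bm25), ranks(vec)
    fused = {i: 1 / (rrf_k + br[i] + 1) + 1 / (rrf_k + vr[i] + 1) for i in range(len(chunks))}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```

The fusion step is the point: BM25 catches the exact-match queries (SKUs, error codes, citations) that embeddings miss, vectors catch the paraphrases that BM25 misses, and reciprocal rank fusion combines the two rankings without having to calibrate their score scales against each other. A cross-encoder reranker would then take the top-20 indices from `hybrid_retrieve` down to the top-5 that go into the prompt.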

Example

A customer-support bot answers "What is your refund policy?" by retrieving 5 chunks from the help center, pasting them into the prompt, and letting Claude write the answer with inline citations to the source articles.
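For concreteness, a sketch of that last step: packing retrieved chunks into a prompt that demands grounded, cited answers. The chunk shape (`source` plus `text`) and the system-prompt wording are illustrative, not any particular product's format.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    """Return (system_prompt, user_message) for a grounded, citing answer."""
    context = "\n\n".join(
        f"[{i + 1}] {c['source']}\n{c['text']}" for i, c in enumerate(chunks)
    )
    system = (
        "Answer using only the documents below. Cite sources inline as [n]. "
        "If the documents do not contain the answer, say you don't know.\n\n"
        + context
    )
    return system, question
```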

Why it matters

RAG is how almost every 'chat with your docs' product works under the hood. If you're building anything that needs to answer questions about non-public data — internal knowledge bases, customer data, real-time information — RAG is the default architecture. Understanding its failure modes is the difference between a demo and a system people trust.
