RAG & retrieval · Updated 53 days ago

RAG (Retrieval-Augmented Generation)

Also known as: retrieval-augmented generation, grounded generation

A pattern that grounds an LLM in your data: at query time, retrieve the most relevant chunks of text, paste them into the prompt, and let the model answer from those chunks instead of from its training data.

What it means

RAG stands for Retrieval-Augmented Generation. It is the default pattern for getting an LLM to answer questions about content it was not trained on: your company wiki, a legal corpus, last week's Slack threads, internal product docs.

How it works in one sentence: index your documents into a vector database, embed the user's question, pull the top-K nearest chunks, paste those chunks into the prompt as context, and let the model generate an answer that uses them.

RAG vs fine-tuning. Fine-tuning bakes information into model weights, so when the data changes, you re-train. RAG keeps the data in a separate retrievable layer, so when a policy changes, you re-index. RAG also gives you citations (you can show which chunk fed the answer) and works on day-zero data the model has never seen. Use fine-tuning for behavior and style; use RAG for knowledge.

RAG vs long context. A 1-million-token context window means you can paste a lot in directly. RAG still wins when the document set is larger than the context, when you need citations, when latency matters, or when most of the corpus would be noise on any given query. Long context wins for short corpora where every word might matter.

RAG vs plain search. Search returns documents; RAG returns answers. The model reads the retrieved chunks and writes the response in the form the user asked for (a summary, a comparison, a recommendation) instead of just linking to sources.

Common failure modes. Chunks lose context (a paragraph in the middle of a contract has no idea it belongs to Section 4). Embedding-only retrieval misses exact-match queries (product SKUs, error codes, legal citations). The retriever pulls 20 plausible-but-wrong chunks and the model confidently confabulates from them. Production RAG layers in hybrid search, reranking, query rewriting, and structured metadata filters to address these.

A modern RAG stack (2026). Chunk documents into ~500-token pieces with overlap → embed with Voyage or OpenAI text-embedding-3 → store in Pinecone, Qdrant, or pgvector → retrieve top-20 with hybrid search (BM25 + vectors) → rerank to top-5 with a cross-encoder → send to Claude or a GPT-4-class model with a system prompt that says "answer only from these documents."

For the full teaching version (how each step actually works, when to use RAG, when not to, and the failure patterns to watch for in beginner systems), read [How RAG works, and when to use it](/learn/ai/build-with-ai/how-rag-works-and-when-to-use-it).
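The core loop described above (chunk, embed, retrieve top-K, paste into the prompt) can be sketched in a few lines of Python. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and every function name and sample sentence here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (a stand-in for ~500-token chunks)."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=3):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, retrieved):
    """Paste the retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, start=1))
    return f"Answer only from these documents.\n\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Invented sample chunks, for illustration only.
    chunks = [
        "Refunds are available within 30 days of purchase.",
        "Our office is closed on public holidays.",
        "Shipping takes five business days.",
    ]
    question = "How do refunds work?"
    print(build_prompt(question, top_k(question, chunks, k=1)))
```

The string returned by `build_prompt` is what would be sent to the model. Swapping `embed` for a real embedding call and the list for a vector store changes the plumbing but not the shape of the loop.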

Example

A customer-support bot answers "What is your refund policy?" by retrieving 5 chunks from the help center, pasting them into the prompt, and letting Claude write the answer with inline citations to the source articles.
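One way the inline citations in this example could be wired up, as a rough sketch: number each retrieved chunk in the prompt, ask the model to cite by number, then map the `[n]` markers in its answer back to source articles. The `Chunk` class, both helper functions, the article titles, and the sample answer are all hypothetical; no model is actually called here.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical help-center article title
    text: str


def format_context(chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )


def cited_sources(answer, chunks):
    """Map [n] markers in an answer back to source titles, in first-seen order.

    Markers that point outside the retrieved set are ignored.
    """
    seen = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        idx = int(marker)
        if 1 <= idx <= len(chunks) and chunks[idx - 1].source not in seen:
            seen.append(chunks[idx - 1].source)
    return seen


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    retrieved = [
        Chunk("Refund policy", "Refunds are available within 30 days of purchase."),
        Chunk("Shipping times", "Shipping takes five business days."),
    ]
    print(format_context(retrieved))
    # A made-up model answer that cites the first chunk.
    answer = "You can request a refund within 30 days of purchase [1]."
    print(cited_sources(answer, retrieved))
```

Ignoring out-of-range markers is a deliberate choice in this sketch: a citation the retriever never supplied is a sign the answer may not be grounded, and a real system might want to flag it instead of silently dropping it.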

Why it matters

RAG is how almost every 'chat with your docs' product works under the hood. If you're building anything that needs to answer questions about non-public data — internal knowledge bases, customer data, real-time information — RAG is the default architecture. Understanding its failure modes is the difference between a demo and a system people trust.

Recent changes
  • Apr 22, 2026: Added section on agentic RAG vs naive retrieval.