How RAG works, and when to use it
The pattern that grounds an LLM in your data: how each step works (chunking, embedding, retrieval, reranking, generation), when RAG wins over fine-tuning and long context, and the failure patterns most beginner systems hit. With a modern 2026 stack.
If you have ever built an AI feature that needs to answer questions about your own data — internal docs, customer records, a product catalog, a legal corpus — you have probably hit the same problem: the model is fluent about the world but knows nothing about your specific content. RAG is the standard solution to that problem.
This guide is the longer version of the RAG glossary entry. The glossary gives you the definition. This page tells you how each step actually works, when RAG is the right tool, when it is not, and the failure patterns most beginner systems run into.
What RAG does, in plain English
You have a body of documents the model has never seen. The user asks a question. RAG does three things:
- Find the chunks most relevant to the question. Not the whole corpus. Just the parts that matter.
- Paste those chunks into the prompt. As "context" before the user's question.
- Let the model write the answer using those chunks. With instructions like "answer only from the provided context."
That is the whole pattern. Everything else — vector databases, embedding models, reranking — is implementation detail that makes step 1 work well at scale.
How each step works
Step 1: Chunking
You cannot feed a 500-page PDF to an embedding model. You split it into chunks first.
The standard default is ~500 tokens per chunk with ~50 tokens of overlap. Smaller chunks (200 tokens) give precise retrieval but lose context. Larger chunks (1500 tokens) carry context but dilute the embedding's signal. Overlap means the same sentence appears at the end of one chunk and the start of the next, so information that falls on a boundary still appears whole in at least one chunk.
Where beginners get this wrong: chunking blindly by character count, ignoring document structure. A contract chunked by 500 characters can split a clause across three chunks. Better: chunk by section, paragraph, or heading where possible, then fall back to character count for long sections.
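A minimal sketch of structure-first chunking with an overlapping fallback, under two loud assumptions: blank lines stand in for section or paragraph boundaries, and whitespace-separated words stand in for tokens (a real system would count tokens with the embedding model's tokenizer). The function names are illustrative, not from any library.

```python
import re

def chunk_words(words, size=500, overlap=50):
    """Sliding window over a word list; consecutive chunks share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def chunk_document(text, size=500, overlap=50):
    """Split on blank lines first; window only the paragraphs that are too long."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= size:
            chunks.append(" ".join(words))  # short paragraph stays whole
        else:
            chunks.extend(chunk_words(words, size, overlap))
    return chunks
```

Short paragraphs stay intact, so a clause is only split when it is longer than the chunk size on its own.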
Step 2: Embedding
You convert each chunk into a vector — a list of 768 to 3,072 floating-point numbers — using an embedding model. The embedding represents the chunk's meaning, roughly. Chunks with similar meaning end up close together in vector space.
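"Close together" is usually measured with cosine similarity. A small sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these numbers are invented for illustration, not produced by any model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three chunks (hypothetical values).
refund_policy = [0.9, 0.1, 0.0]
return_window = [0.8, 0.2, 0.1]
api_pagination = [0.0, 0.1, 0.9]

similar = cosine_similarity(refund_policy, return_window)
unrelated = cosine_similarity(refund_policy, api_pagination)
```

Here `similar` comes out much higher than `unrelated`, which is the property retrieval relies on.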
The 2026 defaults: OpenAI's text-embedding-3-large or Voyage's voyage-3 for English production work. Cohere's embed-english-v3 is also competitive. For multilingual content, use Cohere's embed-multilingual-v3 or OpenAI's text-embedding-3-large.
You embed every chunk in your corpus once, store the vectors, and never re-embed unless the chunk changes or you switch models. Re-embedding 5 million chunks costs real money — pick a model you can live with for a year.
Step 3: Storage and retrieval
You need somewhere to put millions of vectors and ask "give me the 20 closest to this query vector" in milliseconds. That is what vector databases do.
Practical 2026 choices:
- pgvector — if you already use Postgres and have under ~1 million vectors. Cheapest option. Performance is fine.
- Pinecone — managed, fastest to set up, costs more. Good if you don't want to operate a database.
- Qdrant or Weaviate — self-hosted production workloads. Good control, more operational burden.
- Chroma — prototyping and small projects. Not production-ready at scale.
At query time, you embed the user's question with the same model you used for the corpus, then ask the database for the top-K nearest vectors.
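What the database does at query time can be sketched as a brute-force scan. This is an illustration only: real vector databases use approximate nearest-neighbor indexes rather than scoring every vector, and the `index` dictionary and its toy vectors are assumptions made for the example.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k(query_vec, index, k=20):
    """index maps chunk_id -> normalized vector. Returns [(chunk_id, score)], best first."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), chunk_id)
        for chunk_id, vec in index.items()
    )
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]

# Hypothetical corpus of three chunks, embedded once and stored normalized.
index = {
    "refunds": normalize([0.9, 0.1, 0.0]),
    "pagination": normalize([0.0, 0.1, 0.9]),
    "returns": normalize([0.8, 0.2, 0.1]),
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
```

The query vector must come from the same embedding model as the stored vectors, otherwise the scores are meaningless.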
Step 4: Hybrid search
Pure vector search has a known weakness: it is bad at exact-match queries. If the user types error E-1042, the embedding might happily return articles about other error codes that "feel similar."
The fix is hybrid search: combine the vector similarity score with a keyword score (usually BM25). You retrieve the top-N from each, then merge the rankings. Modern vector DBs (Qdrant, Weaviate, recent Pinecone, pgvector with tsvector) support this natively.
Hybrid search is not optional in production. Almost every RAG failure on technical queries (codes, IDs, citations) traces back to skipping it.
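The text above does not name a merging method. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as what any particular database does internally. It only needs the two ranked lists, not comparable scores. The document IDs are invented, and `k=60` is the conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the query "error E-1042".
vector_hits = ["E-1040-guide", "E-1042-fix", "error-overview"]   # semantic neighbors
keyword_hits = ["E-1042-fix", "release-notes"]                   # BM25 exact matches
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused[0] == "E-1042-fix": it appears in both lists, so it outranks the rest
```

A document that shows up in both lists rises to the top even if neither retriever ranked it first on its own.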
Step 5: Reranking
You retrieved 20 chunks. The model only gets the top 5. How do you pick which 5?
Reranking runs a second, more expensive model over the 20 retrieved candidates and re-scores them based on the question. Cross-encoder models (Cohere's rerank-3, BGE rerankers) read each candidate alongside the question and output a relevance score. The top 5 by reranker score are what you send to the LLM.
Skipping reranking is the single most common reason beginner RAG systems feel "almost there but not quite." The retriever is fast and approximate; the reranker is slow and precise. Use both.
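The retrieve-then-rerank shape can be sketched with a pluggable scoring function. The `overlap_score` function below is a deliberately crude placeholder so the example runs without a model; it is not a cross-encoder. In a real system `score_fn` would call a reranker such as the ones named above.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """candidates: list of (chunk_id, text). Returns the top_n chunk IDs by score_fn."""
    scored = [(score_fn(question, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]

def overlap_score(question, text):
    """Placeholder scorer: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

# Hypothetical retrieved candidates.
candidates = [
    ("a", "shipping times for europe"),
    ("b", "refund policy for annual plans"),
    ("c", "annual plans pricing"),
]
best = rerank("refund policy annual plans", candidates, overlap_score, top_n=2)
```

The key design point is that the scorer sees the question and the candidate together, which the first-stage retriever never does.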
Step 6: Generation
The final step. You give the LLM a system prompt that says something like:
You answer questions using only the provided context. If the context
does not contain the answer, say "I don't have that in the provided
sources." Cite the source [1], [2] etc. in your answer.
Context:
[chunk 1]
[chunk 2]
[chunk 3]
...
Question: <user question>
The model writes the answer. With explicit "answer only from context" instructions, it is much more likely to cite its sources and stay grounded. Without those instructions, it can pull from training data and contradict your documents.
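Assembling that prompt is plain string formatting. A sketch, assuming the reranked chunks arrive as a list of strings. The function name is illustrative, and the call to the LLM itself is left out because it depends on your provider.

```python
SYSTEM_PROMPT = (
    "You answer questions using only the provided context. If the context "
    "does not contain the answer, say \"I don't have that in the provided "
    "sources.\" Cite the source [1], [2] etc. in your answer."
)

def build_user_message(question, chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question}"

message = build_user_message(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Annual plans are prorated."],
)
```

Keeping the chunk numbers stable lets you map a citation like [2] in the answer back to the source chunk afterwards.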
When to use RAG
RAG is the right tool when:
- The data is bigger than the model's context window, or would be if you included everything potentially relevant.
- You need citations showing which source produced which claim.
- The data changes more often than you would want to retrain a model.
- You want per-user or per-tenant scoping — RAG retrieves only the chunks the user has permission to see.
- The data is not in the model's training set (internal docs, customer data, fresh content).
Most "chat with your docs" products are RAG. Most internal knowledge bots are RAG. Most customer support copilots are RAG.
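The per-user and per-tenant scoping point can be sketched as a filter applied before any similarity scoring. The field names `tenant_id` and `allowed_groups` are assumptions for illustration. In practice most vector databases let you express this as a metadata filter on the query itself.

```python
def visible_chunks(chunks, tenant_id, user_groups):
    """Keep only chunks from the user's tenant that share a group with the user."""
    return [
        c for c in chunks
        if c["tenant_id"] == tenant_id and c["allowed_groups"] & user_groups
    ]

# Hypothetical stored chunks with access metadata.
chunks = [
    {"id": "hr-1", "tenant_id": "acme", "allowed_groups": {"hr"}},
    {"id": "eng-1", "tenant_id": "acme", "allowed_groups": {"eng", "hr"}},
    {"id": "eng-2", "tenant_id": "globex", "allowed_groups": {"eng"}},
]
searchable = visible_chunks(chunks, tenant_id="acme", user_groups={"eng"})
```

Filtering before retrieval means a chunk the user cannot see never reaches the prompt, so the model cannot leak it.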
When NOT to use RAG
RAG is not always the right answer.
- The whole corpus fits comfortably in context. A 10-page handbook does not need retrieval. Just paste it.
- You need to teach the model behavior or style, not facts. A model that needs to write in your house style is a fine-tuning problem, not a RAG problem.
- The task does not depend on external knowledge. Math, code transformation, structured extraction — these need a good prompt, not retrieval.
- The data is structured and you have a search index already. If you can answer the question with a SQL query or a Postgres lookup, do that. RAG over structured data is often worse than just querying the structured data.
RAG vs the alternatives
RAG vs fine-tuning
Fine-tuning bakes the data into model weights. Pros: no retrieval step, faster at inference. Cons: re-training is expensive, the data is locked in, you cannot delete or update specific facts cleanly, you get no citations, and you cannot scope by user permission.
Use fine-tuning for: tone, style, format, behavior. "Always respond in three bullets." "Never recommend competitors." "Use formal Swedish."
Use RAG for: knowledge. "What is our refund policy?" "What did this contract say in Section 4?" "How does our API handle pagination?"
The two are not exclusive. Production systems often fine-tune for tone and RAG for knowledge.
RAG vs long context
Modern frontier models support 200K to 1M tokens of context. So why not just paste everything?
Three reasons RAG still wins for most real corpora:
- Cost and latency. Sending 500K tokens per query is expensive and slow. RAG sends maybe 5K, which is 100x fewer input tokens per query, usually with comparable answer quality.
- Noise dilutes attention. Even with a huge context, models do worse when 99% of the context is irrelevant. RAG filters to the 1% that matters.
- Audit trails. RAG gives you "the model answered using these specific chunks." Long-context gives you "the model read everything and answered."
Long context wins when the corpus is small enough to fit comfortably, when every word might matter, or when the question requires reasoning across the entire document (not just retrieving facts from it).
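The cost argument above is mostly arithmetic on input tokens. A worked sketch using the 500K and 5K figures from the text. The price and the query volume are hypothetical placeholders, so substitute your provider's actual rate.

```python
def input_cost(tokens_per_query, queries, price_per_million_tokens):
    """Total input-token cost in the same currency as the price."""
    return tokens_per_query * queries * price_per_million_tokens / 1_000_000

PRICE = 3.00      # hypothetical USD per million input tokens
QUERIES = 1_000   # hypothetical daily query volume

long_context_cost = input_cost(500_000, QUERIES, PRICE)  # 1500.0
rag_cost = input_cost(5_000, QUERIES, PRICE)             # 15.0
ratio = long_context_cost / rag_cost                     # 100.0
```

The ratio does not depend on the price, only on how many tokens each approach sends. Output tokens and retrieval infrastructure are not included in this estimate.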
RAG vs plain search
A search engine returns documents. RAG returns answers.
If your users want to find the right document and read it themselves, you want search. If they want a direct answer to a question, with the document as supporting evidence, you want RAG.
In practice many products do both: a search box that returns results, and an "ask AI" button that runs a RAG flow over the same index.
Common mistakes in beginner RAG systems
A short list of patterns that cause RAG demos to fail in production:
- No hybrid search. Pure vector search misses exact-match queries. Add BM25 from day one.
- No reranker. Top-K retrieval is approximate. Without a reranker, you send mediocre chunks to the LLM and the LLM does its best with them — confidently.
- Chunks too big or too small. 500 tokens with 50-token overlap is a reasonable default. Tune from there.
- Chunking without document structure. Splitting by character count alone destroys legal documents, technical specs, and anything with hierarchical headings.
- No "answer only from context" instruction. Without it, the model mixes retrieved content with training data and you lose grounding.
- No "I don't know" path. If the model is not told it can say "the context does not contain this," it confabulates. Always allow the refusal.
- No metadata filtering. Searching the whole corpus when the user is asking about a specific product, time period, or team wastes retrieval slots on irrelevant chunks.
- Same embedding for query and corpus, but no query rewriting. User queries are short and underspecified. Document chunks are long and detailed. The embedding space matches them poorly. Query rewriting (asking an LLM to expand the question first) helps.
- Re-embedding the corpus every deploy. Embeddings cost money. Embed once, store, only re-embed when the chunk content changes.
- No evaluation. "It seems to work in the demo" is not a quality signal. Build a small eval set of 50 questions with known correct answers and run it on every change.
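The evaluation item in the list above can start as small as this. A sketch, assuming each eval case has a question and a short expected string, and that `answer_fn` is whatever function runs your RAG pipeline end to end. Substring matching is a crude check and is only a starting point.

```python
def run_eval(eval_set, answer_fn):
    """eval_set: list of {"question", "expected"}. Returns (accuracy, failures)."""
    failures = []
    for case in eval_set:
        answer = answer_fn(case["question"])
        if case["expected"].lower() not in answer.lower():
            failures.append((case["question"], answer))
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

# Hypothetical eval cases and a stub pipeline standing in for the real one.
eval_set = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
]

def stub_pipeline(question):
    if "refund" in question:
        return "Refunds are accepted within 30 days [1]."
    return "I don't have that in the provided sources."

accuracy, failures = run_eval(eval_set, stub_pipeline)
```

Running this on every change and reading the `failures` list tells you which specific questions regressed.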
A minimal modern RAG stack
If you are building a new RAG system in 2026, this is a reasonable default:
- Documents → chunked by section/paragraph, ~500 tokens, 50-token overlap
- Embeddings → OpenAI text-embedding-3-large or Voyage voyage-3
- Storage → pgvector if small (< 1M vectors), Pinecone or Qdrant if larger
- Retrieval → hybrid search (BM25 + vectors), top-20
- Reranking → Cohere rerank-3 or BGE reranker, top-5
- Generation → Claude Sonnet 4.6 or GPT-5-class, with "answer only from context" system prompt
- Evaluation → small hand-built eval set + automated re-run on every change
Build this stack, watch where it fails on your data, and fix the specific failure rather than swapping components randomly.
Where to go next
- Short definition + the 2026 modern stack as a quick reference: /glossary/rag
- Related concepts: vector databases, semantic search, chunking, reranking, hybrid search
- If you are picking an embedding model: Picking your daily AI covers model selection logic (the same logic applies to embedding model choice)
- If you are verifying the answers your RAG system produces: How to verify AI output before you trust it