Developer pack
Claude Skill

RAG Architect

Designs a retrieval-augmented generation system end-to-end — chunking, embedding, retrieval, ranking, grounding — for your specific corpus.

What it does

Takes your knowledge corpus, expected query patterns, and quality bar, and produces a RAG architecture: chunking strategy, embedding model choice, vector store, retrieval method, re-ranking, and a prompt structure for grounding, plus an eval setup. The goal is to avoid the "we built RAG and it hallucinates" outcome that many first-time builders hit.

When to use

  • You're building an AI feature that needs to answer from your specific data
  • Your existing RAG pipeline returns plausible-sounding but wrong answers
  • You need source citations and the model is currently making them up

When not to use

  • You can fit your data in the context window — just stuff it in the prompt
  • You need to teach the model new behavior or vocabulary — that's fine-tuning territory, not RAG

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/rag-architect.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

---
name: rag-architect
description: Use when designing or debugging a RAG (retrieval-augmented generation) system. Triggers on "RAG", "retrieval", "vector search", "ground the LLM", "answer from our docs".
---

# RAG Architect

RAG is conceptually simple (retrieve relevant chunks, give them to the LLM, ask it to answer using them) and operationally complex (chunking strategy, embedding choice, retrieval tuning, re-ranking, grounding prompts). Most "our RAG doesn't work" complaints come from one of the five failure modes listed at the end of this file.

## Required inputs

1. **The corpus** — what data, how much, how often it changes
2. **Query patterns** — what questions users actually ask (real samples > guesses)
3. **Quality bar** — what counts as "right" (factual correctness? format adherence? citation required?)
4. **Latency budget** — how fast the response needs to be
5. **Hallucination tolerance** — is "I don't know" an acceptable answer when retrieval comes up empty?

## Architecture decisions

### Chunking
- **Sentence-level**: high precision, loses context
- **Paragraph-level**: middle ground, common default
- **Section-level**: keeps context, may exceed embedding-friendly sizes
- **Recursive with overlap**: hybrid, often the right call

Match chunk size to your corpus AND query type. Long answers need bigger chunks; fact lookups can use smaller ones.
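
As a rough illustration of "recursive with overlap", here is a minimal sketch. The function names, the separator order, and the character-based sizes are assumptions made for this example; a real pipeline would more likely count tokens and keep sentence punctuation.

```python
def split_recursive(text: str, max_chars: int, separators: tuple[str, ...]) -> list[str]:
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text.strip()] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces: list[str] = []
            for part in text.split(sep):  # note: the separator itself is dropped
                pieces.extend(split_recursive(part, max_chars, separators[i + 1:]))
            return pieces
    # No separator left to try: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_with_overlap(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Greedily merge pieces into chunks of at most max_chars, repeating the
    last `overlap` characters of each chunk at the start of the next one."""
    if not 0 <= overlap < max_chars - 1:
        raise ValueError("overlap must be >= 0 and smaller than max_chars - 1")
    # Reserve room so that overlap tail + space + piece still fits in max_chars.
    budget = max_chars - overlap - 1
    pieces = split_recursive(text, budget, ("\n\n", "\n", ". ", " "))
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Calling `chunk_with_overlap(document_text, max_chars=800, overlap=100)` returns a list of strings ready for embedding. Tune both numbers against your eval set rather than treating these defaults as recommendations.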

### Embedding model
- **Commercial APIs**: OpenAI text-embedding-3-small (cheap, fast), Voyage AI (often higher quality), Cohere Embed v3
- **Self-hosted**: BGE family, mxbai-embed-large
- Test on YOUR data: public benchmark rankings often don't transfer to a specific corpus. Use a held-out set of (query, correct source) pairs to measure recall@K.
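
Measuring recall@K on a held-out set can be sketched as follows. The `retrieve` callable is a hypothetical stand-in for whatever embedding-plus-search stack is under test; it is assumed to take a query and K and return ranked chunk IDs.

```python
from typing import Callable, Sequence


def recall_at_k(
    test_set: Sequence[tuple[str, str]],
    retrieve: Callable[[str, int], Sequence[str]],
    k: int = 10,
) -> float:
    """Fraction of queries whose correct source appears in the top-k results.

    test_set holds (query, correct_source_id) pairs.
    retrieve is a hypothetical function returning ranked chunk IDs.
    """
    if not test_set:
        raise ValueError("test_set is empty")
    hits = sum(
        1 for query, source_id in test_set
        if source_id in retrieve(query, k)[:k]
    )
    return hits / len(test_set)
```

Run the same test set against each candidate embedding model and compare the numbers; the model that wins on your data is not necessarily the one at the top of a public leaderboard.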

### Vector store
- **<100K vectors**: in-memory brute force is fine
- **100K-1M**: pgvector if you already run Postgres; Qdrant or Pinecone otherwise
- **>1M**: dedicated vector DB
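
For the small-corpus case, "in-memory brute force" can be as simple as the sketch below. It uses only the standard library to stay self-contained; the dict-based index and the function names are assumptions for illustration, and in practice a NumPy matrix product would do the same job faster.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(
    query_vec: list[float],
    index: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the top-k
    (chunk_id, score) pairs, best first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in index.items()
    )
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```

Every query scans the whole index, so cost grows linearly with corpus size, which is why this approach stops being attractive as the vector count climbs.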

### Retrieval
- Top-K: typically 5 to 10
- Add metadata filters (tenant, date, document type) — don't rely on embedding alone
- Hybrid search (BM25 + embedding) often beats embedding alone
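
One common way to combine a BM25 ranking with an embedding ranking is reciprocal rank fusion (RRF). The skill does not prescribe a fusion method, so RRF is an assumption here, as are the function names and the constant 60. The sketch also applies metadata filters after retrieval for brevity; most vector stores can apply them inside the query, which is usually preferable.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Merge several ranked ID lists: each list contributes 1 / (rrf_k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


def hybrid_search(
    bm25_ranking: list[str],
    dense_ranking: list[str],
    metadata: dict[str, dict[str, str]],
    filters: dict[str, str],
    top_k: int = 10,
) -> list[str]:
    """Drop chunks whose metadata does not match every filter, then fuse
    the keyword and embedding rankings and keep the top_k IDs."""
    def allowed(chunk_id: str) -> bool:
        meta = metadata.get(chunk_id, {})
        return all(meta.get(key) == value for key, value in filters.items())

    filtered = [
        [chunk_id for chunk_id in ranking if allowed(chunk_id)]
        for ranking in (bm25_ranking, dense_ranking)
    ]
    return reciprocal_rank_fusion(filtered)[:top_k]
```

A call such as `hybrid_search(bm25_ids, dense_ids, metadata, {"tenant": "acme"}, top_k=10)` would keep only that tenant's chunks and favor those ranked well by both retrievers.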

### Re-ranking
- Optional but often worth it: retrieve top-50, re-rank to top-5 with a cross-encoder
- Adds latency (roughly 200-500 ms) but often improves relevance substantially; confirm the gain on your eval set
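
The retrieve-wide, re-rank-narrow pattern can be sketched like this. `score_pair` is a hypothetical callable standing in for a cross-encoder, and `toy_overlap_score` is only a word-overlap placeholder so the example runs without a model; neither is part of any library.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[tuple[str, str]],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[str, str]]:
    """Re-score (chunk_id, text) candidates against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score_pair(query, c[1]), reverse=True)
    return ranked[:top_n]


def toy_overlap_score(query: str, text: str) -> float:
    """Placeholder scorer: share of query words that appear in the text.
    A real setup would call a cross-encoder model here instead."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0
```

In use, the first-stage retriever would return around 50 candidates and `rerank(query, candidates, score_pair, top_n=5)` would pass only the best five to the grounding prompt.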

### Grounding prompt
- Show the retrieved chunks with source labels
- Instruct: "Answer ONLY from these sources. If the answer isn't here, say so."
- Require citations: "Cite the source ID for each claim."
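
Putting those three points together, a grounding prompt might be assembled as in the sketch below. The function name and the bracketed source-label layout are assumptions for this example; the two instruction sentences are the ones quoted above.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Build a prompt that shows each (source_id, text) chunk under a label
    and instructs the model to answer only from those sources, with citations."""
    sources = "\n\n".join(f"[{source_id}]\n{text}" for source_id, text in chunks)
    return (
        "Answer ONLY from these sources. If the answer isn't here, say so.\n"
        "Cite the source ID for each claim.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )
```

Keeping the source IDs identical to the IDs in your index makes it possible to check afterwards that every cited ID was actually among the retrieved chunks.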

## Eval setup
Build a small set of (query, expected answer, expected source chunk) before launch. Run it before every config change. See `how-to-evaluate-llm-output` for the broader pattern.
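
A minimal end-to-end harness for such a set might look like the sketch below. `pipeline` is a hypothetical callable that returns the answer text plus the cited source IDs, and the case-insensitive substring match is a deliberately crude assumption; stricter or model-graded checks can be swapped in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    expected_answer: str   # a phrase the answer should contain
    expected_source: str   # ID of the chunk that should be cited


def run_eval(
    cases: list[EvalCase],
    pipeline: Callable[[str], tuple[str, list[str]]],
) -> dict[str, float]:
    """Run every case through the pipeline and report two rates:
    how often the expected source was cited, and how often the
    expected phrase appeared in the answer."""
    if not cases:
        raise ValueError("no eval cases")
    source_hits = 0
    answer_hits = 0
    for case in cases:
        answer, cited_ids = pipeline(case.query)
        source_hits += case.expected_source in cited_ids
        answer_hits += case.expected_answer.lower() in answer.lower()
    total = len(cases)
    return {
        "source_hit_rate": source_hits / total,
        "answer_match_rate": answer_hits / total,
    }
```

Recording both rates per config change shows whether a regression comes from retrieval (source hit rate drops) or from generation (answer match rate drops while sources stay correct).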

## 5 common failure modes
1. **Bad chunking** — chunks too small to contain the answer
2. **Wrong embedding model** — recall@10 is < 60% on your test set
3. **No metadata filtering** — irrelevant docs get retrieved
4. **Weak grounding prompt** — LLM ignores chunks and confabulates
5. **No eval** — can't tell if "fixes" improve anything

Example prompts

Once installed, try these prompts in Claude:

  • Design a RAG system for our 12,000 internal Confluence pages. Users will ask product questions. Quality bar: must cite source page, must say "I don't know" rather than hallucinate.
  • Our RAG returns chunks but the LLM ignores them and makes up answers. What's broken? [paste setup]