Fine-tuning vs RAG vs prompting: which one fits your problem
The three ways to make a model behave better for your case — cost, persistence, updateability, when to use each, and when to mix them. With the decision matrix and the math for "is fine-tuning worth it."
These are the three main ways to make an AI model behave better for your specific use case. They solve different problems, cost different amounts, and fail in different ways. Picking the wrong one is a common reason AI projects miss their goals.
The short version:
- Prompting = tell the model what to do in the moment. Cheapest, fastest, weakest persistence.
- RAG (retrieval-augmented generation) = look up relevant information at query time and include it in the prompt. Good for knowledge that changes, scales to large data, leaves the model itself unchanged.
- Fine-tuning = update the model's weights by training it on your data. Strongest persistence, highest cost, hardest to update.
Most production AI systems combine at least two of these, and some use all three.
When to use prompting
Prompting is the default. You write better instructions, give examples, structure the prompt — and the model behaves better. No infrastructure, no training run, no data pipeline.
Use prompting when:
- The task can be specified clearly in instructions
- You can fit the relevant context in the prompt
- The right behavior changes often (a hardcoded fine-tune would go stale)
- You're iterating fast and don't want a 3-day training loop between attempts
- The cost of getting it wrong is low
Prompting hits a wall when:
- The instructions get so long they crowd out useful context
- You need behavior the base model resists (specific tone, specific format with no examples to give)
- You're sending the same long system prompt millions of times (cost adds up; consider caching or fine-tuning)
See What makes a prompt work for a deeper look at prompting itself.
When to use RAG
RAG addresses a specific gap: the model doesn't know your data. Either because your data is private (internal docs, customer records, your codebase), or because it's recent (newer than the training cutoff), or because there's too much of it to fit in any context window.
A RAG system:
- Stores your data as searchable chunks (usually with embeddings — see Embeddings explained)
- At query time, retrieves the chunks most relevant to the question
- Stuffs those chunks into the prompt as context
- Asks the LLM to answer using them
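The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, and `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical names introduced here. The final call to an LLM is left out, since it depends on which provider you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 2: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Steps 3 and 4: put the retrieved chunks in the prompt and ask for a grounded answer."""
    sources = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(question, chunks, k))
    )
    return (
        "Answer using only the numbered sources below and cite them by number. "
        "If they do not contain the answer, say \"I don't have that information.\"\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Step 1: the stored chunks (hypothetical example data).
CHUNKS = [
    "Refunds are processed within 5 business days of approval.",
    "The office is closed on public holidays.",
    "Refund requests must be submitted within 30 days of purchase.",
]

if __name__ == "__main__":
    print(build_prompt("When are refunds processed?", CHUNKS, k=1))
```

Numbering the sources in the prompt is what makes per-chunk citation possible. Note that the toy `embed` misses the third chunk because "refund" and "refunds" are different tokens; a real embedding model is used precisely to catch that kind of near-match.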
Use RAG when:
- The model needs to answer based on your specific data (docs, knowledge base, transcripts, codebase)
- That data changes often — updating a search index is easy; fine-tuning is hard
- The data is too large to fit in any context window
- You need citations — RAG can tell the user which source chunk the answer came from
- You're dealing with regulated content where a hallucinated answer is unacceptable and falling back to "I don't have that information" is acceptable
RAG hits limits when:
- The relevant information is scattered across many chunks (retrieval misses the big picture)
- The task requires reasoning across many documents at once (not "look up X" but "synthesize X across 50 sources")
- The chunks contain misleading or contradictory information and the model can't tell which to trust
- Latency matters and the retrieval step adds 200–800ms
See How RAG works, and when to use it for the deeper RAG explanation.
When to use fine-tuning
Fine-tuning changes the model itself. You take a base model, train it further on your data, and the resulting model has internalized something about that data (style, knowledge, format, behavior).
Use fine-tuning when:
- You need a specific behavior persistently — every output, no need to re-specify in the prompt
- You have high-quality training data in the format you want the model to imitate (hundreds to thousands of examples minimum)
- The base model resists prompting for your behavior — you've tried better prompts and they're not enough
- You're running enough volume to justify the cost (training and dedicated hosting add fixed costs, so at low volume a fine-tuned model often costs more per token than the base model API)
- Latency matters — a smaller fine-tuned model can be much faster than a frontier model + long prompt
Fine-tuning is the wrong choice when:
- Your data changes often (you'd have to re-train repeatedly)
- You can't get a clean set of input/output examples
- You haven't yet pushed prompting to its limits (most fine-tuning needs go away with better prompts)
- You want the model to know facts that change (use RAG instead: facts baked into the weights go stale, and every correction needs a re-train)
A common heuristic: try prompting first, then RAG, then fine-tuning. Most teams stop at RAG.
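As a rough sketch, the heuristic and the "use when" lists above can be condensed into a small checklist function. The field names and the `recommend` function are hypothetical, and the conditions are a simplification of the lists in this guide, not a substitute for running real evals.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical yes/no answers about a use case."""
    needs_own_data: bool                # private, recent, or company-specific data
    data_fits_in_prompt: bool           # can the relevant context fit in the prompt?
    quality_ok_without_finetune: bool   # do prompting (+ RAG) reach the quality bar?
    has_clean_examples: bool            # hundreds to thousands of input/output pairs
    high_volume: bool                   # enough queries to justify training and hosting


def recommend(p: Problem) -> list[str]:
    """Apply the 'prompting first, then RAG, then fine-tuning' order."""
    stack = ["prompting"]  # prompting is always the starting point
    if p.needs_own_data and not p.data_fits_in_prompt:
        stack.append("rag")
    if not p.quality_ok_without_finetune and p.has_clean_examples and p.high_volume:
        stack.append("fine-tuning")
    return stack


if __name__ == "__main__":
    docs_bot = Problem(
        needs_own_data=True,
        data_fits_in_prompt=False,
        quality_ok_without_finetune=True,
        has_clean_examples=False,
        high_volume=False,
    )
    print(recommend(docs_bot))
```

The function only ever adds techniques on top of prompting, which mirrors the order of the heuristic: each step is taken only when the cheaper one before it falls short.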
The decision in one table
| Question | Prompting | RAG | Fine-tuning |
|---|---|---|---|
| Time to first result | Minutes | Hours | Days |
| Setup cost | $0 | $$ | $$$ |
| Per-query cost | $ | $$ (retrieval + bigger context) | $$ (often hosted infra cost) |
| Persistence of behavior | None — re-specify each time | Behavior is in the data, retrieved per query | Baked into the model |
| Updateability | Edit the prompt | Update the index | Re-train |
| Scales with data size | No (context window limit) | Yes | Yes (but with diminishing returns) |
| Citation/attribution | Hard | Easy (per chunk) | Hard |
| Latency added | None | 200–800ms | None (potentially less than base model) |
| When data changes daily | Easy | Easy | Painful |
| Volume needed to justify | None | Modest | High |
When you need a mix
Most real systems combine techniques. A few common patterns:
RAG + careful prompting
The most common production setup. RAG handles your data; a well-tuned prompt handles tone, format, and behavior. Fine-tuning isn't needed.
Fine-tuned model + RAG context
For domains where the base model lacks vocabulary or default behavior (medical, legal, niche technical), a fine-tune gives the model fluency; RAG gives it the current facts.
Multiple prompts with routing
Many systems use one model (often a smaller, fine-tuned classifier) to decide which prompt or pipeline to send a query to, then a frontier model to produce the answer. Cheaper than running the frontier model on everything.
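A minimal sketch of the routing pattern, under stated assumptions: `classify` uses keyword matching as a stand-in for the small classifier model, and the route names, prompts, and `Route` type are hypothetical. In a real system the selected route would then drive the retrieval step and the call to the answering model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One pipeline a query can be sent to (hypothetical structure)."""
    name: str
    system_prompt: str
    use_retrieval: bool


ROUTES = {
    "billing": Route("billing", "You are a billing assistant. Be precise about amounts.", True),
    "general": Route("general", "You are a helpful assistant.", False),
}

BILLING_TERMS = ("invoice", "refund", "charge", "billing")


def classify(query: str) -> str:
    """Stand-in for a small classifier model: simple substring matching."""
    lowered = query.lower()
    return "billing" if any(term in lowered for term in BILLING_TERMS) else "general"


def route(query: str) -> Route:
    """Pick the prompt and pipeline for this query."""
    return ROUTES[classify(query)]


if __name__ == "__main__":
    chosen = route("Why was I charged twice this month?")
    print(chosen.name, chosen.use_retrieval)
```

Keeping the routes in a plain dictionary means adding a pipeline is a data change, while swapping the keyword matcher for a trained classifier only touches `classify`.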
The cost-to-quality calculus
When debating fine-tuning, do this math first:
- Volume: how many queries per month?
- Cost of base model + RAG: $/query × volume = monthly bill
- Cost of fine-tuned model: training cost (one-time, but periodically re-trained) + hosting cost + per-query cost
- Quality delta: realistic, not best-case. Measured on actual evals, not vibes.
If the quality delta is large (e.g. the fine-tune is right 90% of the time vs 60% for prompting) and volume is meaningful, fine-tune. If the quality delta is small (e.g. 89% vs 87%) and volume is low, the engineering and ops overhead of fine-tuning probably isn't worth it.
For projects under ~$10K/month in API cost, fine-tuning rarely pays back.
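The cost side of that math can be sketched as a break-even calculation. All numbers in the usage example are made-up placeholders, the function names are hypothetical, and the model assumes training cost is spread evenly across the year; plug in your own quotes and measured volumes.

```python
from typing import Optional


def monthly_cost_base(queries: int, cost_per_query: float) -> float:
    """Base model + RAG: $/query x volume."""
    return queries * cost_per_query


def monthly_cost_finetuned(
    queries: int,
    cost_per_query: float,
    hosting_per_month: float,
    training_cost: float,
    retrains_per_year: int,
) -> float:
    """Fine-tuned model: per-query cost + hosting + training amortized per month."""
    amortized_training = training_cost * retrains_per_year / 12
    return queries * cost_per_query + hosting_per_month + amortized_training


def break_even_queries(
    base_cost_per_query: float,
    ft_cost_per_query: float,
    hosting_per_month: float,
    training_cost: float,
    retrains_per_year: int,
) -> Optional[float]:
    """Monthly volume at which the fine-tune matches the base model on cost.

    Returns None when the fine-tune has no per-query saving, in which case
    it never breaks even on cost and must be justified by quality alone.
    """
    saving = base_cost_per_query - ft_cost_per_query
    if saving <= 0:
        return None
    fixed = hosting_per_month + training_cost * retrains_per_year / 12
    return fixed / saving


if __name__ == "__main__":
    # Placeholder figures: $0.02 vs $0.005 per query, $1,500/month hosting,
    # $3,000 per training run, re-trained 4 times a year.
    volume = break_even_queries(0.02, 0.005, 1500.0, 3000.0, 4)
    print(f"Break-even at about {volume:,.0f} queries/month")
```

With these placeholder figures the fixed costs come to $2,500 per month and the saving is $0.015 per query, so break-even lands at roughly 167,000 queries per month. This covers cost only; the quality delta still has to come from your evals.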
What this looks like in 2026
The default stack for most production AI features:
- Frontier base model (Claude, GPT-4, Gemini)
- Careful system prompt (often heavily prompt-cached)
- RAG for any company-specific data
- Light fine-tuning only when prompting + RAG can't reach the quality bar
Heavy fine-tuning is increasingly reserved for cost-sensitive workloads (smaller fine-tuned model replacing a frontier one) or behavior the base model resists across every prompt you've tried.
What to read next
- How RAG works, and when to use it — the deeper dive on the most common choice
- Embeddings explained — the mechanism RAG retrieval uses
- LLM Cost + Quality Tuner — the structured exercise for the cost side of this decision
Next in this pillar
How AI agents work (and where they break)