Fine-tuning vs RAG vs prompting: which one fits your problem

The three ways to make a model behave better for your case: cost, persistence, updateability, when to use each, and when to mix them. Includes the decision matrix and the math for "is fine-tuning worth it."

9 min read·Updated May 27, 2026

These are the three ways to make an AI model behave better for your specific use case. They solve different problems, cost different amounts, and fail in different ways. Picking the wrong one is a common reason AI projects miss their goals.

The short version:

  • Prompting = tell the model what to do in the moment. Cheapest, fastest, weakest persistence.
  • RAG (retrieval-augmented generation) = look up relevant information at query time and include it in the prompt. Good for knowledge that changes, scales to large data, leaves the model itself unchanged.
  • Fine-tuning = update the model's weights by training it on your data. Strongest persistence, highest cost, hardest to update.

Most production AI systems combine at least two of them.

When to use prompting

Prompting is the default. You write better instructions, give examples, structure the prompt — and the model behaves better. No infrastructure, no training run, no data pipeline.
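
As a minimal sketch of "instructions, examples, structure", the helper below assembles a prompt from those three parts. The function name `build_prompt`, the section labels, and the support-ticket task are illustrative assumptions, not a required format.

```python
def build_prompt(instructions: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a structured prompt: instructions, worked examples, then the new input."""
    parts = ["Instructions:", instructions.strip(), ""]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {example_input}", f"Output: {example_output}", ""]
    # End on an open "Output:" so the model completes the pattern set by the examples.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


prompt = build_prompt(
    instructions="Classify the support ticket as 'billing', 'bug', or 'other'. Reply with one word.",
    examples=[
        ("I was charged twice this month.", "billing"),
        ("The export button crashes the app.", "bug"),
    ],
    query="How do I change my invoice address?",
)
print(prompt)
```

Changing behavior here means editing a string and re-running, which is why the iteration loop is measured in minutes.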

Use prompting when:

  • The task can be specified clearly in instructions
  • You can fit the relevant context in the prompt
  • The right behavior changes often (a hardcoded fine-tune would go stale)
  • You're iterating fast and don't want a 3-day training loop between attempts
  • The cost of getting it wrong is low

Prompting hits a wall when:

  • The instructions get so long they crowd out useful context
  • You need behavior the base model resists (specific tone, specific format with no examples to give)
  • You're sending the same long system prompt millions of times (cost adds up; consider caching or fine-tuning)

See "What makes a prompt work" for a deeper look at prompting itself.

When to use RAG

RAG addresses a specific gap: the model doesn't know your data. Either because your data is private (internal docs, customer records, your codebase), or because it's recent (newer than the training cutoff), or because there's too much of it to fit in any context window.

A RAG system:

  1. Stores your data as searchable chunks (usually with embeddings — see Embeddings explained)
  2. At query time, retrieves the chunks most relevant to the question
  3. Stuffs those chunks into the prompt as context
  4. Asks the LLM to answer using them
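
The four steps above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, the documents are invented, and step 4 (the actual LLM call) is left out because it depends on the provider you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list, k: int = 2) -> list:
    """Step 2: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_rag_prompt(question: str, chunks: list) -> str:
    """Step 3: place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Step 1: store the data as searchable chunks (hypothetical documents).
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday to Friday.",
    "Enterprise plans include single sign-on.",
]
index = [(chunk, embed(chunk)) for chunk in documents]

question = "When are refunds issued after purchase?"
prompt = build_rag_prompt(question, retrieve(question, index, k=1))
print(prompt)  # Step 4 would send this prompt to the LLM.
```

Note that the model never changes in this loop. Updating what the system "knows" means re-running step 1 on new documents.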

Use RAG when:

  • The model needs to answer based on your specific data (docs, knowledge base, transcripts, codebase)
  • That data changes often — updating a search index is easy; fine-tuning is hard
  • The data is too large to fit in any context window
  • You need citations — RAG can tell the user which source chunk the answer came from
  • You're dealing with regulated content where the LLM hallucinating an answer is unacceptable and falling back to "I don't have that information" is okay
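
The citation and fallback points can be sketched as follows. The chunk ids, scores, and the `min_score` threshold of 0.3 are hypothetical; a real system would take scores from its retriever and tune the threshold against its own evals.

```python
FALLBACK = "I don't have that information."


def grounded_prompt(question: str, scored_chunks: list, min_score: float = 0.3):
    """Return (prompt, source_ids); prompt is None when nothing relevant was retrieved."""
    relevant = [(cid, text) for cid, text, score in scored_chunks if score >= min_score]
    if not relevant:
        # Nothing trustworthy to ground on: skip the model and use the fallback.
        return None, []
    context = "\n".join(f"[{cid}] {text}" for cid, text in relevant)
    prompt = (
        "Answer using only the sources below and cite the source id in brackets. "
        f'If the sources do not contain the answer, reply "{FALLBACK}"\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return prompt, [cid for cid, _ in relevant]


chunks = [
    ("policy.md#2", "Refunds are issued within 14 days of purchase.", 0.82),
    ("faq.md#7", "Support is available Monday to Friday.", 0.12),
]
prompt, sources = grounded_prompt("When are refunds issued?", chunks)
print(sources)

no_prompt, no_sources = grounded_prompt("What is the CEO's salary?", [chunks[1]])
print(FALLBACK if no_prompt is None else no_prompt)
```

Returning the source ids alongside the prompt is what lets the application show the user which chunk an answer came from.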

RAG hits limits when:

  • The relevant information is scattered across many chunks (retrieval misses the big picture)
  • The task requires reasoning across many documents at once (not "look up X" but "synthesize X across 50 sources")
  • The chunks contain misleading or contradictory information and the model can't tell which to trust
  • Latency matters and the retrieval step adds 200–800ms

See "How RAG works, and when to use it" for the deeper RAG explanation.

When to use fine-tuning

Fine-tuning changes the model itself. You take a base model, train it further on your data, and the resulting model has internalized something about that data (style, knowledge, format, behavior).

Use fine-tuning when:

  • You need a specific behavior persistently — every output, no need to re-specify in the prompt
  • You have high-quality training data in the format you want the model to imitate (hundreds to thousands of examples minimum)
  • The base model resists prompting for your behavior — you've tried better prompts and they're not enough
  • You're running enough volume to justify the cost (training is an upfront expense, and hosting a fine-tuned model often costs more per token than calling the base model API)
  • Latency matters — a smaller fine-tuned model can be much faster than a frontier model + long prompt

Fine-tuning is the wrong choice when:

  • Your data changes often (you'd have to re-train repeatedly)
  • You can't get a clean set of input/output examples
  • You haven't yet pushed prompting to its limits (most fine-tuning needs go away with better prompts)
  • You want the model to know facts that change (use RAG instead; a fact baked into the weights is frozen at training time, and keeping it current is a maintenance nightmare)

A common heuristic: try prompting first, then RAG, then fine-tuning. Most teams stop at RAG.
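
The escalation heuristic can be written down as a small decision function. This is one reading of the guidance in this article, not a rule; the `Situation` fields and the returned labels are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    prompt_alone_meets_bar: bool   # measured on evals, not guessed
    needs_own_or_fresh_data: bool  # private, recent, or too large for the context window
    rag_meets_bar: bool            # quality with retrieval plus a careful prompt
    has_clean_examples: bool       # hundreds to thousands of input/output pairs


def recommend(s: Situation) -> str:
    """Walk the ladder: prompting first, then RAG, then fine-tuning."""
    if s.prompt_alone_meets_bar:
        return "prompting"
    if s.needs_own_or_fresh_data and s.rag_meets_bar:
        return "RAG + prompting"
    if s.has_clean_examples:
        return "fine-tuning"
    return "collect examples first"


print(recommend(Situation(False, True, True, False)))
```

The order of the checks is the point: each later option is only considered once the cheaper one has been tried and measured.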

The decision in one table

| Question | Prompting | RAG | Fine-tuning |
| Time to first result | Minutes | Hours | Days |
| Setup cost | $0 | $$ | $$$ |
| Per-query cost | $ | $$ (retrieval + bigger context) | $$ (often hosted infra cost) |
| Persistence of behavior | None (re-specify each time) | Behavior is in the data, retrieved per query | Baked into the model |
| Updateability | Edit the prompt | Update the index | Re-train |
| Scales with data size | No (context window limit) | Yes | Yes (but with diminishing returns) |
| Citation/attribution | Hard | Easy (per chunk) | Hard |
| Latency added | None | 200–800ms | None (potentially less than base model) |
| When data changes daily | Easy | Easy | Painful |
| Volume needed to justify | None | Modest | High |

When you need a mix

Most real systems combine techniques. A few common patterns:

RAG + careful prompting

The most common production setup. RAG handles your data; a well-tuned prompt handles tone, format, and behavior. Fine-tuning isn't needed.

Fine-tuned model + RAG context

For domains where the base model lacks vocabulary or default behavior (medical, legal, niche technical), a fine-tune gives the model fluency; RAG gives it the current facts.

Multiple prompts with routing

Many systems use one model (often a smaller, fine-tuned classifier) to decide which prompt or pipeline to send a query to, then a frontier model to produce the answer. Cheaper than running the frontier model on everything.
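
A rough sketch of the routing pattern follows. The keyword rules in `classify` are a stand-in for the small classifier model, and the route labels and system prompts are invented for illustration.

```python
def classify(query: str) -> str:
    """Stand-in for the small classifier model: simple keyword rules."""
    text = query.lower()
    if any(word in text for word in ("invoice", "refund", "charge")):
        return "billing"
    if any(word in text for word in ("error", "crash", "bug")):
        return "technical"
    return "general"


# Hypothetical system prompts, one per route.
ROUTES = {
    "billing": "You are a billing assistant. Quote policy text exactly.",
    "technical": "You are a support engineer. Ask for logs before guessing.",
    "general": "You are a helpful assistant for our product.",
}


def route(query: str) -> tuple:
    """Pick a pipeline with the cheap classifier, then build the prompt for the answering model."""
    label = classify(query)
    return label, f"{ROUTES[label]}\n\nUser: {query}"


label, prompt = route("The app shows an error when I export.")
print(label)
```

Only the prompt returned by `route` goes to the expensive model, so the classifier's cost is paid on every query while the frontier model's cost is paid only where it is needed.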

The cost-to-quality calculus

When debating fine-tuning, do this math first:

  • Volume: how many queries per month?
  • Cost of base model + RAG: $/query × volume = monthly bill
  • Cost of fine-tuned model: training cost (one-time, but periodically re-trained) + hosting cost + per-query cost
  • Quality delta: realistic, not best-case. Measured on actual evals, not vibes.

If the quality delta is large (e.g. the fine-tune is right 90% of the time vs. 60% for the prompt) and volume is meaningful, fine-tune. If the quality delta is small (e.g. 89% vs. 87%) and volume is low, the engineering and ops overhead of fine-tuning probably isn't worth it.

For projects under ~$10K/month in API cost, fine-tuning rarely pays back.
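
The cost side of that math can be sketched as a break-even calculation. Every number below is a made-up placeholder, and the model ignores engineering time and the quality delta, which still has to be measured separately.

```python
def monthly_cost_base(queries: int, cost_per_query: float) -> float:
    """Base model + RAG: $/query x volume."""
    return queries * cost_per_query


def monthly_cost_fine_tuned(queries: int, cost_per_query: float, hosting_per_month: float,
                            training_cost: float, months_between_retrains: int) -> float:
    """Fine-tuned model: per-query cost + hosting + training amortized over the re-train cycle."""
    amortized_training = training_cost / months_between_retrains
    return queries * cost_per_query + hosting_per_month + amortized_training


def break_even_queries(base_cost_per_query: float, ft_cost_per_query: float,
                       hosting_per_month: float, training_cost: float,
                       months_between_retrains: int):
    """Monthly volume above which the fine-tuned model is cheaper, or None if it never is."""
    saving_per_query = base_cost_per_query - ft_cost_per_query
    if saving_per_query <= 0:
        return None
    fixed_per_month = hosting_per_month + training_cost / months_between_retrains
    return fixed_per_month / saving_per_query


# Hypothetical inputs: $0.02 vs $0.005 per query, $6,000/month hosting,
# $6,000 per training run, re-trained every 3 months.
threshold = break_even_queries(0.02, 0.005, 6000, 6000, 3)
print(f"Break-even at roughly {threshold:,.0f} queries per month")
```

With these placeholder inputs the break-even lands at roughly 533,000 queries per month, which corresponds to a base-model bill of a little over $10K per month.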

What this looks like in 2026

The default stack for most production AI features:

  1. Frontier base model (Claude, GPT-4, Gemini)
  2. Careful system prompt (often heavily prompt-cached)
  3. RAG for any company-specific data
  4. Light fine-tuning only when prompting + RAG can't reach the quality bar

Heavy fine-tuning is increasingly reserved for cost-sensitive workloads (smaller fine-tuned model replacing a frontier one) or behavior the base model resists across every prompt you've tried.
