All terms
Training & adaptation
Fine-tuning
Also known as: SFT, supervised fine-tuning, instruction tuning
Continuing training on a pre-trained model with a smaller, curated dataset to adapt it to a specific task, domain, or style.
What it means
Fine-tuning is what you do when prompting alone isn't enough. You take an already-trained model (Llama, Mistral, Qwen, GPT-4o-mini via the API) and run more training on a curated dataset — usually a few thousand to a few million examples — that teaches it your task, your domain vocabulary, or your preferred output format. The base model's weights shift slightly to favor your data without forgetting everything else.
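To make "a curated dataset" concrete: supervised fine-tuning data is usually a list of prompt/response pairs stored as JSONL, one example per line. The sketch below uses the chat-message JSONL layout expected by OpenAI's fine-tuning API; the clause texts, labels, and the validate helper are invented for illustration.

```python
import json

# Two invented training examples for a contract-clause classifier,
# in the chat-message JSONL layout used by OpenAI's fine-tuning API.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the contract clause."},
        {"role": "user", "content": "Either party may terminate with 30 days notice."},
        {"role": "assistant", "content": "termination"},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify the contract clause."},
        {"role": "user", "content": "Fees are due within 45 days of invoice."},
        {"role": "assistant", "content": "payment"},
    ]},
]

def validate(example):
    """Minimal sanity check: every message has a role and content,
    and the last message is the assistant's target output."""
    msgs = example["messages"]
    assert all({"role", "content"} <= set(m) for m in msgs)
    assert msgs[-1]["role"] == "assistant"
    return True

with open("train.jsonl", "w") as f:
    for ex in examples:
        assert validate(ex)
        f.write(json.dumps(ex) + "\n")
```

Real datasets for this kind of task run to thousands of lines like these; the format stays the same.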
In practice, most teams should not fine-tune. A good prompt plus RAG plus a frontier model beats a fine-tune of a smaller model in the vast majority of business cases — and it's cheaper and faster to iterate on. Fine-tuning makes sense when you have (a) a narrow, repetitive task, (b) thousands of high-quality labeled examples, and (c) latency or cost requirements that rule out frontier APIs. Classic fits: classification, structured extraction, style mimicry, on-device assistants.
Modern fine-tuning is rarely "full fine-tuning" anymore — touching every parameter is expensive and risks destroying the base model's general ability. Instead, almost everyone uses LoRA or one of its variants, which trains tiny low-rank adapter matrices and leaves the original weights frozen. OpenAI, Anthropic (via Claude fine-tuning on Amazon Bedrock), and most open-model hosts offer adapter-based fine-tuning as the default.
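The LoRA idea in miniature: instead of updating a weight matrix W directly, train a low-rank pair A and B and add their scaled product to the frozen W, so the effective weight is W + (alpha / r) · BA. A dependency-free pure-Python sketch with toy dimensions (no real training loop; the names and numbers are illustrative, not from any library):

```python
# Toy LoRA: W stays frozen; only the low-rank factors A (r x d_in) and
# B (d_out x r) are trainable. Effective weight: W + (alpha / r) * B @ A.
# Plain lists of lists so the sketch needs no dependencies.

def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

d_out, d_in, r, alpha = 3, 4, 2, 4  # in practice r << min(d_out, d_in)

W = [[0.1] * d_in for _ in range(d_out)]  # frozen pre-trained weight
B = [[0.0] * r for _ in range(d_out)]     # B initialized to zero, so the
A = [[0.01] * d_in for _ in range(r)]     # adapter starts as a no-op

def effective_weight(W, B, A, alpha, r):
    delta = matmul(B, A)                  # rank-r update, shape d_out x d_in
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Before any training, B = 0, so the model behaves exactly like the base model.
assert effective_weight(W, B, A, alpha, r) == W

# "Training" nudges only A and B; W itself never changes.
B[0][0] = 0.5
W_eff = effective_weight(W, B, A, alpha, r)
```

Note the parameter savings: full fine-tuning of this layer trains d_out × d_in weights, while LoRA trains only r × (d_out + d_in) — the gap is what makes fine-tuning large models affordable.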
Example
A legal-tech startup fine-tunes Llama 3.1 8B on 50,000 contract-clause classification examples. The fine-tuned model hits 96% accuracy at one-twentieth the cost of calling GPT-5 for the same task — and runs on their own GPUs.
Why it matters
Fine-tuning is the most overhyped tool in the AI builder's box. People reach for it before they've actually tried prompting properly, and end up with worse models, more ops burden, and behavior that can only be changed by another training run. Knowing when fine-tuning is and isn't the right answer is one of the higher-leverage skills in applied AI.