Claude SkillUpdated 48 days ago

LLM Cost + Quality Tuner

Profiles an LLM-powered feature for cost-per-call and quality, and finds where to cut without breaking outputs.

Download skill (.zip)Or download whole pack

What it does

Takes an LLM-powered feature (chatbot, agent, RAG, summarizer) and analyzes where money is spent (model choice, token volume, retry overhead, cache hit rate) vs. where quality matters. Outputs concrete cost-reduction levers with their quality impact, so you can cut bills 30–70% without breaking the feature.

When to use

✓LLM bill growing faster than usage
✓Switching from prototype model to production scale
✓Investigating cost spikes in one specific feature
✓Pre-launch cost projection for a new LLM feature

When not to use

✗Pre-product — quality matters more than cost at <100 users
✗Cost is fine, you're looking for quality wins — that's a different exercise

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/llm-cost-and-quality-tuner.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

---
name: llm-cost-and-quality-tuner
description: Use when reducing the cost of an LLM-powered feature without sacrificing quality. Triggers on "LLM cost", "AI bill", "token reduction", "cheaper model", or "switch from GPT-4".
---

# LLM Cost + Quality Tuner

LLM cost is a knob you turn, not a fact you accept. Most production LLM features can be 30–70% cheaper without measurable quality loss. The first step is honest profiling — most teams don't actually know where their money goes.

## Required inputs

1. **Current setup** — model, prompt template, max tokens, context size, retries
2. **Usage data** — calls/day, average input + output tokens, cache hit rate
3. **Quality baseline** — how you currently measure quality (eval set, user feedback, anything)
4. **Constraints** — latency requirements, data-residency, model-provider preferences
5. **Cost trend** — last 3 months by week

If you don't have a quality baseline, build one before tuning. Cutting cost without a quality check is how features silently degrade.

## Profile first

### Where the money goes
| Component | $ / month | % of total |
|---|---|---|
| Input tokens (prompt + context) | | |
| Output tokens | | |
| Failed calls + retries | | |
| Cache misses | | |
| Higher-tier model usage (when cheaper would suffice) | | |

### Where quality lives
| Quality dimension | Current score | Acceptable floor |
|---|---|---|
| Factual accuracy | | |
| Format / structure compliance | | |
| Latency | | |
| User satisfaction (qualitative) | | |

## Levers (ranked by typical ROI)

### Lever 1: Route by complexity
- Cheap model for 80% of queries (classification, lookup, simple Q&A)
- Premium model for 20% that actually need it
- Implementation: a small classifier in front, or rule-based routing on query type
- Typical savings: 40–60%
- Quality risk: if router misclassifies, easy queries get expensive treatment (fine) or hard queries get cheap treatment (bad). Build router eval set.

### Lever 2: Aggressive prompt compression
- Strip examples not needed for the task at hand
- Use system prompt + RAG context instead of stuffed examples
- Remove "please" / "kindly" / etc. — tokens, no quality impact
- Typical savings: 15–40%
- Quality risk: minimal if measured

### Lever 3: Cache hit rate
- Semantic cache for common queries
- Exact-match cache for deterministic ones
- TTL based on data freshness needs
- Typical savings: 20–50% on repeat-heavy workloads
- Quality risk: stale results — set TTL by freshness sensitivity

### Lever 4: Max-tokens cap
- Most outputs don't need 2000 tokens of headroom
- Cap by use case (summary = 200, response = 500, JSON = 800)
- Typical savings: 10–25% on output cost
- Quality risk: cutting off mid-response — add format check + truncation detection

### Lever 5: Context window discipline
- Don't stuff entire conversation history if last 5 turns suffice
- Don't stuff full RAG results if top-3 chunks suffice
- Summarize old context before re-using
- Typical savings: 20–50% on input cost
- Quality risk: lost context — measure on tasks that depend on long history

### Lever 6: Batch + async
- For non-real-time tasks (summarization, enrichment), batch APIs are 50% cheaper
- Typical savings: 50% on async workloads
- Quality risk: none

### Lever 7: Provider arbitrage / open-weights for the easy 80%
- Self-host or use a cheap provider for low-stakes inference
- Premium provider for high-stakes
- Typical savings: 30–80% on routed traffic
- Quality risk: provider behavior differences — eval before switching

## Output

A ranked plan:

```
## Top 3 levers by expected savings × ease
1. [Lever] — projected savings $X/mo — quality risk + mitigation — effort estimate
2. ...
3. ...

## Cumulative savings if all 3 implemented
$X/mo (Y% of current)

## Quality eval plan
Before / after on [eval set], with specific pass criteria. If any quality metric drops below floor, revert.

## Implementation sequence
1. Week 1: Lever 1 (highest savings, lowest implementation risk)
2. Week 2: Quality measurement + Lever 2
3. ...
```

## Anti-patterns

- Switching the entire system to a cheaper model without routing — you'll lose the hard queries
- Cutting prompt examples without re-measuring — examples often carry more weight than they look
- "We'll just self-host" without accounting for ops cost (often higher than the API bill at low scale)

Example prompts

Once installed, try these prompts in Claude:

Our chatbot bill is $40K/mo on GPT-4. Help me cut it 50% without breaking the experience. [paste sample traces, current prompt, cache stats]
New agent feature — projected cost is $X / user / month. How do I get that to $Y? [paste prompt chain]

Recent changes

May 26, 2026New skill — 7 ranked levers to cut LLM cost 30-70% without breaking quality.

HN X LinkedIn Reddit