Claude Skill

SLO + Error Budget Designer

Defines SLOs and error budgets for a service so reliability becomes a measurable trade-off, not a vibe.


What it does

Takes a service description, traffic patterns, and business criticality, and produces SLO definitions (SLI, target, measurement window), error budget allocation, and the alerting + policy rules tied to budget burn. Forces explicit choices about how reliable is reliable enough — instead of "everything must be 100%."

When to use

✓ New service going to prod and you need SLOs before it pages someone at 3am
✓ Existing service where reliability has become a moving target
✓ After an incident where unclear SLOs made "is this a real issue" debatable

When not to use

✗ Pre-prototype services with no users yet: SLOs are premature
✗ A reliability problem the team already knows how to fix: just fix it, don't wrap it in process

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/slo-and-error-budget-designer.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

---
name: slo-and-error-budget-designer
description: Use when defining or revising SLOs and error budgets for a service. Triggers on "SLO", "SLI", "error budget", "reliability target", "alerting policy".
---

# SLO + Error Budget Designer

The goal of an SLO is not to be aspirational — it's to be the truth that the team has agreed to defend. If your SLO is set higher than your actual capability, your error budget is always burned and the SLO is theater. If it's lower, you're under-investing in users who notice.

## Required inputs

1. **Service description** — what it does, what depends on it
2. **Traffic profile** — RPS, peak hours, geographic distribution
3. **User-facing criticality** — payments? auth? notifications? internal admin?
4. **Current observed reliability** — actual uptime, p50/p95/p99 latency, error rate
5. **Cost of unreliability** — revenue loss / hour, customer trust, regulatory exposure
6. **What 'reliable' means to the user** — fast load? successful checkout? data freshness?

## Output

### 1. SLIs (what we measure)
For each SLI, give the precise definition. Not "uptime" but "% of HTTP requests to /checkout/* that return status 2xx, measured at the edge load balancer, with client-error (4xx) responses excluded from the count."

Typical SLIs:
- **Availability**: successful_requests / total_requests
- **Latency**: % of requests served faster than a target threshold (equivalently, a bound on p95/p99 latency)
- **Quality**: % of requests with the expected response shape / correctness
- **Freshness**: % of reads where data is within N seconds of source-of-truth

Skip CPU / memory — those are causes, not SLIs.
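
As a rough illustration of the availability and latency definitions above, here is a minimal Python sketch. The `Request` record and both function names are hypothetical, and treating an empty sample as 1.0 is an assumption rather than a rule. In practice these numbers would come from load balancer logs or a metrics system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code seen at the measurement point
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Share of valid requests that returned 2xx; 4xx responses are excluded."""
    valid = [r for r in requests if not 400 <= r.status < 500]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the SLI
    good = sum(1 for r in valid if 200 <= r.status < 300)
    return good / len(valid)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of successful (2xx) requests served faster than threshold_ms."""
    ok = [r for r in requests if 200 <= r.status < 300]
    if not ok:
        return 1.0  # same assumption as above
    return sum(1 for r in ok if r.latency_ms < threshold_ms) / len(ok)


sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(200, 450.0),
    Request(404, 15.0),  # client error: excluded from availability
    Request(503, 30.0),  # server error: counts against availability
]
print(availability_sli(sample))    # 3 good out of 4 valid requests
print(latency_sli(sample, 200.0))  # 2 of 3 successful requests under 200 ms
```

Both SLIs are ratios of good events to valid events, which is what makes them usable with an error budget later on.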

### 2. SLO targets
For each SLI, the target with measurement window:
- "99.9% availability over a rolling 28 days"
- "p95 latency < 200ms over a rolling 28 days"

**Sanity check**: has the service actually met the proposed SLO over the last quarter? If not, the target is aspirational: either invest in reliability first or set a target the service can meet today.
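
One possible way to run this sanity check is to compare the proposed target with the SLI observed in each recent period. The function name, the "borderline" label, and the monthly figures below are all illustrative assumptions, not values from a real service.

```python
def check_target(proposed_slo: float, observed: list[float]) -> str:
    """Label a proposed SLO by comparing it with observed per-period SLI values."""
    if not observed:
        raise ValueError("need at least one observed period")
    met = sum(1 for value in observed if value >= proposed_slo)
    if met == len(observed):
        return "realistic"     # every observed period met the target
    if met == 0:
        return "aspirational"  # no observed period met the target
    return "borderline"        # some periods met it, some did not


# Hypothetical availability for the three months of last quarter.
last_quarter = [0.9993, 0.9987, 0.9995]
print(check_target(0.999, last_quarter))   # borderline: one month missed 99.9%
print(check_target(0.9999, last_quarter))  # aspirational: no month reached 99.99%
print(check_target(0.998, last_quarter))   # realistic: every month met 99.8%
```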

### 3. Error budget
- Total budget = (1 - SLO) × time window
- E.g. 99.9% over 28 days = 40.3 min of allowed downtime
- Express this in user-impact terms ("we can fail ~36,000 checkout requests per 28 days before we breach"; at 99.9% that figure corresponds to roughly 36 million requests per window)
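
The budget arithmetic above can be checked with a few lines of Python. The function names are hypothetical, and the 36 million requests per window is an assumed volume, chosen because it is what a 36,000-request budget implies at 99.9%.

```python
from datetime import timedelta


def time_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed full-outage time: (1 - SLO) x window."""
    return window * (1 - slo)


def request_budget(slo: float, total_requests: int) -> int:
    """Allowed failed requests: (1 - SLO) x requests in the window."""
    return round(total_requests * (1 - slo))


window = timedelta(days=28)
minutes = time_budget(0.999, window).total_seconds() / 60
print(round(minutes, 2))                   # 40.32 minutes of allowed downtime
print(request_budget(0.999, 36_000_000))   # 36000 failed requests allowed
```

The request-based figure is usually the more honest one, since a partial outage spends budget in proportion to the traffic it affects rather than the minutes it lasts.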

### 4. Burn-rate alerting
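Multiple alerts at different burn rates (burn rate = observed error rate divided by the error rate the SLO allows):
- **Fast burn** (on pace to consume the full 28-day budget in 1 hour, a burn rate of 672) → page on-call
- **Medium burn** (on pace to consume the full 28-day budget in 6 hours, a burn rate of 112) → ticket, no page
- **Slow burn** (on pace to consume the full 28-day budget in 3 days, a burn rate of about 9.3) → weekly review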

Single-threshold alerts (e.g. "any error rate > X%") are noisy. Burn-rate alerts catch real problems and ignore noise.
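
A minimal sketch of how the tiers above could be turned into thresholds, assuming the rolling 28-day window from the SLO examples. The names `TIERS`, `burn_rate`, and `classify` are hypothetical, and a real alert would compute the error rate over a lookback window in the monitoring system instead of taking it as an argument.

```python
WINDOW_HOURS = 28 * 24  # rolling 28-day window, as in the SLO examples

# (tier name, hours in which the whole budget would be consumed, action)
TIERS = [
    ("fast", 1, "page on-call"),
    ("medium", 6, "ticket, no page"),
    ("slow", 72, "weekly review"),
]


def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


def classify(error_rate: float, slo: float) -> str:
    """Return the most urgent tier whose burn-rate threshold is met."""
    rate = burn_rate(error_rate, slo)
    for name, hours_to_exhaust, action in TIERS:
        if rate >= WINDOW_HOURS / hours_to_exhaust:
            return f"{name} burn: {action}"
    return "no alert"


print(classify(0.70, 0.999))   # fast burn: page on-call
print(classify(0.15, 0.999))   # medium burn: ticket, no page
print(classify(0.01, 0.999))   # slow burn: weekly review
print(classify(0.001, 0.999))  # no alert
```

With these tiers and a 99.9% SLO, the fast-burn page fires only above a 67.2% error rate (672 × 0.1%). If that is too late for the service, a tier can be defined on a fraction of the budget instead; the Google SRE Workbook example pages when 2% of the budget is consumed in 1 hour.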

### 5. Error budget policy
When the budget is exhausted:
- What stops (feature work, deployments to this service, new dependencies)
- What starts (reliability work, runbook updates, incident review)
- Who decides exceptions (named role, not "team")

Without a policy, the SLO is observability. With one, it's a contract.
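
One way to keep the policy from staying implicit is to write it down as data. The sketch below is only an illustration: the class, its fields, and the example entries are assumptions, and rejecting "team" as an owner is a simple stand-in for the "named role" rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    stops: tuple[str, ...]   # work that pauses when the budget is exhausted
    starts: tuple[str, ...]  # work that begins when the budget is exhausted
    exception_owner: str     # the named role that can grant exceptions

    def __post_init__(self) -> None:
        if self.exception_owner.strip().lower() in {"", "team", "the team"}:
            raise ValueError("exception_owner must be a named role")

    def actions(self, budget_remaining_fraction: float) -> list[str]:
        """Return the policy actions once the budget is fully spent."""
        if budget_remaining_fraction > 0:
            return []
        return (
            [f"STOP: {item}" for item in self.stops]
            + [f"START: {item}" for item in self.starts]
            + [f"EXCEPTIONS: {self.exception_owner}"]
        )


policy = ErrorBudgetPolicy(
    stops=("feature work", "deployments to this service"),
    starts=("reliability work", "incident review"),
    exception_owner="Engineering Manager, Checkout",  # hypothetical role
)
print(policy.actions(0.25))  # [] while budget remains
print(policy.actions(0.0))   # the stop/start list once the budget is gone
```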

### 6. Review cadence
- Weekly: burn-rate trend
- Monthly: SLO vs. observed; recalibrate if observed reliability is consistently above the target (the SLO is too loose) or below it (the SLO is aspirational)
- Quarterly: re-examine the SLO targets and whether they still match user-perceived reliability

## Anti-patterns

- 99.99%+ targets without a serious investment in HA — you can't ship at that level on a single-region service
- SLOs measured at the wrong layer (server-side success when the user sees client-side timeouts)
- SLOs averaged over too-long windows (a 90-day window hides a bad week)
- Error budgets that nobody actually burns down to zero — the budget is the constraint, not a vanity number
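
The window-length anti-pattern above can be made concrete with a little arithmetic. This sketch assumes evenly distributed traffic, one bad week at 99% availability, perfect availability on every other day, and a bad week that falls entirely inside both windows; all figures are illustrative.

```python
def window_availability(bad_days: int, bad_day_availability: float,
                        window_days: int) -> float:
    """Availability over a window with some bad days and otherwise no failures."""
    failed_fraction = bad_days * (1 - bad_day_availability) / window_days
    return 1 - failed_fraction


SLO = 0.999
over_90 = window_availability(7, 0.99, 90)
over_28 = window_availability(7, 0.99, 28)

print(f"{over_90:.4%}", over_90 >= SLO)  # about 99.92%: the 90-day SLO still passes
print(f"{over_28:.4%}", over_28 >= SLO)  # 99.75%: the 28-day SLO is breached
```

The same bad week passes a 90-day 99.9% target and fails a 28-day one, which is why the examples in this skill use a rolling 28 days.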

Example prompts

Once installed, try these prompts in Claude:

Design SLOs for our checkout API. Traffic: 50 RPS peak, currently 99.9% available, payments downstream. [details]
We're page-overloaded. Help me right-size SLOs and the alert policy. [paste current SLOs + last quarter's pages]
