Developer pack
Claude Skill
On-call Runbook Builder
Builds a runbook for a specific alert — symptoms, diagnosis, remediation, escalation.
What it does
Given an alert (PagerDuty / Datadog / Grafana) and the system it watches, produces a runbook the on-call can follow at 3am: what the alert means, how to confirm the issue is real, the 2-3 most likely causes with diagnostic queries, the safe remediation steps, and when to escalate. Tested by whether a tired engineer who didn't write the system can resolve it.
When to use
- ✓You're owning an alert and there's no runbook (or the runbook is "ask Alice")
- ✓A new service is going on-call rotation
- ✓A specific alert keeps paging and the team doesn't know what to do
When not to use
- ✗The alert is broken — fix the alert before writing a runbook for the noise
- ✗You don't actually understand what causes this alert to fire
Install
Download the .zip, then unzip into your Claude skills folder.
mkdir -p ~/.claude/skills
unzip ~/Downloads/on-call-runbook-builder.zip -d ~/.claude/skills/
# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.SKILL.md
SKILL.md
---
name: on-call-runbook-builder
description: Use when writing a runbook for a specific alert or system failure mode. Triggers on "runbook", "on-call playbook", "alert response doc", "what to do when X fires".
---
# On-call Runbook Builder
A runbook is read at 3am by someone with low context, low blood sugar, and a customer-facing outage on the line. It must be skimmable, specific, and safe.
If the on-call has to read 3 paragraphs to know what command to run, the runbook failed.
## Required inputs
1. **The alert** — exact name, monitor / dashboard link, threshold
2. **The system** — service name, language, key dependencies
3. **What the alert actually means** — the underlying user/system condition
4. **Known causes** — historical reasons this has fired
5. **Who else owns adjacent systems** — escalation contacts
If the user can't articulate what the alert means in plain language, the runbook will fail. Push back: "When this fires, what's actually happening that's bad for users?"
## Structure
```
# Runbook: [Alert name]
## What this alert means
1 sentence in plain language. "Checkout requests are taking longer than 2s at p99,
which means users see a slow page or a timeout."
## Severity
- **SEV1** if X (e.g. customer impact >5%)
- **SEV2** otherwise
- Page criteria: [who else gets paged at what threshold]
## Confirm it's real (90 seconds)
1. Open [exact dashboard link]
2. Check [specific metric] — is it actually elevated?
3. Check [related metric] to rule out [common false positive]
If not real: silence the alert with a short note in #ops. Otherwise continue.
## Most likely causes (in order of historical frequency)
### Cause 1: [name] (X% of past incidents)
**Diagnose**:
- Run: \`exact command or query\`
- Look in: [Datadog log query / Sentry / specific dashboard panel]
- Sign it's THIS cause: [what the data looks like]
**Remediate**:
- Step-by-step. Each step a single command or click.
- Mark which steps are SAFE TO ROLLBACK and which are not
- Expected time to recovery
### Cause 2: [name]
[Same structure]
### Cause 3: [name]
[Same structure]
## If none of the above
Symptoms that don't match the known causes — likely something new.
- Capture: \`logs query\`, \`metrics dashboard link\`, \`trace ID\`
- Page: [next escalation contact]
- DO NOT: [destructive actions to avoid]
## Escalation
- 15 min unresolved → page [name / role]
- 30 min unresolved → page [name / role]
- Customer-facing → comms lead in [#channel]
## After the incident
- Open postmortem if duration > X min or customer impact
- Add anything new to this runbook (especially new failure modes)
```
## Operability requirements
Every runbook must include:
- **Exact commands** — not "check the logs" but `kubectl logs -n prod checkout-api`
- **Exact dashboard links** — not "open Datadog" but the specific saved view
- **Permissions needed** — does on-call need prod DB access? AWS credentials? Mark it.
- **Rollback path for any change** — "to revert: `kubectl rollout undo deploy/checkout`"
## Anti-patterns
- "Investigate the issue" — useless without a starting point
- "Restart the service" as step 1 — masks problems and loses state
- Runbook that assumes you wrote the service — write for someone who didn't
- Runbook that hasn't been tested in a fire drill — it's wishful thinking until then
## Tone
- Imperative. "Run X. Check Y. If Z, do A." Not "you might want to consider..."
- Short. Bullets and code blocks beat prose at 3am.
- Specific. Service names, query strings, channel names — copy-pasteable.
## Output
Markdown. End with: "Last reviewed: [date]. Next review: [3 months out]." Stale runbooks are worse than no runbooks.
Example prompts
Once installed, try these prompts in Claude:
- Build a runbook for the "Checkout-API p99 latency > 2s" Datadog alert. The service is a Node + Postgres app behind Cloudflare.
- Runbook for "Kafka consumer lag > 100k" alert. Consumer is the order-fulfillment service.