Developer pack
Claude Skill

On-call Runbook Builder

Builds a runbook for a specific alert — symptoms, diagnosis, remediation, escalation.

What it does

Given an alert (PagerDuty / Datadog / Grafana) and the system it watches, produces a runbook the on-call can follow at 3am: what the alert means, how to confirm the issue is real, the 2-3 most likely causes with diagnostic queries, the safe remediation steps, and when to escalate. Tested by whether a tired engineer who didn't write the system can resolve it.

When to use

  • You're owning an alert and there's no runbook (or the runbook is "ask Alice")
  • A new service is going on-call rotation
  • A specific alert keeps paging and the team doesn't know what to do

When not to use

  • The alert is broken — fix the alert before writing a runbook for the noise
  • You don't actually understand what causes this alert to fire

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/on-call-runbook-builder.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

SKILL.md
---
name: on-call-runbook-builder
description: Use when writing a runbook for a specific alert or system failure mode. Triggers on "runbook", "on-call playbook", "alert response doc", "what to do when X fires".
---

# On-call Runbook Builder

A runbook is read at 3am by someone with low context, low blood sugar, and a customer-facing outage on the line. It must be skimmable, specific, and safe.

If the on-call has to read 3 paragraphs to know what command to run, the runbook failed.

## Required inputs

1. **The alert** — exact name, monitor / dashboard link, threshold
2. **The system** — service name, language, key dependencies
3. **What the alert actually means** — the underlying user/system condition
4. **Known causes** — historical reasons this has fired
5. **Who else owns adjacent systems** — escalation contacts

If the user can't articulate what the alert means in plain language, the runbook will fail. Push back: "When this fires, what's actually happening that's bad for users?"

## Structure

```
# Runbook: [Alert name]

## What this alert means
1 sentence in plain language. "Checkout requests are taking longer than 2s at p99,
which means users see a slow page or a timeout."

## Severity
- **SEV1** if X (e.g. customer impact >5%)
- **SEV2** otherwise
- Page criteria: [who else gets paged at what threshold]

## Confirm it's real (90 seconds)
1. Open [exact dashboard link]
2. Check [specific metric] — is it actually elevated?
3. Check [related metric] to rule out [common false positive]

If not real: silence the alert with a short note in #ops. Otherwise continue.

## Most likely causes (in order of historical frequency)

### Cause 1: [name] (X% of past incidents)
**Diagnose**:
- Run: \`exact command or query\`
- Look in: [Datadog log query / Sentry / specific dashboard panel]
- Sign it's THIS cause: [what the data looks like]

**Remediate**:
- Step-by-step. Each step a single command or click.
- Mark which steps are SAFE TO ROLLBACK and which are not
- Expected time to recovery

### Cause 2: [name]
[Same structure]

### Cause 3: [name]
[Same structure]

## If none of the above

Symptoms that don't match the known causes — likely something new.
- Capture: \`logs query\`, \`metrics dashboard link\`, \`trace ID\`
- Page: [next escalation contact]
- DO NOT: [destructive actions to avoid]

## Escalation
- 15 min unresolved → page [name / role]
- 30 min unresolved → page [name / role]
- Customer-facing → comms lead in [#channel]

## After the incident
- Open postmortem if duration > X min or customer impact
- Add anything new to this runbook (especially new failure modes)
```

## Operability requirements

Every runbook must include:
- **Exact commands** — not "check the logs" but `kubectl logs -n prod checkout-api`
- **Exact dashboard links** — not "open Datadog" but the specific saved view
- **Permissions needed** — does on-call need prod DB access? AWS credentials? Mark it.
- **Rollback path for any change** — "to revert: `kubectl rollout undo deploy/checkout`"

## Anti-patterns

- "Investigate the issue" — useless without a starting point
- "Restart the service" as step 1 — masks problems and loses state
- Runbook that assumes you wrote the service — write for someone who didn't
- Runbook that hasn't been tested in a fire drill — it's wishful thinking until then

## Tone

- Imperative. "Run X. Check Y. If Z, do A." Not "you might want to consider..."
- Short. Bullets and code blocks beat prose at 3am.
- Specific. Service names, query strings, channel names — copy-pasteable.

## Output

Markdown. End with: "Last reviewed: [date]. Next review: [3 months out]." Stale runbooks are worse than no runbooks.

Example prompts

Once installed, try these prompts in Claude:

  • Build a runbook for the "Checkout-API p99 latency > 2s" Datadog alert. The service is a Node + Postgres app behind Cloudflare.
  • Runbook for "Kafka consumer lag > 100k" alert. Consumer is the order-fulfillment service.