Postmortem Writer

Writes a blameless postmortem from incident notes — timeline, root cause, action items.

What it does

Takes raw incident notes (Slack thread, Datadog graphs, on-call timeline) and produces a blameless postmortem: the timeline, the root cause separated from contributing factors, and action items with owners and dates. It inverts the usual ratio: most postmortems spend 80% of their length on the timeline and 20% on the root cause; good ones do the opposite.

When to use

  • After a production incident with customer or revenue impact
  • After a near-miss that almost caused customer impact
  • After a recurring issue you keep papering over — write it up to break the cycle

When not to use

  • You haven't actually investigated yet — a postmortem comes after the RCA, not instead of it
  • The incident is still ongoing — write the timeline, save the analysis

Install

Download the .zip, then unzip into your Claude skills folder.

```
mkdir -p ~/.claude/skills
unzip ~/Downloads/postmortem-writer.zip -d ~/.claude/skills/

# Restart your Claude Code session.
# The skill is now available — Claude will use it when relevant.
```
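
To verify the install (assuming the zip expands to a `postmortem-writer/` directory; adjust the path if your archive lays out differently):

```
ls ~/.claude/skills/postmortem-writer/
# SKILL.md should be at the top level of the skill folder.
head -5 ~/.claude/skills/postmortem-writer/SKILL.md
```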

SKILL.md

---
name: postmortem-writer
description: Use when writing a blameless postmortem after a production incident. Triggers on "postmortem", "post-mortem", "incident report", "RCA writeup", or pasted incident channel transcripts.
---

# Postmortem Writer

Most postmortems fail because they spend 80% on timeline and 20% on root cause. Invert that. The timeline exists to support the analysis, not be the analysis.

A good postmortem makes a similar incident less likely. A great one teaches everyone who reads it something they didn't know about the system.

## Required inputs

1. **What broke and how it was detected** (alert? customer? engineer noticed?)
2. **Timeline notes** — Slack channel transcript, on-call log, key timestamps
3. **Customer / revenue impact** — duration, scope, what users experienced
4. **Mitigation** — what action restored service
5. **What's known so far** about the root cause
6. **Who was involved** in the response

If the user gives you only the timeline and no analysis attempt, push back. Ask: "What do you currently believe is the root cause? What's still unclear?"

## Structure

```
# Incident: [Short descriptive title — "Checkout 5xx for 23 min on Apr 27"]

## Summary
2-3 sentences. What happened, who was affected, how long, what fixed it.
A reader should be able to skip the rest of the doc and still know the story.

## Impact
- **Duration**: HH:MM - HH:MM UTC (XX min)
- **Scope**: which users / regions / features
- **Magnitude**: requests failed, $$$, NPS, support tickets, etc.
- **Detection**: how we found out (alert / customer / engineer)
- **Time to detect**: X min
- **Time to mitigate**: X min

## Root cause
1 paragraph. THE single technical cause, separated from contributing factors.
"The cache key included the user's locale; locale was undefined for SSO users;
the lookup matched any user, returning the wrong session." — that's a root cause.

NOT: "Multiple things went wrong including X, Y, and Z." Pick the actual cause.

## Contributing factors
3-5 bullets. The conditions that made the root cause possible:
- Why the bug existed (recent refactor, missing test, etc.)
- Why it wasn't caught (missing alert, no canary, deploy gates)
- Why it took so long to mitigate (runbook missing, on-call lacked access)

## Timeline (UTC)

| Time | Event |
|------|-------|
| 14:02 | Deploy of build #4823 begins |
| 14:07 | First 5xx alert fires (#prod-checkout) |
| ... | ... |

Keep entries factual. Skip "decided to look into it" — useless filler.

## What went well
2-3 specific things. "Datadog APM trace let us narrow to one service in 90 seconds."
Don't be performative — only write it if it's true.

## What went wrong (response, not cause)
2-3 specific things. "Runbook for cache invalidation was for the old cluster."
Don't blame people; blame missing safeguards.

## Action items

| Action | Owner | Due | Tracking |
|--------|-------|-----|----------|
| Add canary deploy stage to checkout | @alex | May 15 | PROJ-3401 |

Each action must have:
- A specific deliverable (not "improve testing")
- A real owner (not "the team")
- A due date (not "soon")
- A linked ticket (not "TBD")
```
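
If action items are tracked in GitHub Issues, each table row can be filed from the terminal so the "Tracking" column never stays TBD. A sketch using the GitHub `gh` CLI; the title, assignee, and label values are placeholders taken from the example row above:

```
gh issue create \
  --title "Add canary deploy stage to checkout" \
  --assignee alex \
  --label postmortem-action \
  --body "From postmortem: Checkout 5xx for 23 min on Apr 27. Due May 15."
```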

## Blameless framing

- Replace "X did Y" with "Y was done" or "the deploy was triggered" — focus on the system
- Reject "human error" as a root cause. The question is: why did the system let a human error cause this?
- If a person made a judgment call under uncertainty, that's information, not blame

## Anti-patterns

- Action items that are "be more careful" / "better testing" — not actions
- "Root cause: the bug" — that's the symptom, not the cause
- Postmortem with no action items — what was the point?
- Postmortem with 14 action items — none will get done; pick the 3-5 that matter
- Hand-wavy timestamps — get them from logs, not memory
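
On the last point: if your services emit JSON-lines logs, exact timestamps for the timeline table can be pulled mechanically rather than reconstructed from memory. A minimal jq sketch; the `timestamp`, `status`, `event`, and `build` field names are hypothetical, so adapt them to your log schema:

```
# First 5xx the logs saw, usually earlier than the first alert fired.
jq -r 'select(.status >= 500) | .timestamp' app.jsonl | sort | head -1

# Every deploy event in the incident window, formatted for the timeline table.
jq -r 'select(.event == "deploy") | "\(.timestamp) | Deploy of build #\(.build)"' app.jsonl
```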

## Output

Markdown. Include a "Distribution" section at the bottom: who should read this (team, broader engineering, customers if external). For incidents with customer impact, recommend whether a public-facing summary is needed.

Example prompts

Once installed, try these prompts in Claude:

  • Write a postmortem for the checkout outage on April 27. Notes from the incident channel: [paste]. Datadog graphs attached.
  • Postmortem for the database failover that took 47 minutes instead of the expected 5. Cause: stale DNS. [paste timeline]