# Postmortem Writer

A Claude Skill that writes a blameless postmortem from incident notes — timeline, root cause, and action items.
## What it does
Takes raw incident notes (Slack thread, Datadog graphs, on-call timeline) and produces a blameless postmortem: the timeline, the root cause separated from contributing factors, action items with owners and dates. Inverts the usual ratio — most postmortems spend 80% on timeline and 20% on root cause; good ones do the opposite.
## When to use

- ✓ After a production incident with customer or revenue impact
- ✓ After a near-miss that almost caused customer impact
- ✓ After a recurring issue you keep papering over — write it up to break the cycle
## When not to use

- ✗ You haven't actually investigated yet — a postmortem comes after the RCA, not instead of it
- ✗ The incident is still ongoing — write the timeline, save the analysis for later
## Install

Download the .zip, then unzip it into your Claude skills folder:

```sh
mkdir -p ~/.claude/skills
unzip ~/Downloads/postmortem-writer.zip -d ~/.claude/skills/
# Restart your Claude Code session.
# The skill is now available — Claude will use it when relevant.
```
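To confirm the install, check that the skill file landed where Claude looks for it. A quick sanity check, assuming the zip unpacks into a top-level `postmortem-writer/` folder (adjust the path if your archive differs):

```sh
# Verify the skill file is in place. The directory name is an
# assumption about the zip's layout; adjust if yours differs.
ls ~/.claude/skills/postmortem-writer/SKILL.md
```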
## SKILL.md
---
name: postmortem-writer
description: Use when writing a blameless postmortem after a production incident. Triggers on "postmortem", "post-mortem", "incident report", "RCA writeup", or pasted incident channel transcripts.
---
# Postmortem Writer
Most postmortems fail because they spend 80% on timeline and 20% on root cause. Invert that. The timeline exists to support the analysis, not be the analysis.
A good postmortem makes a similar incident less likely. A great one teaches everyone who reads it something they didn't know about the system.
## Required inputs
1. **What broke and how it was detected** (alert? customer? engineer noticed?)
2. **Timeline notes** — Slack channel transcript, on-call log, key timestamps
3. **Customer / revenue impact** — duration, scope, what users experienced
4. **Mitigation** — what action restored service
5. **What's known so far** about the root cause
6. **Who was involved** in the response
If the user gives you only the timeline and no analysis attempt, push back. Ask: "What do you currently believe is the root cause? What's still unclear?"
## Structure
```
# Incident: [Short descriptive title — "Checkout 5xx for 23 min on Apr 27"]
## Summary
2-3 sentences. What happened, who was affected, how long, what fixed it.
A reader should be able to skip the rest of the doc and still know the story.
## Impact
- **Duration**: HH:MM - HH:MM UTC (XX min)
- **Scope**: which users / regions / features
- **Magnitude**: requests failed, $$$, NPS, support tickets, etc.
- **Detection**: how we found out (alert / customer / engineer)
- **Time to detect**: X min
- **Time to mitigate**: X min
## Root cause
1 paragraph. THE single technical cause, separated from contributing factors.
"The cache key included the user's locale; locale was undefined for SSO users;
the lookup matched any user, returning the wrong session." — that's a root cause.
NOT: "Multiple things went wrong including X, Y, and Z." Pick the actual cause.
## Contributing factors
3-5 bullets. The conditions that made the root cause possible:
- Why the bug existed (recent refactor, missing test, etc.)
- Why it wasn't caught (missing alert, no canary, deploy gates)
- Why it took so long to mitigate (runbook missing, on-call lacked access)
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:02 | Deploy of build #4823 begins |
| 14:07 | First 5xx alert fires (#prod-checkout) |
| ... | ... |
Keep entries factual. Skip "decided to look into it" — useless filler.
## What went well
2-3 specific things. "Datadog APM trace let us narrow to one service in 90 seconds."
Don't be performative — only write it if it's true.
## What went wrong (response, not cause)
2-3 specific things. "Runbook for cache invalidation was for the old cluster."
Don't blame people; blame missing safeguards.
## Action items
| Action | Owner | Due | Tracking |
|--------|-------|-----|----------|
| Add canary deploy stage to checkout | @alex | May 15 | PROJ-3401 |
Each action must have:
- A specific deliverable (not "improve testing")
- A real owner (not "the team")
- A due date (not "soon")
- A linked ticket (not "TBD")
```
## Blameless framing
- Replace "X did Y" with "Y was done" or "the deploy was triggered" — focus on the system
- Reject "human error" as a root cause. The question is: why did the system let a human error cause this?
- If a person made a judgment call under uncertainty, that's information, not blame
## Anti-patterns
- Action items that are "be more careful" / "better testing" — not actions
- "Root cause: the bug" — that's the symptom, not the cause
- Postmortem with no action items — what was the point?
- Postmortem with 14 action items — none will get done; pick the 3-5 that matter
- Hand-wavy timestamps — get them from logs, not memory
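Duration fields like time-to-detect fall out of simple epoch arithmetic once exact timestamps are pulled from logs. A minimal sketch using GNU `date` (the timestamps are hypothetical, echoing the sample timeline above):

```sh
# Time-to-detect in minutes from two exact UTC timestamps.
# Timestamps are hypothetical examples; GNU date syntax (Linux).
start=$(date -u -d "2025-04-27 14:02:00" +%s)   # deploy begins
detect=$(date -u -d "2025-04-27 14:07:00" +%s)  # first 5xx alert fires
echo "Time to detect: $(( (detect - start) / 60 )) min"
```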
## Output
Markdown. Include a "Distribution" section at the bottom: who should read this (team, broader engineering, customers if external). For incidents with customer impact, recommend whether a public-facing summary is needed.
## Example prompts
Once installed, try these prompts in Claude:
- Write a postmortem for the checkout outage on April 27. Notes from the incident channel: [paste]. Datadog graphs attached.
- Postmortem for the database failover that took 47 minutes instead of the expected 5. Cause: stale DNS. [paste timeline]