Investigate a production incident across logs, deploys, and config

{{symptom: e.g. the /checkout endpoint started returning 500s about an hour ago}}.

Correlate across:
1. **Recent deploys** — what shipped in the last 24h, who, what changed
2. **Recent config / feature flag changes** — anything toggled in the relevant window
3. **Logs** — error rate, top error messages, when the spike started, which hosts/regions
4. **Dependencies** — upstream services, DB, cache, third-party APIs — any of them degraded
5. **Traffic** — request volume, request shape, anything unusual (new user-agent, new geo)
6. **Resource pressure** — CPU, memory, connection pool, disk
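
The correlation across the first three sources can be sketched in code. The following is a minimal illustration, assuming per-minute error counts and a list of change events (deploys and flag toggles) have already been exported. The function names, the 3x threshold, the one-hour window, and all sample data are hypothetical and not tied to any particular logging or deploy tool.

```python
from datetime import datetime, timedelta


def find_spike_start(error_counts, baseline_minutes=30, factor=3.0):
    """Return the first timestamp whose error count exceeds factor x the baseline mean.

    error_counts: list of (datetime, int) sorted by time, one entry per minute.
    The first `baseline_minutes` entries are treated as normal behaviour.
    """
    baseline = [count for _, count in error_counts[:baseline_minutes]]
    if not baseline:
        return None
    # Floor the threshold at 1 so a zero-error baseline does not flag every error.
    threshold = max(1.0, factor * (sum(baseline) / len(baseline)))
    for ts, count in error_counts[baseline_minutes:]:
        if count > threshold:
            return ts
    return None


def changes_before(spike_start, changes, window=timedelta(hours=1)):
    """Return changes made within `window` before the spike, closest first."""
    candidates = [
        c for c in changes if spike_start - window <= c["at"] <= spike_start
    ]
    return sorted(candidates, key=lambda c: spike_start - c["at"])


# Made-up sample data: errors jump from 2/min to 40/min at minute 45.
start = datetime(2024, 1, 1, 12, 0)
counts = [(start + timedelta(minutes=i), 2 if i < 45 else 40) for i in range(60)]
changes = [
    {"kind": "deploy", "name": "checkout-service build A", "at": start + timedelta(minutes=10)},
    {"kind": "flag", "name": "new_payment_flow", "at": start + timedelta(minutes=43)},
    {"kind": "deploy", "name": "old build", "at": start - timedelta(hours=5)},
]

spike = find_spike_start(counts)
suspects = changes_before(spike, changes)
for change in suspects:
    print(change["kind"], change["name"], spike - change["at"])
```

Ordering changes by how close they are to the spike is only a heuristic: proximity in time suggests where to look first but does not establish cause.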

Output a ranked list of likely causes (most likely first), each with:
- The signal that points to it
- The signal that would rule it out
- The single fastest action to confirm or deny
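
One way to picture the requested output is as a small record per cause, sorted by likelihood. This is a sketch only: the `Hypothesis` class, its field names, the numeric `likelihood` score, and the example entries are all hypothetical illustrations, not a required format.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    supporting_signal: str   # the signal that points to it
    ruling_out_signal: str   # the signal that would rule it out
    fastest_check: str       # the single fastest action to confirm or deny
    likelihood: float        # subjective 0..1 estimate (assumption, not measured)


def rank(hypotheses):
    """Return hypotheses ordered most likely first."""
    return sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)


# Illustrative entries only; real ones come from the investigation.
ranked = rank([
    Hypothesis(
        cause="DB connection pool exhausted",
        supporting_signal="pool wait time rising with the error rate",
        ruling_out_signal="pool utilisation flat during the spike",
        fastest_check="look at the pool utilisation graph",
        likelihood=0.3,
    ),
    Hypothesis(
        cause="Feature flag toggled shortly before the spike",
        supporting_signal="errors start within minutes of the toggle",
        ruling_out_signal="errors also occur for users without the flag",
        fastest_check="turn the flag off for a small slice of traffic",
        likelihood=0.6,
    ),
])

first_check = ranked[0].fastest_check
print(first_check)
```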

End with the one thing I should check first.


