Claude Skill

Observability Designer

Defines what to log, what to metric, what to trace — and what NOT to — for a service you can debug at 3am.

Download skill (.zip)Or download whole pack

What it does

Takes a service description, traffic profile, and known failure modes, and outputs a structured observability plan: log lines that matter, metrics that page the right person, traces that pinpoint slow steps, alerts that fire on real problems rather than noise. Pairs with the SLO designer skill.

When to use

✓New service going to prod, no observability set up yet
✓Existing service where "we have no idea what happened" is a recurring incident pattern
✓Migrating observability vendors and want to redesign rather than port

When not to use

✗You already have good observability and just need a specific alert tuned
✗Pre-prototype service with no real traffic

Install

Download the .zip, then unzip into your Claude skills folder.

mkdir -p ~/.claude/skills
unzip ~/Downloads/observability-designer.zip -d ~/.claude/skills/

# Restart Claude Code session.
# Skill is now available — Claude will use it when relevant.

SKILL.md

---
name: observability-designer
description: Use when designing observability (logs, metrics, traces, alerts) for a service. Triggers on "observability", "logging strategy", "what to monitor", "instrument this".
---

# Observability Designer

Observability is the answer to "what happened?" at 3am. Done well, the on-call sees the issue in 5 minutes. Done badly, they spend an hour grepping logs before finding the right line.

## Required inputs

1. **Service description** — what it does, who depends on it
2. **Stack** — language, framework, infra
3. **Traffic profile** — RPS, peak hours, distribution
4. **Known failure modes** — what's gone wrong historically
5. **What's measured today** (if any) — to avoid duplicating

## Output

### Logs — what to record
- **Every request** — method, path, user/tenant, latency, status, bytes
- **Every external call** — service, latency, status, retry count
- **Every error** — exception type, message, stack, correlation ID
- **Key state transitions** — order created, payment captured, refund issued

Structured JSON. Skip "INFO" logs that nobody reads. Don't log secrets, PII without consent, or full request/response bodies for auth endpoints.

### Metrics — what to measure
- **The four golden signals**: latency, traffic, errors, saturation
- **Business metrics**: orders/min, successful checkouts/min, payment failures/min
- **Resource metrics**: DB pool used %, queue depth, GC pauses

Tag dimensions you'll actually slice on (env, region, endpoint). Avoid high-cardinality tags (user_id) that explode storage.

### Traces — when to instrument
- Request flows through 3+ services → trace
- Long async pipelines → trace
- Single-process simple endpoints → don't bother, logs are enough

### Alerts — what to wake people for
- **Page-worthy**: customer-facing errors > threshold, SLO burn-rate fast, hard dependency down
- **Ticket-worthy**: SLO burn-rate slow, capacity trending toward limit
- **Email/Slack**: weekly trends, anomaly detection low-confidence

Don't page on resource metrics directly (CPU%, mem%). Those are causes. Alert on outcomes (error rate, latency) and let resource metrics be context.

### Dashboards
- One overview per service: 4 golden signals + business metrics
- Per-feature deep-dive dashboards as needed
- Skip dashboards nobody opens — they age into lies

## Anti-patterns
- Logging everything "just in case" (creates noise, cost, GDPR risk)
- Alerts that fire 10x/day and get muted
- Metrics no one looks at
- Dashboards that get out of date

## What to NOT log
- Secrets / API keys / tokens
- PII without explicit need
- Full request bodies on sensitive endpoints
- Full response bodies on large responses (sample instead)

Example prompts

Once installed, try these prompts in Claude:

Design observability for our checkout API. Traffic 50 RPS peak. Stack: Node + Postgres. Known failure modes: timeouts to Stripe, deadlocks on the orders table. [other context]
We have logs but not metrics. Where do I start without instrumenting every line?

HN X LinkedIn Reddit