Production monitoring: when your app breaks at 3am
Sentry for errors, Vercel and Railway logs, uptime checks, alert fatigue. The minimum setup so you find out before your users do.
Most "your app is live" tutorials end the moment the deploy goes green. The day after the deploy is where the real work starts. Your app will break. The question isn't whether, it's whether you find out before your users do, or after the third support email shows up at midnight.
Here's the unglamorous truth nobody puts in launch posts: a small product running for six months without monitoring will silently fail at least a dozen times. A bad migration. An expired API key. A region outage at your provider. A regex you wrote at 11pm. Each one is a slow-bleed churn event you never saw.
The fix is not Datadog. The fix is four small pieces, wired up in an afternoon, for somewhere between zero and thirty bucks a month.
The minimum viable monitoring stack
Error tracking. Use Sentry. Drop the SDK in, add your DSN to the env vars, and every uncaught exception in your app lands in a dashboard with a stack trace, a user count, and a frequency. The free tier is 5,000 events a month, which is plenty for a small product. This is the layer that catches the silent 500s that quietly turn into churn.
npm install @sentry/nextjs
npx @sentry/wizard@latest -i nextjs
That's it. The wizard generates the config files, you paste the DSN, you redeploy. From that point on, every thrown error has a permanent record.
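Uncaught exceptions are captured automatically, but errors you catch yourself never reach Sentry unless you report them. A minimal sketch, assuming the wizard-generated setup; the route path and doSomethingRisky are made up for illustration:
// app/api/example/route.ts -- hypothetical route, for illustration
import * as Sentry from "@sentry/nextjs";

export async function GET() {
  try {
    await doSomethingRisky();
    return Response.json({ ok: true });
  } catch (err) {
    // Handled errors don't reach Sentry on their own; report explicitly.
    Sentry.captureException(err);
    return new Response("internal error", { status: 500 });
  }
}

// Stand-in for your actual work.
async function doSomethingRisky() {
  throw new Error("simulated failure");
}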
Logs. Vercel and Railway both ship reasonable log views out of the box. You don't need Datadog yet, you don't need Logtail yet. Learn how to read what you already have. In Vercel, open your project, click Logs, set the level filter to "error" and the status filter to 500. That single view catches most of what goes wrong. Railway has the same idea under "Deployments > Logs," with a status code filter built in.
vercel logs my-app-prod --follow | grep -E "5[0-9]{2}"
That tail-and-grep is your "is anything on fire right now" command.
Uptime checks. Better Stack, UptimeRobot, or even a free Cron-job.org GET request hitting your /api/health endpoint every minute. The cheapest insurance against a silent outage you can buy. Better Stack gives you 5 monitors free with SMS and email alerts.
A real health endpoint. This is the handful of lines that ties it all together.
// app/api/health/route.ts
import { Pool } from "pg";

// Opt out of route caching so every request actually hits the DB.
export const dynamic = "force-dynamic";

// One pool per instance, reused across requests.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function GET() {
  try {
    await pool.query("select 1"); // cheapest possible round trip
    return Response.json({ ok: true });
  } catch {
    return new Response("db down", { status: 500 });
  }
}
If your DB is down, this returns 500. Your uptime check fires. You find out before your users do.
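And the check itself is nothing magic. A sketch of what any uptime service effectively runs against that endpoint every minute, assuming Node 18+ for fetch and a HEALTH_URL env var (hypothetical); run it from cron and treat a nonzero exit as the alert:
// check.ts -- a nonzero exit code is your alert hook
const url = process.env.HEALTH_URL ?? "http://localhost:3000/api/health";
const res = await fetch(url).catch(() => null);
if (!res || !res.ok) {
  console.error(`health check failed: ${res ? res.status : "no response"}`);
  process.exit(1);
}
console.log("ok");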
Setup order
Day one: wire up Sentry. It takes 15 minutes and immediately starts collecting. You will see your first real error within 48 hours.
Day two: add the /api/health endpoint and point an uptime monitor at it. Now you have outside-in coverage too.
Day three through forever: leave logs as-is until something is actually weird. When something is weird, you'll go read them. That's the job.
When you have real users (call it a few hundred), come back and configure alert routing: who gets paged, for what, and through which channel.
Alert fatigue is the actual failure mode
If everything pages you, nothing pages you. The alerts that wake you at 3am should be exactly two things: "the site is down" and "errors just spiked 10x baseline." Everything else can wait for coffee.
A sane Sentry alert rule looks like this: "When the number of events in an issue is more than 50 in 1 hour for the first time, send to email." Translation: I want to know about new bugs that affect a lot of people, not the one user with an ad blocker that breaks one component. You can layer in a second rule for "any error in routes containing /api/checkout" because billing breakage is in its own category.
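You can also stop noise before it ever becomes an alert. The Sentry JS SDK accepts ignoreErrors and denyUrls at init time; a minimal sketch of the client config, assuming the wizard-generated sentry.client.config.ts (the patterns here are illustrative, not a recommended list):
// sentry.client.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN, // or the DSN the wizard pasted
  // Drop known-noisy browser errors before they eat quota or page you.
  ignoreErrors: [
    "ResizeObserver loop limit exceeded", // benign browser noise
    /chrome-extension:\/\//i,             // errors thrown by extensions
  ],
  // Ignore errors originating in third-party scripts you don't control.
  denyUrls: [/googletagmanager\.com/],
});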
The uptime check should ping every 1 minute and only alert after 2 consecutive failures. One failed check is a hiccup. Two is real.
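If you're rolling your own checker (like the cron sketch above), the two-consecutive-failures rule is a few lines of state. A sketch, with sendAlert standing in for whatever channel you wire up (both names hypothetical):
// monitor.ts -- alert only after two consecutive failed checks
let consecutiveFailures = 0;

async function runCheck(url: string): Promise<void> {
  const res = await fetch(url).catch(() => null);
  if (res?.ok) {
    consecutiveFailures = 0; // one success resets the counter
    return;
  }
  consecutiveFailures++;
  if (consecutiveFailures === 2) {
    await sendAlert(`site down: ${url}`);
  }
}

// Hypothetical alert hook: swap in email, SMS, or Slack.
async function sendAlert(message: string): Promise<void> {
  console.error(message);
}

setInterval(() => runCheck("http://localhost:3000/api/health"), 60_000);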
The "why is this slow" stage
That comes later. Once you have users complaining about speed, not before. Vercel Speed Insights and Railway metrics give you P95 latency by route. PageSpeed Insights and Lighthouse cover frontend perf. Don't chase any of this until someone has actually told you the app feels slow. Premature optimization eats weekends.
Logs versus errors
People conflate them. Errors are exceptions you want fixed, captured by Sentry, deduplicated by stack trace. Logs are the breadcrumbs you read when something is weird but not technically broken: a slow checkout, a webhook arriving twice, a user report you can't reproduce. Both matter. Neither replaces the other. Sentry tells you what's on fire, logs tell you why.
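The split is easy to see in a single handler. A sketch, with a made-up webhook route for illustration: the console.log line lands in your platform's log view, the captured exception lands in Sentry:
// app/api/webhook/route.ts -- hypothetical route, for illustration
import * as Sentry from "@sentry/nextjs";

export async function POST(req: Request) {
  const event = await req.json();

  // A log: a breadcrumb for later, visible in Vercel/Railway logs.
  console.log(JSON.stringify({ msg: "webhook received", eventId: event.id }));

  try {
    await handleEvent(event);
    return Response.json({ received: true });
  } catch (err) {
    // An error: an exception you want fixed, deduplicated by Sentry.
    Sentry.captureException(err);
    return new Response("webhook failed", { status: 500 });
  }
}

// Stand-in for real processing.
async function handleEvent(event: { id: string }) {}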
The cost reality
Sentry free tier covers a small product. Better Stack uptime is free for 5 monitors. Vercel and Railway logs are included in the plans you're already paying for. Total monthly cost of "I will know when my app breaks": $0 to $30. There is no excuse.
Closing Pillar 4
Across this pillar you took a working prototype, picked a host, deployed it, learned to ship from a terminal, and now you have eyes on the thing while you sleep. You own the stack. You own the alerts. You own the uptime story. Nobody is going to surprise you with a dashboard you never checked.
Pillar 5 is the productivity playbooks. Now that the infra is sorted and you know when things break, the next move is to stop manually doing the things you keep manually doing. Prompt libraries, agent loops, the small habits that turn one shipped product into a pipeline.