The On-Call Engineer Who Slept Through the Night

What actually changes when your team has proper upstream monitoring in place. A short story about the incident that wasn't.

By CheckUpstream Team

Maria's phone buzzed at 11:43 PM on a Wednesday. She was already in bed, laptop closed, halfway through an episode of something she'd forget the name of by morning.

The notification was from CheckUpstream, not PagerDuty. That distinction mattered. PagerDuty meant something in her stack was on fire. CheckUpstream meant something in someone else's stack was on fire.

The alert read: "OpenAI API: Degraded Performance. Elevated error rates on chat completions endpoint."

She picked up her phone, read the alert, and thought about it for a second. Her app used OpenAI for a summarization feature. Important, but not critical. Users could still browse, search, upload, and do everything else. The summarization button would show an error state, which the frontend already handled because they'd built a fallback for exactly this situation three months ago.

She typed a quick message in the team's Slack channel: "OpenAI having issues. Our summarization feature will be degraded. Frontend fallback is active. No action needed from us."

Then she went back to sleep.

At 6:15 AM, she checked her phone again. CheckUpstream had sent a follow-up at 1:12 AM: "OpenAI API: Resolved." Total incident duration: about 90 minutes. Her summarization feature had been running in fallback mode the entire time, showing cached results instead of live summaries. Three users had clicked the "Retry" button. Nobody had filed a support ticket.

That was the whole incident. No pages. No war room. No 2 AM debugging session. No post-mortem needed.

What made this boring

The interesting thing about Maria's night isn't what happened. It's everything that didn't happen. A year earlier, the same scenario would have played out very differently.

The old version: a user reports that the summarization feature is broken. A support ticket gets created. The on-call engineer gets paged. She logs in, pulls up the deployment history, digs through logs, spots OpenAI timeout errors, refreshes the OpenAI status page (which might or might not reflect the issue yet), and eventually figures out it's an upstream problem. By then it's 12:30 AM, she's fully awake, her adrenaline is up, and she's going to have a rough Thursday. It's the same five-minute scramble that costs real money, just at a worse hour.

The new version: she got a push notification, read one sentence, and went back to sleep.

The difference between these two scenarios isn't clever engineering. The team didn't build an elaborate failover system or a multi-provider AI gateway. They did two things:

  1. They set up upstream monitoring so they'd know about OpenAI issues before their users did.
  2. They built a simple fallback UI for the summarization feature (show cached results, display a "temporarily unavailable" message, offer a retry button).

Total engineering effort for both of those: about a day and a half. The monitoring setup took 20 minutes. The fallback UI took about a day, mostly because the designer wanted the error state to look nice.

The compounding value of boring

Here's what I think people miss about upstream monitoring. The value isn't in any single incident. It's in the cumulative effect on your team's operational health.

Maria's team has 4 engineers who rotate on-call weekly. Before upstream monitoring, each engineer averaged 2.3 pages per on-call week, and about 40% of those turned out to be upstream issues. That's roughly one unnecessary wake-up per week, per rotation.

After setting up monitoring, the upstream-caused pages dropped to near zero. Not because the outages stopped happening (they didn't), but because the team could handle them proactively during business hours or dismiss them at a glance during off-hours.

Over a quarter, that's roughly 12 fewer unnecessary pages. Twelve nights where someone slept instead of didn't. Twelve mornings where someone was productive instead of exhausted.

Nobody tracks the cost of sleep deprivation on engineering teams, but anyone who's been on-call knows it's real. The day after a 2 AM page is a lost day. You show up, you drink too much coffee, you push through, but you're not writing your best code. And if it happens often enough, people start looking for jobs that don't have on-call.

Upstream monitoring doesn't eliminate on-call pain. There will always be real incidents that need real responses at bad times. But it does eliminate the specific, avoidable pain of getting woken up for someone else's problem that you can't fix anyway.

The three things that matter

Over the past year of talking to teams who've adopted upstream monitoring, we've noticed that the ones who get the most value all do three things:

They categorize their dependencies by response type. Not every upstream outage needs the same response. Stripe going down is a "wake me up" event. PostHog going down is an "I'll check in the morning" event. Categorizing dependencies by their impact level means your alerts are calibrated correctly. The critical stuff wakes you up. Everything else sends a Slack message.
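
To make that concrete, here's a minimal sketch of what a severity map might look like in code. The dependency names, reasons, and routing logic are illustrative assumptions on our part, not part of any particular tool's API:

```typescript
// Hypothetical sketch: map each upstream dependency to how its alerts should
// be routed. "page" wakes the on-call engineer; "slack" waits for morning.
type Severity = "page" | "slack";

interface UpstreamDependency {
  name: string;
  severity: Severity;
  reason: string; // why this dependency gets this treatment
}

const dependencies: UpstreamDependency[] = [
  { name: "Stripe",  severity: "page",  reason: "Payments stop; revenue impact is immediate." },
  { name: "OpenAI",  severity: "slack", reason: "Summarization degrades; frontend fallback covers it." },
  { name: "PostHog", severity: "slack", reason: "Analytics gap; nothing user-facing breaks." },
];

// Decide what to do when an upstream alert arrives.
function routeAlert(serviceName: string, message: string): void {
  const dep = dependencies.find((d) => d.name === serviceName);
  if (!dep || dep.severity === "slack") {
    console.log(`[slack] ${serviceName}: ${message}`); // stand-in for a Slack post
  } else {
    console.log(`[pager] ${serviceName}: ${message}`); // stand-in for a PagerDuty page
  }
}

routeAlert("OpenAI", "Degraded performance on chat completions endpoint");
```

The exact mechanism matters less than the fact that the decision is written down before the outage, not argued about at 11:43 PM.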

They build fallback UIs, not fallback systems. You don't need a second payment processor on standby (although if you do, great). What you need is a frontend that handles the error state gracefully. A loading spinner that eventually shows "This feature is temporarily unavailable, please try again in a few minutes" is a perfectly good fallback for most non-critical features. Users understand. They really do.
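
As a rough sketch (the helper names and in-memory cache here are assumptions, not the actual code from Maria's team), the "show cached results instead of live summaries" fallback from the story could look something like this:

```typescript
// Hypothetical sketch of a summarization call with a graceful fallback.
// fetchLiveSummary, the endpoint path, and the cache are all illustrative.
const summaryCache = new Map<string, string>();

async function fetchLiveSummary(documentId: string): Promise<string> {
  const res = await fetch(`/api/summaries/${documentId}`);
  if (!res.ok) throw new Error(`Summary API returned ${res.status}`);
  return res.text();
}

interface SummaryResult {
  text: string;
  degraded: boolean; // true when the UI should show "temporarily unavailable" plus a Retry button
}

async function getSummary(documentId: string): Promise<SummaryResult> {
  try {
    const text = await fetchLiveSummary(documentId);
    summaryCache.set(documentId, text); // keep the cache warm for the next outage
    return { text, degraded: false };
  } catch {
    // Upstream is down or slow: fall back to the last good summary if we have one.
    const cached = summaryCache.get(documentId);
    return {
      text: cached ?? "Summaries are temporarily unavailable. Please try again in a few minutes.",
      degraded: true,
    };
  }
}
```

A day of work, most of it on how the degraded state looks, and the feature fails quietly instead of loudly.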

They use the alert as the start of communication, not investigation. When an upstream alert fires, the first action isn't to investigate. It's to communicate. Post in the team channel. Update the internal status page. Let support know. The investigation was already done by the service provider. Your job is to manage the impact on your users, not to diagnose what went wrong at Stripe.
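
If you want that reflex to be cheap, a tiny helper goes a long way. Here's a sketch assuming a standard Slack incoming webhook; the function name, wording, and URL are illustrative, not a prescribed workflow:

```typescript
// Hypothetical sketch: when an upstream alert fires, post to the team channel
// first. Uses a standard Slack incoming webhook payload ({ text: ... }).
async function announceUpstreamIncident(
  webhookUrl: string,
  service: string,
  impact: string,
): Promise<void> {
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `${service} is having issues. Impact: ${impact}. Fallback is active; no action needed from us.`,
    }),
  });
}

// Example: roughly what Maria's 11:43 PM Slack message would look like as a one-liner.
// The webhook URL is a placeholder.
announceUpstreamIncident(
  "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
  "OpenAI",
  "summarization degraded, cached results shown",
);
```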

Maria's morning

The next morning, Maria's standup went like this:

"OpenAI had a 90-minute outage last night. Summarization was in fallback mode. Three retry clicks, zero tickets. No action items."

The whole update took 15 seconds. Her manager nodded. The team moved on to sprint work.

That's it. That's the whole story.

And honestly? That's the point. The best on-call story is the one where nothing interesting happened. Where the alert fired, the fallback kicked in, someone glanced at their phone, and went back to sleep.

Good infrastructure is invisible. Good monitoring is boring. And the best incident is the one your users never even noticed.