
The Five Minutes That Cost $12,000

A realistic walkthrough of how a Stripe outage cascaded through a SaaS checkout flow, and why the first five minutes decide everything.

By CheckUpstream Team


It was a Tuesday. 2:47 PM Eastern. The kind of afternoon where nothing interesting is supposed to happen.

A backend engineer on the payments team noticed something odd in the logs. Stripe API calls were timing out. Not all of them. Maybe one in five. Enough to make the error rate dashboard twitch, but not enough to trigger the alert threshold, which was set at a 10% error rate over a rolling 5-minute window. A fresh spike to 20% failures, averaged against the mostly healthy minutes that preceded it, still sits comfortably under that line.
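Here's a minimal sketch of why a rolling window behaves that way; the class name, window size, and numbers are illustrative, not the team's actual alerting code.

```python
from collections import deque
import time


class RollingErrorRate:
    """Track request outcomes over a trailing time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, is_error: bool, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, is_error))
        self._evict(now)

    def rate(self, now: float | None = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()


# A one-minute burst of 20% failures barely nudges a 5-minute average
# that is still dominated by the healthy minutes before it.
tracker = RollingErrorRate()
for i in range(240):                       # ~4 minutes of healthy traffic
    tracker.record(is_error=False, now=float(i))
for i in range(240, 300):                  # 1 minute at a 20% failure rate
    tracker.record(is_error=(i % 5 == 0), now=float(i))
print(f"windowed error rate: {tracker.rate(now=300.0):.1%}")  # ~4%, under the 10% alert line
```

The flip side is that once the window does fill with failures it is equally slow to recover, which is why threshold tuning is always a trade-off between noise and latency.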

By 2:49 PM, the timeout rate had climbed to 40%. Checkout was broken for nearly half the users trying to pay. Support tickets started landing. "I keep getting an error when I try to upgrade." "Payment failed, but my card is fine." "Is your site down?"

By 2:52 PM, the on-call engineer was paged. She opened the Stripe status page. It said "All Systems Operational." Green across the board. Status pages are structurally slow to update, but she didn't know that yet.

She spent the next three minutes checking the deployment history, rolling back a config change that had nothing to do with the problem, and wondering if the database migration from that morning had somehow corrupted something.

At 2:55 PM, Stripe updated their status page to "Investigating increased API error rates."

By then, 847 checkout attempts had failed. The company's average cart value was $14.20. That's $12,027 in abandoned revenue, and that's the conservative estimate because it doesn't count the customers who just left and never came back. One failure rippling through a checkout flow is bad enough; when multiple services share the same failure path, the math gets much worse.

The eight-minute gap

Here's what makes this sting. Stripe knew something was wrong internally well before 2:55 PM. Every major service provider has internal monitoring that catches problems minutes before they update the public status page. That's not malice. It's process. Someone has to verify the issue, classify its severity, draft the status update, and get it approved.

That process takes time. Usually somewhere between 5 and 15 minutes. Sometimes longer.

For the service provider, that's a reasonable workflow. For you, sitting downstream, those minutes are the most expensive minutes of your quarter.

The timeline nobody talks about

When people discuss incident response, they usually focus on MTTR (mean time to resolution). But there's a metric that matters just as much and gets almost no attention: time to correct attribution.

That's the time between "something is broken" and "we know it's not our fault."

In our Stripe example, the on-call engineer spent 6 minutes investigating her own systems. She checked recent deploys. She checked the database. She checked the CDN config. All of that was wasted effort because the problem was never in her stack.

This is what happens without upstream monitoring. Your incident response process assumes the problem is yours until proven otherwise. And proving otherwise takes time because you have to eliminate every internal cause before you start looking externally.

With upstream monitoring, the timeline collapses. At 2:47 PM, you get an alert: "Stripe API reporting degraded performance." You skip the internal investigation entirely. You switch to your fallback payment flow, or you put up a banner telling users to try again in a few minutes, or you queue the charges for retry. Whatever your playbook says. You're executing the right playbook from minute one instead of minute eight.
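Here's what that branch can look like in code: a minimal sketch, assuming a hypothetical upstream_degraded flag that the alert flips, a retry queue, and a generic payment-processor interface. None of it is Stripe-specific API; it just shows where the decision lives.

```python
import queue
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ChargeRequest:
    customer_id: str
    amount_cents: int
    currency: str = "usd"


class PaymentProcessor(Protocol):
    def charge(self, request: ChargeRequest) -> str: ...


# Hypothetical flag flipped by the upstream-monitoring alert
# (manually by on-call, or by automation hooked to the alert webhook).
upstream_degraded: dict[str, bool] = {"stripe": False}

retry_queue: "queue.Queue[ChargeRequest]" = queue.Queue()


def take_payment(request: ChargeRequest, primary: PaymentProcessor) -> str | None:
    """Charge through the primary processor unless it is flagged as degraded.

    Returns a charge id on success, or None if the charge was queued for retry.
    """
    if upstream_degraded["stripe"]:
        # Playbook branch: don't burn the user's patience on a call that
        # will probably time out; queue it and tell them you'll retry.
        retry_queue.put(request)
        return None
    try:
        return primary.charge(request)
    except TimeoutError:
        # A timeout here is also a signal worth feeding back into monitoring.
        retry_queue.put(request)
        return None
```

Whether the degraded branch queues the charge, fails over to a second processor, or just shows a "try again shortly" banner is a product call; the point is that the branch exists and the alert can flip it at minute one instead of minute eight.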

"But we check the status page"

I hear this a lot. Teams tell me they have a step in their incident runbook that says "check upstream status pages." And they do. Eventually.

The problem is that it's usually step 5 or step 6 in the runbook, after "check recent deploys," "check error logs," "check database health," and "check CDN." By the time someone gets to "check if Stripe is down," they've already spent 10 minutes on the wrong investigation.

And even when they do check, the status page might still say everything is fine because of that 5-to-15-minute lag.

Automated upstream monitoring flips the runbook. Instead of checking status pages as a late step in your investigation, you get proactive alerts the moment something changes. The information comes to you instead of you going to find it.
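As a concrete (if simplified) example of "the information comes to you": a small watcher that polls a provider's public status feed and raises a message whenever the indicator changes. The URL is a placeholder, and the /api/v2/status.json shape assumes a Statuspage-style feed; whether a given provider exposes one, and at what address, is something to confirm per provider.

```python
import json
import time
import urllib.request

# Assumed Statuspage-style endpoint; verify the real URL for each provider.
STATUS_URL = "https://status.example-provider.com/api/v2/status.json"


def fetch_indicator(url: str) -> str:
    """Return the current status indicator, e.g. 'none', 'minor', 'major', 'critical'."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload["status"]["indicator"]


def watch(url: str, interval_seconds: int = 60) -> None:
    """Alert (here: print) whenever the provider's status indicator changes."""
    last = None
    while True:
        try:
            current = fetch_indicator(url)
        except Exception as exc:  # network hiccups shouldn't kill the watcher
            print(f"status check failed: {exc}")
        else:
            if last is not None and current != last:
                print(f"upstream status changed: {last} -> {current}")
            last = current
        time.sleep(interval_seconds)
```

Note that a poller like this still inherits the 5-to-15-minute publication lag described above; getting the alert at 2:47 rather than 2:55 means also probing the APIs you actually call, not just the page the provider chooses to publish.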

What $12,000 buys you

After that Tuesday, the payments team did the math. They calculated the revenue impact per minute of checkout downtime. For their traffic volume, it worked out to about $2,400 per minute during peak hours.

They also looked at how long the typical upstream incident lasted. Across Stripe, their payment gateway, their email provider, and their auth service, the median incident over the past year lasted 23 minutes. But the median time to correct attribution (figuring out it wasn't their problem) was 11 minutes.

Cutting that attribution time from 11 minutes to under 1 minute was worth, conservatively, $24,000 per incident in reduced impact. They had about six upstream incidents per year that affected checkout.
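Spelled out, using only the numbers above (the annual figure is just those numbers multiplied through):

```python
# Back-of-envelope incident cost model, using the figures from this post.
revenue_per_minute = 2_400        # $ lost per minute of broken checkout (peak hours)
attribution_before = 11           # minutes spent proving "it's not us"
attribution_after = 1             # minutes with an upstream alert in hand
incidents_per_year = 6            # upstream incidents that hit checkout

savings_per_incident = (attribution_before - attribution_after) * revenue_per_minute
annual_savings = savings_per_incident * incidents_per_year

print(f"per incident: ${savings_per_incident:,}")   # $24,000
print(f"per year:     ${annual_savings:,}")          # $144,000
```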

The math was not complicated.

The fix is boring

This isn't a story about brilliant engineering or a clever architecture decision. The fix is almost disappointingly simple: monitor the services you depend on, and get alerted when they have problems.

That's it. No fancy failover systems. No multi-region redundancy. Just knowing what's happening upstream before your users tell you about it.

Connect your repos. Map your dependencies to their upstream status pages. Get alerts on Slack or PagerDuty, wherever your on-call engineer is already looking. The whole setup takes less time than reading this post.
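If Slack is where your on-call engineer is already looking, the delivery end of that pipeline can be as small as an incoming-webhook post. A minimal sketch, with a placeholder webhook URL standing in for whatever tool actually generates the alert:

```python
import json
import urllib.request

# Placeholder: replace with your Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def notify_slack(message: str) -> None:
    """Post a plain-text alert to a Slack channel via an incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


notify_slack(":rotating_light: Stripe API reporting degraded performance (2:47 PM ET)")
```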

The next time Stripe has a bad Tuesday, you'll know at 2:47, not 2:55. And those eight minutes? They're worth a lot more than you think.