Introducing CheckUpstream.
We got tired of finding out about outages from our users. So we built a tool that tells us first.
By CheckUpstream Team
It was 2 AM on a Tuesday. Our on-call engineer got paged because checkout was failing for about 30% of users. She spent 45 minutes digging through logs, checking recent deploys, reverting a feature flag that looked suspicious. Nothing helped. Eventually someone on the team checked Stripe's status page. Elevated error rates on the Payments API, first reported an hour ago.
We didn't have a bug. We had an upstream outage. And we wasted nearly an hour troubleshooting our own code before we thought to look outside our walls.
That was the moment CheckUpstream started.
We kept making the same mistake
Here's what surprised us: this wasn't a one-off. It happened with Stripe. It happened with OpenAI when they had a bad week of rate-limiting issues. It happened with Vercel during a DNS propagation problem that made our app look broken in certain regions. Every single time, the pattern was identical. Users notice, tickets pile up, engineers investigate, and eventually someone thinks to check the provider's status page.
The gap between "something is wrong" and "oh, it's not us" was always too long. Thirty minutes, sometimes an hour. We later did the math on what those minutes actually cost: an hour of engineers chasing ghosts in their own codebase while customers lose trust.
We started keeping a browser tab open to Stripe's status page. Then one for AWS. Then OpenAI. Then Vercel, Resend, Clerk, Turso. At some point we had eight status page tabs pinned in Chrome, and nobody was actually checking them.
Why nothing else worked
We tried the obvious things. RSS feeds from status pages piped into Slack. Nobody read those channels. StatusCake and similar uptime tools, but those monitor your site, not the services you depend on. We even wrote a janky cron job that scraped a few status pages and posted to a webhook. It broke constantly because status pages change their markup without warning.
The real problem was deeper than monitoring, though. We didn't even have a complete picture of what we depended on. Sure, we knew about Stripe and AWS. But what about the transitive dependencies? The payment processor behind our subscription library. The CDN that our font provider uses. The DNS service that half our stack quietly relies on. You can't monitor what you don't know about.
We needed something that could answer two questions: "What do we actually depend on?" and "Is any of it broken right now?"
What we built
CheckUpstream starts by reading your project. Connect a repo and it parses your package.json, requirements.txt, go.mod, or whatever dependency file your stack uses. It maps each package to the upstream services behind it: the stripe npm package maps to Stripe's API, @aws-sdk/client-s3 to AWS S3, openai to OpenAI's API. You get a dependency graph of every third-party service your application touches, without manually listing anything.
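The core idea can be sketched in a few lines. This is an illustrative toy, not CheckUpstream's actual mapping dataset or code; the three entries below are assumptions based on the examples above.

```typescript
// Hypothetical sketch: map package names from a parsed manifest to the
// upstream services behind them. The real mapping dataset is far larger.
type Service = { provider: string; statusPage: string };

const PROVIDER_MAP: Record<string, Service> = {
  "stripe": { provider: "Stripe API", statusPage: "https://status.stripe.com" },
  "@aws-sdk/client-s3": { provider: "AWS S3", statusPage: "https://health.aws.amazon.com" },
  "openai": { provider: "OpenAI API", statusPage: "https://status.openai.com" },
};

// Given the dependencies section of a package.json, return the upstream
// services the project touches. Unknown packages are simply skipped.
function mapDependencies(deps: Record<string, string>): Service[] {
  return Object.keys(deps)
    .filter((name) => name in PROVIDER_MAP)
    .map((name) => PROVIDER_MAP[name]);
}
```

Feeding it `{ "stripe": "^14.0.0", "lodash": "^4.17.21" }` would yield only the Stripe entry, since lodash has no upstream service behind it.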
Then we monitor. Every five minutes, we check the status of every service in your graph. Not just the top-level status page either. We track individual components, so if AWS has an S3 problem but EC2 is fine, you only get alerted about S3.
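The component-level filtering described above might look roughly like this, assuming a provider status feed shaped like the common Statuspage JSON format (a list of components, each with a name and a status string). The shapes and names here are illustrative assumptions.

```typescript
// A single component from a provider's status feed, e.g. "S3" or "EC2".
type Component = {
  name: string;
  status: "operational" | "degraded_performance" | "partial_outage" | "major_outage";
};

// Only surface components that (a) your dependency graph actually uses and
// (b) are not healthy, so an EC2-only incident never pages a team that
// only depends on S3.
function relevantIncidents(
  components: Component[],
  dependsOn: Set<string>
): Component[] {
  return components.filter(
    (c) => dependsOn.has(c.name) && c.status !== "operational"
  );
}
```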
But status pages are slow. Providers sometimes take 20 to 30 minutes to acknowledge an incident. So we also built community detection. When multiple CheckUpstream users start seeing elevated error rates from the same provider at the same time, we flag it. You hear about problems before the provider's own status page updates.
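The community-detection idea reduces to counting distinct accounts reporting trouble with the same provider inside a short window. The threshold and window below are made-up values for illustration, not CheckUpstream's real tuning.

```typescript
// One elevated-error-rate report from one user's account.
type ErrorReport = { accountId: string; provider: string; at: number }; // at = epoch ms

// Flag a probable incident when enough distinct accounts report the same
// provider within the window. Duplicate reports from one account count once.
function probableIncident(
  reports: ErrorReport[],
  provider: string,
  now: number,
  windowMs = 5 * 60_000,
  minAccounts = 3
): boolean {
  const recent = reports.filter(
    (r) => r.provider === provider && now - r.at <= windowMs
  );
  const accounts = new Set(recent.map((r) => r.accountId));
  return accounts.size >= minAccounts;
}
```

Counting distinct accounts rather than raw reports matters: one customer in a retry loop should not be able to trip the detector on its own.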
For teams that want deeper visibility, our SDK drops into your application in a few lines of code. It measures real latency and error rates against your upstream APIs from production. Not synthetic checks from a data center in Virginia. Actual measurements from your actual traffic.
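In spirit, an SDK like this wraps outbound calls so every request to an upstream API records its latency and whether it failed. This is a minimal sketch of the idea, not CheckUpstream's actual SDK surface; `measuredFetch` and `errorRate` are hypothetical names.

```typescript
// One measurement taken from real production traffic.
type Sample = { host: string; ms: number; ok: boolean };
const samples: Sample[] = [];

// Wrap fetch so each upstream call records latency and success/failure.
async function measuredFetch(url: string, init?: RequestInit): Promise<Response> {
  const start = Date.now();
  try {
    const res = await fetch(url, init);
    samples.push({ host: new URL(url).host, ms: Date.now() - start, ok: res.ok });
    return res;
  } catch (err) {
    // Network-level failures (DNS, timeouts) count as errors too.
    samples.push({ host: new URL(url).host, ms: Date.now() - start, ok: false });
    throw err;
  }
}

// Fraction of failed calls to a given upstream host.
function errorRate(host: string): number {
  const h = samples.filter((s) => s.host === host);
  return h.length === 0 ? 0 : h.filter((s) => !s.ok).length / h.length;
}
```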
What it feels like
The best way we can describe it: CheckUpstream turns a panic into a shrug.
Before, an upstream outage meant confusion, then investigation, then that sinking realization that you can't fix it because it's not your problem. Now it's a Slack message that says "Stripe Payments API is degraded, 12 of your endpoints depend on this service, here's what your users will experience."
You already know the blast radius before your users feel it. You can post a status update within minutes, not after an hour of debugging. Your on-call engineer doesn't have to dig through dashboards at 2 AM wondering if the last deploy broke something.
The alerts are deliberately opinionated. We don't send you every minor status page update. We filter for things that actually affect your stack. If AWS has an issue in ap-southeast-2 and all your infra runs in us-east-1, you won't hear about it. We've been on the receiving end of noisy alerts and we refuse to build another tool that trains people to ignore notifications.
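The region filter above amounts to a simple overlap check. A minimal sketch, assuming incidents carry an optional region tag (names follow AWS conventions; the Incident shape is illustrative):

```typescript
// An incident as it might arrive from a provider's status feed. A missing
// region means the provider declared it globally or didn't scope it.
type Incident = { provider: string; region?: string; summary: string };

// Alert only when the incident's region overlaps where your infra runs.
// Unscoped incidents always alert, since they could affect anyone.
function shouldAlert(incident: Incident, myRegions: Set<string>): boolean {
  if (!incident.region) return true;
  return myRegions.has(incident.region);
}
```

Defaulting unscoped incidents to "alert" is the conservative choice: a filter that drops a global outage is worse than an occasional extra notification.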
Where we're going
Right now, CheckUpstream tells you when something is broken and what it affects. That's the foundation. But we want to go further.
We're working on predictive signals, things like gradual latency increases that suggest a provider is struggling before they declare an incident. We're building deeper manifest parsing so we can trace dependencies through your entire stack, not just the top-level packages. And we're expanding our provider coverage, because the long tail of SaaS services is enormous and growing.
We're also building this in the open. Our provider mappings, the dataset that connects npm packages to upstream services, will be open source. If you know that @acme/widget depends on a specific API that we haven't mapped yet, you'll be able to contribute that knowledge and everyone benefits.
We built CheckUpstream because we were tired of being the last to know. If you've ever spent an hour debugging a problem that turned out to be someone else's outage, this is for you.
Sign up for free and connect your first repository. It takes about two minutes.