Anatomy of a Cascading Failure
How one DNS provider going down took out authentication, payments, email, and error tracking at the same time. A technical walkthrough of dependency chains.
By CheckUpstream Team
In July 2024, a mid-sized SaaS company had what they initially reported as a "platform-wide outage." Their status page showed every single component red. Dashboard, API, webhooks, authentication, billing, email notifications, even their docs site. All of it, down at once.
Their incident post-mortem told a more interesting story. The root cause wasn't in their infrastructure at all. It was a 22-minute outage at their DNS provider. But the blast radius was total because their entire stack resolved through the same DNS. Auth, payments, email, monitoring. Every outbound connection their system made went through one DNS resolution path, and when that path broke, everything broke with it.
This is what a cascading failure looks like. Not a single dramatic explosion, but a quiet chain reaction where one failure causes another, which causes another, until your entire system is offline and nobody can figure out why because the root cause is three layers removed from the symptoms.
The dependency graph you didn't draw
Every application has two architectures. There's the architecture you drew on a whiteboard, with neat boxes and arrows showing how your services talk to each other. And then there's the real architecture, which includes every external service, every DNS lookup, every certificate authority, every CDN edge node, and every third-party API your code touches at runtime.
The whiteboard version has maybe 5 to 10 boxes. The real version has 40+. And the connections between those boxes create failure paths that aren't obvious until something breaks.
Let me draw a realistic dependency graph for a typical SaaS application:
```
Your App
├── Auth (Clerk)
│   ├── Clerk API → Clerk's infrastructure
│   └── DNS resolution → Your DNS provider
├── Payments (Stripe)
│   ├── Stripe API → Stripe's infrastructure
│   ├── Stripe webhooks → Your webhook endpoint
│   └── DNS resolution → Your DNS provider
├── Database (PlanetScale)
│   ├── PlanetScale connection → PlanetScale's infrastructure
│   └── DNS resolution → Your DNS provider
├── Email (Resend)
│   ├── Resend API → Resend's infrastructure
│   └── DNS resolution → Your DNS provider
├── Error Tracking (Sentry)
│   ├── Sentry SDK → Sentry's infrastructure
│   └── DNS resolution → Your DNS provider
├── Hosting (Vercel)
│   ├── Edge network → Vercel's infrastructure
│   ├── SSL certificates → Certificate authority
│   └── DNS resolution → Your DNS provider
└── CDN (Cloudflare)
    ├── Edge caching → Cloudflare's infrastructure
    └── DNS resolution → (Cloudflare IS the DNS provider)
```
See the pattern? Every single branch goes through DNS resolution. If your DNS provider is separate from your CDN (and for many teams it is), then DNS is a single point of failure that connects to literally everything.
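You can make that shared path observable directly. Here's a minimal sketch in TypeScript (assuming Node 18+ and its built-in dns module) that resolves a handful of critical hostnames and flags correlated failures; the host list is illustrative, not prescriptive:

```typescript
// dns-probe.ts: resolve every critical hostname the stack touches and
// report failures. A burst of simultaneous failures points at the shared
// DNS path rather than at the individual providers.
import { resolve4 } from "node:dns/promises";

// Illustrative list; substitute the hostnames your own stack resolves.
const CRITICAL_HOSTS = [
  "api.stripe.com",
  "api.clerk.com",
  "api.resend.com",
  "sentry.io",
];

async function probe(host: string) {
  try {
    // resolve4 queries the resolver directly, bypassing the OS lookup cache.
    await resolve4(host);
    return { host, ok: true as const };
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code ?? "UNKNOWN";
    return { host, ok: false as const, code };
  }
}

async function main() {
  const results = await Promise.all(CRITICAL_HOSTS.map(probe));
  const failures = results.filter((r) => !r.ok);
  if (failures.length >= results.length / 2) {
    // Correlated failures: suspect the shared DNS path, not the providers.
    console.error("Likely resolver incident:", failures);
  } else if (failures.length > 0) {
    console.warn("Isolated resolution failures:", failures);
  } else {
    console.log("All critical hostnames resolved.");
  }
}

main();
```

Run it from the same network your servers use; a probe from your laptop exercises a different resolver path and will happily tell you everything is fine.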
How cascades actually propagate
The tricky thing about cascading failures is that they don't happen all at once. They unfold over seconds and minutes in a specific sequence, and understanding that sequence is key to catching them early.
Here's how the DNS outage I mentioned earlier actually played out:
Second 0: DNS provider starts returning SERVFAIL for some queries.
Seconds 1 through 5: Your application tries to resolve api.stripe.com. The DNS lookup fails. The Stripe SDK throws a connection error. But you have retry logic, so it tries again. The retry also fails, and in your logs both attempts look like generic connection errors (the sketch after this timeline shows how to make them legible as DNS failures).
Seconds 5 through 15: Authentication checks start failing too, because api.clerk.com also won't resolve. Now new user sessions can't be created. Existing sessions might still work if they're cached, but that depends on your session architecture.
Seconds 15 through 30: Your error tracking stops working because o123456.ingest.sentry.io can't be resolved either. This is the really dangerous moment. Your monitoring just went blind at the exact moment you need it most. Errors are happening, but Sentry can't receive them.
Seconds 30 through 60: Database connections might still be alive if they were established before the DNS outage (TCP connections don't need DNS once they're open). But if any connection in your pool drops and needs to reconnect, the new connection will fail DNS resolution.
Minutes 1 through 5: Your health check endpoint starts failing because it tries to verify connectivity to downstream services. Your load balancer sees unhealthy instances and starts routing traffic to fewer servers. If enough instances fail health checks, you get a capacity crunch on top of everything else.
Minutes 5 through 10: Queued jobs start piling up. Background workers that send emails, process webhooks, or sync data are all failing because every outbound connection requires DNS. Your queue depth grows rapidly.
One failure. Six downstream effects. Each one compounding the others.
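The most useful instrumentation change for surviving this sequence is making DNS failures legible as DNS failures. Here's a hedged sketch, assuming Node 18+, whose built-in fetch attaches the resolver's error code (ENOTFOUND, EAI_AGAIN) to the thrown error's cause; the function names are our own, not a library API:

```typescript
// A sketch: classify outbound failures before retrying, so "dns" failures
// across several providers read as one resolver incident, not several
// unrelated outages.
type FailureKind = "ok" | "http" | "dns" | "network";

// Node's resolver surfaces these codes when a hostname won't resolve.
const DNS_CODES = new Set(["ENOTFOUND", "EAI_AGAIN"]);

async function classifiedFetch(url: string): Promise<{ kind: FailureKind; detail: string }> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
    return { kind: res.ok ? "ok" : "http", detail: String(res.status) };
  } catch (err) {
    // Node 18+ fetch wraps the underlying socket error in `cause`.
    const code = (err as { cause?: { code?: string } }).cause?.code ?? "unknown";
    return { kind: DNS_CODES.has(code) ? "dns" : "network", detail: code };
  }
}

// Retry with backoff, but log the failure kind on every attempt. Three
// "dns" entries for three different hosts is your cascade signature.
async function callWithRetry(url: string, attempts = 3): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const result = await classifiedFetch(url);
    if (result.kind === "ok") return true;
    console.warn(`attempt ${i + 1}: ${result.kind} (${result.detail}) for ${url}`);
    await new Promise((r) => setTimeout(r, 2 ** i * 500)); // exponential backoff
  }
  return false;
}
```

With this in place, the second-0 SERVFAIL shows up as a wall of "dns" log entries rather than a wall of generic connection errors, and the diagnosis largely writes itself.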
The hidden connectors
DNS is the most dramatic example, but there are other shared dependencies that create cascade paths. Here are the ones we see most often:
Certificate authorities. Your HTTPS connections depend on valid SSL certificates, and validating them can involve reaching OCSP responders and CRL endpoints. If a certificate authority has problems, TLS handshakes can start failing. This is rare, but when it happens it's brutal, because any HTTPS connection that needs the revocation check is affected.
Cloud provider regions. If your app, your database, your cache, and your queue are all in AWS us-east-1, then a regional outage takes out everything simultaneously. This happened in December 2021, and it was a wake-up call for a lot of teams.
Shared authentication. If you use a single SSO provider for everything (your cloud console, your monitoring dashboard, your CI/CD pipeline, your deployment tool), then an SSO outage locks you out of all of your operational tools at the exact moment you need them to respond to an incident. We've seen teams unable to log into their own AWS console during an outage because their SSO provider was also affected.
CDN as a gateway. If all your traffic routes through a single CDN or reverse proxy, that's a choke point. Cloudflare's June 2022 outage (a network configuration change that took down 19 of its busiest data centers) knocked out a significant share of the internet, including many tools that teams needed to diagnose why their sites were down.
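One cheap way to find these hidden connectors is to look at where your dependencies' hostnames actually resolve. Here's a sketch (Node's dns module again; the host list and suffix list are illustrative): pull each hostname's CNAME records and group by shared infrastructure suffix. Two providers that both CNAME into the same CDN share its fate.

```typescript
// A sketch: group dependencies by where their hostnames resolve.
// Shared CNAME targets (same CDN, same cloud) mean shared fate.
import { resolveCname } from "node:dns/promises";

const HOSTS = ["api.stripe.com", "api.clerk.com", "api.resend.com"]; // illustrative

// Infrastructure suffixes worth grouping by (not exhaustive).
const SUFFIXES = ["cloudfront.net", "fastly.net", "cloudflare.net", "amazonaws.com"];

async function cnames(host: string): Promise<string[]> {
  try {
    return await resolveCname(host);
  } catch {
    return []; // many hosts resolve straight to A records; nothing to inspect
  }
}

async function main() {
  const groups = new Map<string, string[]>();
  for (const host of HOSTS) {
    for (const target of await cnames(host)) {
      const suffix = SUFFIXES.find((s) => target.endsWith(s));
      if (suffix) groups.set(suffix, [...(groups.get(suffix) ?? []), host]);
    }
  }
  for (const [infra, sharers] of groups) {
    if (sharers.length > 1) console.warn(`Shared fate via ${infra}: ${sharers.join(", ")}`);
  }
}

main();
```

This only sees the first CNAME hop, so treat a clean result as absence of evidence, not evidence of absence.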
Mapping the blast radius
The way to defend against cascading failures isn't to eliminate all external dependencies (that's impossible). It's to understand the blast radius of each dependency and make conscious decisions about which risks you accept.
Start by asking three questions for each external service you depend on:
1. What fails if this service goes down?
Not just the obvious primary function. Think about the secondary effects. If your email provider goes down, you can't send transactional emails. But you also can't send password reset links, which means users who forgot their password are completely locked out.
2. Does this service share infrastructure with other services I depend on?
This is where cascade paths hide. Your monitoring service and your application might both be hosted on AWS. Your auth provider and your payment processor might share a CDN. Look for these overlapping dependency chains.
3. Can I still operate (even in degraded mode) without this service?
Some dependencies are truly critical. You can't process payments without a payment processor. But you can queue them for retry. You can't authenticate new users without your auth provider. But you can serve cached sessions. Understanding what degraded mode looks like for each dependency helps you build resilience without over-engineering.
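Here's what that decision looks like in code. A sketch, not a prescription: chargeCustomer and jobQueue are hypothetical stand-ins for your payment client and your durable queue. The shape is what matters: a provider outage parks the charge for retry instead of failing the user's request.

```typescript
// A sketch of one degraded-mode decision: if the payment provider is
// unreachable, enqueue the charge durably instead of failing the request.
// chargeCustomer and jobQueue are hypothetical stand-ins, not a real API.
declare function chargeCustomer(req: ChargeRequest): Promise<void>;
declare const jobQueue: { enqueue(job: string, payload: unknown): Promise<void> };

interface ChargeRequest {
  customerId: string;
  amountCents: number;
  idempotencyKey: string; // essential for retrying payments safely
}

type ChargeOutcome =
  | { status: "charged" }
  | { status: "queued_for_retry" } // degraded mode: accepted, not yet settled
  | { status: "rejected"; reason: string };

async function chargeOrQueue(req: ChargeRequest): Promise<ChargeOutcome> {
  try {
    await chargeCustomer(req);
    return { status: "charged" };
  } catch (err) {
    if (isProviderUnreachable(err)) {
      // The provider (or the DNS in front of it) is down: park the charge.
      await jobQueue.enqueue("retry-charge", req);
      return { status: "queued_for_retry" };
    }
    // A definitive decline is not an outage; never blind-retry it.
    return { status: "rejected", reason: String(err) };
  }
}

function isProviderUnreachable(err: unknown): boolean {
  const code = (err as { cause?: { code?: string } })?.cause?.code;
  return code === "ENOTFOUND" || code === "EAI_AGAIN" || code === "ETIMEDOUT";
}
```

The idempotency key is the load-bearing detail: without it, a retried charge is a double charge.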
What monitoring catches
Automated upstream monitoring won't prevent cascading failures. Nothing will, short of eliminating external dependencies entirely. But it shrinks the time between "something is wrong" and "we know exactly what's wrong."
When you're monitoring the status of all your upstream dependencies, a cascade becomes obvious almost immediately. Instead of seeing "everything is broken" and starting a panicked investigation, you see "DNS provider is reporting an incident" and immediately understand why auth, payments, and email are all failing at the same time. Of course, that assumes the status page has actually been updated, which is why multi-source monitoring matters.
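Here's a sketch of what multi-source means in practice: cross-check the status page against a direct probe of the provider's API host. Many hosted status pages (Atlassian Statuspage among them) expose a JSON summary at /api/v2/status.json; the URLs below are illustrative placeholders, not documented endpoints for any particular provider.

```typescript
// A sketch of multi-source checking: compare what a provider's status page
// reports against what a direct probe observes. Disagreement usually means
// the status page hasn't been updated yet.

async function statusPageSaysOk(statusUrl: string): Promise<boolean | null> {
  try {
    const res = await fetch(statusUrl, { signal: AbortSignal.timeout(5_000) });
    const body = (await res.json()) as { status?: { indicator?: string } };
    return body.status?.indicator === "none"; // Statuspage: "none" = all operational
  } catch {
    return null; // the status page itself being unreachable is also a signal
  }
}

async function directProbeOk(apiUrl: string): Promise<boolean> {
  try {
    const res = await fetch(apiUrl, { signal: AbortSignal.timeout(5_000) });
    return res.status < 500; // any response at all means DNS, TLS, and routing work
  } catch {
    return false;
  }
}

async function checkProvider(name: string, statusUrl: string, apiUrl: string) {
  const [reported, observed] = await Promise.all([
    statusPageSaysOk(statusUrl),
    directProbeOk(apiUrl),
  ]);
  if (reported !== false && !observed) {
    // The interesting case: we see failures the status page doesn't admit yet.
    console.warn(`${name}: direct probe failing, status page green or unreachable`);
  }
}

// Illustrative usage with placeholder URLs:
checkProvider(
  "example-provider",
  "https://status.example-provider.com/api/v2/status.json",
  "https://api.example-provider.com/",
);
```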
That understanding is worth 10 to 15 minutes of incident response time. And in a cascade, where every minute sees the blast radius expand, those minutes are the difference between a contained incident and a total platform outage.
The dependency graph is already in your codebase. Your package.json is a liability map, and your docker-compose.yml and Terraform modules encode the same implicit information about what your application needs to run. Making that implicit knowledge explicit, and monitoring it continuously, is the simplest form of cascade prevention.
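The first layer of that extraction can be embarrassingly simple. A sketch, assuming a Node project: read package.json and list runtime dependencies separately from build-time ones. It won't surface hostnames (those live in config and code), but it's the first draft of the map.

```typescript
// A sketch: turn package.json into a first-pass liability inventory.
// Cross-reference this list against the external hostnames in your config
// to build the real dependency map.
import { readFileSync } from "node:fs";

interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

const pkg = JSON.parse(readFileSync("package.json", "utf8")) as PackageJson;

// Runtime dependencies can take production down; devDependencies only
// threaten your builds, which matters, but differently.
const runtime = Object.keys(pkg.dependencies ?? {});
const buildTime = Object.keys(pkg.devDependencies ?? {});

console.log(`Runtime dependencies (${runtime.length}):`);
for (const name of runtime) console.log(`  ${name}`);
console.log(`Build-time dependencies (${buildTime.length}):`);
for (const name of buildTime) console.log(`  ${name}`);
```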
You can't stop your DNS provider from having a bad day. But you can find out about it in seconds instead of minutes. And when the cascade starts, seconds are all you've got.