
Why You Should Monitor Your Upstream Dependencies

A practical guide to setting up dependency monitoring: what to watch, what thresholds to set, and how to respond when something goes wrong.

By CheckUpstream Team

It's 3 AM and your on-call phone goes off. Customers can't log in. You check your services, your database, your recent deploys. Everything looks fine. Forty minutes later, someone opens the Clerk status page. "Investigating elevated error rates." Posted 35 minutes ago.

Forty minutes troubleshooting a problem that was never yours to fix.

This happens constantly because most teams have zero automated visibility into the services they depend on. You monitor your own infrastructure obsessively and treat upstream dependencies as permanently available. They aren't.

This post is the practical playbook: how to set up monitoring, pick the right thresholds, and build a response process that works at 3 AM.

Deciding what to monitor

You can't monitor everything with equal urgency. Trying to do so will flood your alert channels and train your team to ignore notifications. Instead, categorize dependencies by what breaks when they break.

Tier 1: Revenue and access. Payment processors, auth providers, primary database, core cloud infrastructure. When these go down, users can't pay or can't log in. Tier 1 gets real-time alerts to your on-call channel.

Tier 2: Visible degradation. Email delivery, AI/ML APIs, search, file storage. Users notice when these break, but they can still use most of your product. Tier 2 gets alerts to a team channel, not the pager.

Tier 3: Internal tooling. Error tracking, analytics, CI/CD, logging. Your smoke detectors. They won't break the user experience, but losing them during an incident makes everything harder. Daily digest.

Tier 4: Can wait. Documentation hosting, marketing integrations, non-critical webhooks. Check weekly. Don't alert.

The key question for each dependency: if this goes down right now, how long until a customer notices? If the answer is "immediately," it's Tier 1. If "they probably won't," it's Tier 3 or 4.
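
The tier map doesn't need to live in anyone's head or on a wiki. Here's a minimal sketch of it as plain config, in Python for concreteness; the service names and routing targets are placeholders for illustration, not a format any tool requires:

```python
# Illustrative tier map: dependency name -> tier, and tier -> where alerts go.
TIERS = {
    "stripe": 1,          # Tier 1: revenue and access
    "clerk": 1,
    "aws-rds": 1,
    "sendgrid": 2,        # Tier 2: visible degradation
    "openai": 2,
    "algolia": 2,
    "sentry": 3,          # Tier 3: internal tooling
    "github-actions": 3,
    "readme": 4,          # Tier 4: can wait
}

ROUTING = {
    1: "slack:#oncall",        # real-time, on-call channel
    2: "slack:#eng-infra",     # team channel, not the pager
    3: "email:daily-digest",   # daily digest
    4: None,                   # checked weekly, never alerts
}
```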

Setting thresholds that actually work

Most teams that do set up monitoring get the thresholds wrong. They either alert on every minor status page blip (causing alert fatigue within a week) or set the bar so high that they miss real incidents.

For Tier 1, alert on any status change. If Stripe moves from "Operational" to "Investigating," you want to know. False positives are cheap compared to missing a real incident. You're not waking someone up. You're giving the on-call engineer context if things escalate.

For Tier 2, alert on "degraded" or worse. Minor investigations that resolve in two minutes aren't worth the interruption.

For Tier 3, alert on "major outage" only. You don't need to know that Sentry has "slightly elevated latency." You need to know when it's fully down, because that's when you lose your safety net.

One more threshold: duration. A blip that lasts 90 seconds is noise. Don't escalate to the pager until an incident has been open for at least 3 minutes. This single rule eliminates most false alarms.
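
Those rules are small enough to write down as code. A rough sketch, assuming a Statuspage-style severity vocabulary (real providers use different labels) and, as a simplification, applying the 3-minute duration gate to every tier rather than only to pager escalation:

```python
from datetime import timedelta

# Status page severities, roughly ordered. Labels vary by provider; these are assumed.
SEVERITY = {"operational": 0, "investigating": 1, "degraded": 2,
            "partial_outage": 3, "major_outage": 4}

# Lowest severity worth an alert, per tier (mirrors the rules above).
ALERT_FLOOR = {1: "investigating", 2: "degraded", 3: "major_outage", 4: None}

MIN_DURATION = timedelta(minutes=3)  # ignore sub-3-minute blips

def should_alert(tier: int, status: str, open_for: timedelta) -> bool:
    """Decide whether an upstream incident is worth an interruption."""
    floor = ALERT_FLOOR.get(tier)
    if floor is None:
        return False                               # Tier 4: never alert
    if SEVERITY.get(status, 0) < SEVERITY[floor]:
        return False                               # below this tier's bar
    return open_for >= MIN_DURATION                # long enough to matter
```

With these rules, should_alert(1, "investigating", timedelta(minutes=5)) fires, while the same status 90 seconds in does not.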

Building your response playbook

Monitoring without a response plan is just a fancy way to watch things break. For each Tier 1 dependency, write a playbook that answers four questions; a sketch of one, encoded as data, follows the list:

1. How do we confirm it's really upstream? Cross-reference the status page with your own error rates. If Stripe says "Investigating" and your checkout errors jumped from 0.1% to 15%, that's confirmation. If your error rate is flat, the issue might not affect your endpoints.

2. What do we tell users? Have it pre-written. Draft the in-app banner, the status page update, and the support macro before you need them. At 3 AM, nobody should be wordsmithing.

3. What's our degraded mode? Define what your app does when that service is unavailable. Queue payments for retry? Serve cached sessions while auth is down? Not every service needs a fallback, but you need to have consciously decided "we accept this risk" rather than discovering your plan is "crash and hope" during a real incident.

4. When do we escalate? Define time-based thresholds. Upstream incident open 15 minutes? Notify engineering leadership. 30 minutes? Post a public status update. 60 minutes? Full incident response. Stripe being down for 15 minutes is a crisis. Your analytics provider being down for 15 minutes is a footnote.
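
One way to keep those answers findable at 3 AM is to encode each Tier 1 playbook as data next to the code. Here's a sketch for a single dependency; the field names, template paths, and the 5% error threshold are placeholders for whatever your own numbers turn out to be:

```python
STRIPE_PLAYBOOK = {
    "confirm": {
        "status_page": "https://status.stripe.com",
        "our_metric": "checkout_error_rate",   # cross-reference against our own errors
        "abnormal_above": 0.05,                 # ~5% errors means it's really hitting us
    },
    "comms": {                                  # pre-written, never drafted mid-incident
        "in_app_banner": "templates/stripe_banner.md",
        "status_update": "templates/stripe_status.md",
        "support_macro": "macros/payments_degraded",
    },
    "degraded_mode": "queue payments for retry; keep checkout UI responsive",
    "escalation_minutes": {                     # time-based ladder from question 4
        15: "notify engineering leadership",
        30: "post public status update",
        60: "full incident response",
    },
}
```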

The tooling layer

You can build this yourself. Poll status pages, parse RSS feeds, wire up a cron job and a webhook. It works until nobody remembers to add monitoring for the AI provider someone integrated last sprint.
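
For reference, the DIY poller really is only a few lines when the provider exposes an Atlassian Statuspage-style JSON endpoint (GitHub's does; plenty of others don't, which is where the trouble starts):

```python
import json
import urllib.request

# One entry per dependency -- remembering to add the next one is the hard part.
STATUS_URLS = {
    "github": "https://www.githubstatus.com/api/v2/status.json",
}

def poll(url: str) -> str:
    """Return the page's overall indicator: none, minor, major, or critical."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["status"]["indicator"]

if __name__ == "__main__":
    for name, url in STATUS_URLS.items():
        indicator = poll(url)
        if indicator != "none":
            print(f"{name}: {indicator}")  # swap this print for your webhook or Slack call
```

Run it from cron every minute and you have the manual approach, along with the maintenance burden described next.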

The manual approach breaks in two ways. Someone has to maintain the list of dependencies and status page URLs (it gets stale fast), and status page formats vary wildly: Atlassian Statuspage, Instatus, custom pages, JSON APIs. Parsing all of them is more work than teams expect.

CheckUpstream takes a different approach: it reads your dependency files directly. Point it at a repo and it scans package.json, requirements.txt, go.mod, Cargo.toml, Terraform configs, and dozens of other formats. Each package that maps to a known service gets linked to that service's status feed. New dependency added in a PR? Monitoring picks it up automatically. Alerts route to Slack, Discord, PagerDuty, or email, configured per tier.

The tool matters less than the process, though. If you're doing zero upstream monitoring today, even a shared bookmark folder of status pages and a team agreement to check them first during incidents is a real improvement. Start there. Automate later.

The payoff

Teams that monitor upstream dependencies report the same thing: incident response time drops by 10 to 15 minutes. Not because incidents resolve faster, but because engineers stop wasting time investigating problems that aren't theirs.

Those minutes buy you three things. Faster communication to customers, which builds trust even when things are broken. Less toil chasing ghosts through log files. And better incident data, because cleanly attributing downtime to an upstream cause means your reliability metrics reflect reality.

The dependencies are already in your codebase. The status pages already exist. The only missing piece is the connection between them. Set that up, write the playbooks, and the next time something breaks at 3 AM, you'll know in seconds whether it's your problem or someone else's.