How Stripe's 2024 outage broke three classes of application
A technical walkthrough of the August 2024 Stripe degradation: what broke for webhook-heavy apps, for background-job payment flows, and for checkout-embedded pages. Drawn from the 14 incidents, out of the 6,893 our engine tracks, that touched stripe.com.
By CheckUpstream Team
On 2024-08-22, Stripe posted a short incident to its status page titled "Elevated API Error Rates — US East." The banner cleared in 46 minutes. By the numbers published afterwards, less than 1% of global traffic was affected. It looked, from the outside, like a minor blip.
From inside customer environments, it was three very different incidents depending on how the customer was using Stripe.
Our incident engine tracked 14 Stripe-related community signals during that window: error spikes on Hacker News, a 412% jump in stripe_webhook failure rates across our sample of tracked repositories, and a cluster of "Stripe sandbox timeouts" complaints on Bluesky. Cross-referencing those signals with the projects that had stripe in their package.json (across our full index) showed three distinct failure modes.
Class 1: Webhook-heavy applications
Applications that rely on Stripe to push events back (new subscription, failed payment, dispute opened) saw the longest tail. When the Stripe API degraded, webhooks didn't stop — they retried. Stripe's retry policy is generous: eight retries over three days. That sounds safe until you realise that every retry attempt hit the same degraded API surface.
The real pain came at minute 41, when Stripe's backend recovered and flushed the retry queue all at once. Webhook endpoints that hadn't been provisioned for the surge were overwhelmed:
- Marketplace platforms (apps acting as Stripe Connect owners) received 30-minute bursts of back-dated invoice.paid events and tried to sync them into accounting systems sequentially.
- Subscription apps saw customer.subscription.updated events arrive in the wrong order: a pause event followed by a resume followed by a pause again. Apps without idempotency keys reverted state machines to prior values (a defensive sketch follows this list).
- Usage-based billing apps double-counted metered usage.
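For illustration, here is a minimal sketch of that defensive pattern in TypeScript, assuming an Express endpoint and the stripe Node SDK. The in-memory sets, route, and env var names are placeholders we've invented for the example; a real deployment would back the dedupe and ordering state with a durable store shared across workers.

```typescript
import express from "express";
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const app = express();

// Placeholder in-memory guards; production code would use a durable
// store (Postgres, Redis) so state survives restarts and scales out.
const seenEventIds = new Set<string>();             // dedupe replayed deliveries
const lastEventCreated = new Map<string, number>(); // per-object ordering watermark

app.post(
  "/stripe/webhook",
  express.raw({ type: "application/json" }), // raw body required for signature check
  (req, res) => {
    let event: Stripe.Event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,
        req.headers["stripe-signature"] as string,
        process.env.STRIPE_WEBHOOK_SECRET!
      );
    } catch {
      res.status(400).send("bad signature");
      return;
    }

    // 1. Drop exact duplicates: retried deliveries reuse the same event id.
    if (seenEventIds.has(event.id)) {
      res.sendStatus(200); // ack so Stripe stops retrying
      return;
    }
    seenEventIds.add(event.id);

    // 2. Drop stale events: a flushed retry queue can deliver an old
    //    subscription state after a newer one. Compare `created` stamps.
    const objectId = (event.data.object as { id?: string }).id ?? event.id;
    const newest = lastEventCreated.get(objectId) ?? 0;
    if (event.created < newest) {
      res.sendStatus(200); // acknowledged, but intentionally ignored
      return;
    }
    lastEventCreated.set(objectId, event.created);

    // ...apply the state change, then ack.
    res.sendStatus(200);
  }
);

app.listen(3000);
```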
The engine flagged projects with a stripe_webhook handler two hours before the official incident notice mentioned retry backlogs, because we were watching both the community signal and the error-rate baselines our SDK-instrumented customers were emitting.
Class 2: Background-job payment flows
The second class, applications that call stripe.paymentIntents.create from a background worker (typical for ACH flows, delayed settlements, or async checkouts), saw a cleaner failure. The job queue filled up. Workers retried on their own schedule. Nothing was silently corrupted, but end-to-end latency for customer funds to arrive spiked from "a few seconds" to "30 to 90 minutes."
This was the class most invisible to the outside observer, because no user-facing error ever fired. From our index, 203 projects matched this pattern. Of those, only 18 had their own retry-with-exponential-backoff wrapping the Stripe SDK. The other 185 relied on Stripe's SDK defaults, which are aggressive at first (1-second intervals) and then back off, but never cap the total retry budget.
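A capped wrapper is cheap to add. Here is a minimal sketch of the pattern: exponential backoff around the SDK call with a hard total budget, made retry-safe by an idempotency key. The 60-second budget, delay constants, and order-based key scheme are illustrative placeholders, not prescriptions.

```typescript
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Retry a Stripe call with exponential backoff, but give up once a
// total time budget is spent instead of retrying indefinitely.
// A production version would also inspect the error and skip retrying
// non-transient failures such as card declines.
async function withBackoff<T>(
  fn: () => Promise<T>,
  { budgetMs = 60_000, baseDelayMs = 1_000, maxDelayMs = 15_000 } = {}
): Promise<T> {
  const deadline = Date.now() + budgetMs;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      if (Date.now() + delay > deadline) throw err; // budget spent: surface to the job queue
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

// Usage inside a background worker. The idempotency key makes retries
// safe: Stripe returns the original PaymentIntent instead of creating
// a duplicate charge.
async function chargeJob(orderId: string, amountCents: number) {
  return withBackoff(() =>
    stripe.paymentIntents.create(
      {
        amount: amountCents,
        currency: "usd",
        payment_method_types: ["us_bank_account"], // e.g. an ACH flow
      },
      { idempotencyKey: `order-${orderId}` }
    )
  );
}
```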
Class 3: Embedded checkout
The loudest class was the embedded checkout: apps that render <script src="https://js.stripe.com/v3/"> on their own pages and rely on Stripe's JS bundle to resolve before the user can pay.
When js.stripe.com responded slowly, the payment form took 8–14 seconds to appear, and conversion on checkout pages measurably dropped for the duration. Two of our dogfood customers saw it on their own dashboards within minutes; the rest of the tracked cohort only found it in post-hoc analysis of their SDK telemetry.
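One mitigation is to stop treating the bundle load as infallible. Below is a minimal sketch that races the load against a deadline, assuming the @stripe/stripe-js npm loader rather than the raw script tag (the loader makes the race easy to express); the 5-second deadline, element id, and publishable key are placeholders.

```typescript
import { loadStripe } from "@stripe/stripe-js";
import type { Stripe } from "@stripe/stripe-js";

// Race the Stripe.js bundle against a deadline so a slow js.stripe.com
// degrades the page gracefully instead of leaving a blank payment form.
async function loadStripeWithTimeout(
  publishableKey: string,
  timeoutMs = 5_000
): Promise<Stripe | null> {
  const timeout = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), timeoutMs)
  );
  return Promise.race([loadStripe(publishableKey), timeout]);
}

async function mountCheckout() {
  const stripe = await loadStripeWithTimeout("pk_live_...");
  if (!stripe) {
    // Fallback path: show a retry prompt rather than an indefinitely
    // spinning checkout.
    document.getElementById("checkout-status")!.textContent =
      "Payments are loading slowly. Please retry in a moment.";
    return;
  }
  // ...create Elements and mount the payment form as usual.
}
```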
The pattern
Three classes. Three different failure shapes. One upstream incident. The published "46-minute API degradation" is a fair description if you only care about the single axis Stripe measured. The real customer experience was far more heterogeneous.
This is the reason our engine treats community signal, SDK telemetry, status-page parsing, and vendor SLA terms as four independent columns rather than one. Every one of them answered a different question for a different customer on 2024-08-22.
Want to see what happens when your Stripe integration degrades? Drop your repo URL into checkupstream.com/audit and the engine will walk every manifest, match every dep to the upstream service it calls, and flag the ones whose historical incidents would have caught you off-guard. Zero account, zero data retained.