
Anatomy of an OpenAI rate-limit cascade

How a tier-2 rate-limit degradation on the OpenAI API propagated through retry storms, connection pools, and shared queue workers — and why our engine saw it 12 minutes before the status page did.

By CheckUpstream Team

In early 2025 a narrow segment of OpenAI API traffic — specifically gpt-4o-mini requests over 8k tokens — started getting throttled at roughly 60% of the posted tier-2 limit. OpenAI's status page eventually said "Elevated error rates on some models." It went up 12 minutes after the first community signal hit.

Our correlator caught the signal at minute 0. Here is the sequence of events the engine saw, and what the downstream cascade looked like in the codebases that depend on the API.

Timeline: what the engine saw

  • T+0:00 — Hacker News comment on a Show HN thread: "Is OpenAI down? My prototype started 429-ing about 10 minutes ago." First signal into our outage_signal table, sourceType: hacker_news, confidence 0.6.
  • T+0:04 — Bluesky post from an AI developer: "gpt-4o-mini seeing rate limits way below tier 2 for me right now." Weight 0.5. Composite score across openai service now 1.1.
  • T+0:08 — X/Twitter chatter starts. Three independent posts, weight 0.4 each. Composite now 2.3.
  • T+0:12 — OpenAI posts to its official status page: "Investigating elevated error rates." SDK telemetry from two dogfood orgs already showing 4σ deviation above the p50 latency baseline for calls to api.openai.com.
  • T+0:47 — Status page cleared.

The 12-minute lead wasn't magic. It came from watching four distinct public signal streams and weighting them against each other, so that no single noisy channel (X is noisy) could over-fire. The weights are recalibrated weekly against verified outcomes: signals that failed to correlate with an actual incident get their weight trimmed on the next fit.
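The weighted accumulation in the timeline can be sketched as follows. The per-source weights (0.6, 0.5, 0.4) come from the timeline above; the type and function names are illustrative, not the engine's actual schema:

```typescript
// Illustrative sketch of composite scoring across signal streams.
// Each stream contributes its calibrated weight; noisy sources (X)
// carry a lower per-post weight so they can't dominate the score.
type Signal = {
  sourceType: "hacker_news" | "bluesky" | "x_twitter";
  weight: number;
};

function compositeScore(signals: Signal[]): number {
  return signals.reduce((sum, s) => sum + s.weight, 0);
}
```

With the weights from the timeline, one HN comment scores 0.6; adding the Bluesky post brings the composite to 1.1; three X posts at 0.4 each push it to 2.3, matching the T+0:08 reading.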

The cascade pattern

"Rate limit cascade" is the right name because the same class of failure tends to rip through a codebase in three layers.

Layer 1: the call site

The simplest case. Code like:

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
});

When the API returns 429 with no retry handling, the request fails, the user sees an error, the story ends. Messy but bounded.

Layer 2: the retry wrapper

More mature codebases wrap the OpenAI SDK with a retry-on-429 pattern. Typical shape:

const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function withRetry<T>(fn: () => Promise<T>, opts: { max: number }): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < opts.max; attempt++) {
    try { return await fn(); }
    catch (err) {
      lastErr = err;
      // Retry only on 429, backing off exponentially: 1s, 2s, 4s, ...
      if ((err as { status?: number }).status === 429) {
        await sleep(2 ** attempt * 1000);
        continue;
      }
      throw err;
    }
  }
  throw lastErr; // retries exhausted
}

Under normal load this wrapper is a good choice — a burst of local backpressure is absorbed by three quick retries. Under a sustained rate-limit degradation, the wrapper becomes the problem: every retry hits the same degraded upstream, and if the client fleet is large, every one of them is retrying on the same schedule. This is the "retry storm" shape: aggregate traffic to the upstream multiplies by the retry count, which makes the degradation last longer.
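The standard mitigation for synchronized retries is jitter: instead of every client sleeping exactly 2^attempt seconds, each picks a random delay up to that ceiling, and honors the server's Retry-After hint when one is sent. A minimal sketch (the function name and parameters are ours, not from any SDK):

```typescript
// Full-jitter backoff: spreads a fleet's retries across the window
// instead of synchronizing them on the same schedule.
function fullJitterDelay(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  retryAfterMs?: number
): number {
  // If the server sent Retry-After, trust it over our own schedule.
  if (retryAfterMs !== undefined) return retryAfterMs;
  const ceiling = Math.min(capMs, 2 ** attempt * baseMs);
  return Math.random() * ceiling; // uniform in [0, ceiling)
}
```

Swapping this in for the fixed `2 ** attempt * 1000` sleep keeps aggregate retry traffic flat instead of spiking it in lockstep with every other client of the degraded upstream.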

Our engine flags projects using this pattern in the analyze_project MCP tool — the AST extractor tags every call site whose surrounding scope contains retry|backoff vocabulary.
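As a toy approximation of that tagging heuristic (the real extractor works on the AST; this regex version is illustrative only):

```typescript
// Illustrative only: flag a call site's surrounding scope if it
// contains retry/backoff vocabulary, approximating the AST tagger.
const RETRY_VOCAB = /\b(retry|retries|retrying|backoff)\b/i;

function hasRetryVocabulary(scopeSource: string): boolean {
  return RETRY_VOCAB.test(scopeSource);
}
```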

Layer 3: the shared queue

The deepest cascade happens in codebases where OpenAI calls pipe through a shared background queue. When the queue workers start retrying failed items, they block ahead of non-OpenAI items that would otherwise succeed. A webhook handler that dispatches "send summary email" and "call OpenAI for summary" through the same worker pool sees both stall — because the OpenAI calls are holding worker slots.

The fix — route OpenAI calls to a dedicated worker pool with a concurrency cap below your tier-2 rate limit — is well-known to teams that have lived through a cascade, and almost universally under-implemented by teams that haven't.
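A minimal sketch of that fix, assuming a promise-based job runner; the `Semaphore` helper, function names, and the cap value of 4 are all illustrative, and the cap should be derived from your actual tier limits:

```typescript
// A tiny counting semaphore: at most `slots` jobs run at once;
// the rest queue up waiting for a release.
class Semaphore {
  private waiters: (() => void)[] = [];
  constructor(private slots: number) {}

  async acquire(): Promise<void> {
    if (this.slots > 0) { this.slots--; return; }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.slots++;
  }
}

// Dedicated pool for OpenAI work only. During a 429 storm, at most
// 4 worker slots are tied up retrying; everything else keeps flowing.
const openaiPool = new Semaphore(4);

async function runOpenAIJob<T>(job: () => Promise<T>): Promise<T> {
  await openaiPool.acquire();
  try { return await job(); }
  finally { openaiPool.release(); }
}
```

The point is isolation, not throughput: the "send summary email" jobs never compete with stalled OpenAI retries for a worker slot, so a degraded upstream degrades only its own lane.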

The insight for our readers

The gap between "community noticed" and "status page posted" is not a quirk of this one incident. It's the median gap. Across the 6,893 incidents our engine tracks, the median community lead over the official status page is 6 minutes; the 75th percentile is 9 minutes; outliers like this one hit 20+.

That lead time is how much notice you get to act on your own system before the vendor officially tells you something's happening. For a retry-storm-prone codebase, 6 minutes is the difference between clean graceful degradation and a backlog you spend an hour cleaning up.


Curious whether your code is in layer 1, 2, or 3? Paste your repo URL into checkupstream.com/audit — the engine will map every OpenAI call site, annotate each one with its retry pattern, and tell you where the cascade would hurt.