Rate limits are the most common production incident you will hit on the OpenAI API, ahead of model-quality issues, latency spikes, and even outages. They are also the most fixable. The tier system rewards spend, the 429 response carries every signal you need to react well, and the right fallback pattern (Azure OpenAI or Bedrock behind a gateway) makes throttling invisible to users.
This guide covers the tier system as of May 2026, the four limit dimensions (RPM, TPM, RPD, TPD), the 429 error headers, exponential-backoff code you can paste into a service today, and the gateway pattern that turns a rate-limit incident into a logline.
For the broader story on multi-provider routing and fallback, see our LLM Gateway pillar.
## TL;DR
- OpenAI has five paid usage tiers (plus a Free tier) based on cumulative spend and a minimum waiting period. Tier 1 starts at $5 spent; Tier 5 requires $1,000 cumulative spend and 30 days since first payment.
- Limits are enforced on four dimensions: RPM (requests per minute), TPM (tokens per minute), RPD (requests per day), and TPD (tokens per day) on some models.
- A 429 response includes `x-ratelimit-remaining-*` and `retry-after-ms` headers. Always read them. Do not blindly sleep for a fixed interval.
- Exponential backoff with jitter is the baseline retry strategy. The `tenacity` library (Python) or `p-retry` (TypeScript) handles it cleanly.
- The real fix at scale is a gateway with provider fallback: when OpenAI is throttled, route the same request to Azure OpenAI, Bedrock, or another mirror without changing app code.
## The tier system
OpenAI's rate limits scale with your cumulative spend on the platform. Every account starts in Free or Tier 1; you advance automatically as you spend, with no support ticket required. As of May 2026:
| Tier | Qualification |
|---|---|
| Free | New account, before any payment |
| Tier 1 | $5 paid total |
| Tier 2 | $50 paid + 7 days since first payment |
| Tier 3 | $100 paid + 7 days since first payment |
| Tier 4 | $250 paid + 14 days since first payment |
| Tier 5 | $1,000 paid + 30 days since first payment |
The waiting period is a hard floor. You can spend $10,000 on day one and still wait the calendar days before advancing to Tier 5. This is intentional: it prevents fraud spikes from getting full-fat rate limits before the platform can vet the account.
Tier advancement is automatic. You do not file a request. When both conditions are met, your limits jump on the next request. You can check your current tier and limits in the platform settings under Limits.
Qualifying spend is cumulative and all-time across your organization, including consumed credits, not just the most recent month's bill. Promo and grant credits also count toward tier qualification in most cases; verify on the live tier page if you are relying on this.
## The four limit dimensions
OpenAI enforces limits on up to four dimensions per model. Hit any one of them and you get a 429.
- RPM (requests per minute). The number of API calls you can make per rolling 60-second window. The blunt instrument; easy to hit on bursty workloads.
- TPM (tokens per minute). The total input + output tokens you can process per rolling 60-second window. The realistic constraint for long-context work; a single GPT-5.5 call with 100k context tokens can eat a huge fraction of your TPM budget.
- RPD (requests per day). A daily request cap. Mostly enforced on free-tier accounts and on certain low-tier model snapshots.
- TPD (tokens per day). A daily token cap, applied on some models, again mostly relevant on lower tiers.
Indicative limits (these change; verify the live limits page for your tier and model):
- Tier 1, GPT-5.4: about 500 RPM and 500,000 TPM.
- Tier 5, GPT-5.5: about 15,000 RPM and tens of millions of TPM, plus a batch queue limit measured in the billions of tokens.
The right way to reason about limits: you have a token budget that refills every 60 seconds. Every request reserves tokens against that budget at submission. If your reserved tokens (estimated from input plus `max_completion_tokens`) would exceed the remaining TPM, the request is rejected with a 429 immediately, before any generation happens.
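To see how fast that budget goes, here is a rough version of the reservation arithmetic. A sketch, assuming `tiktoken`'s `o200k_base` encoding as a stand-in for whatever tokenizer your model snapshot actually uses:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: stand-in encoding

def reserved_tokens(messages: list[dict], max_completion_tokens: int) -> int:
    # The server reserves roughly prompt tokens + max_completion_tokens
    # against your TPM budget at submission time.
    prompt = sum(len(enc.encode(m["content"])) for m in messages)
    return prompt + max_completion_tokens

# A 100k-token prompt with max_completion_tokens=4096 reserves ~104k tokens,
# about a fifth of a 500k TPM budget in a single call.
```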
## Reading a 429 response
When you hit a limit, the API returns HTTP 429 with a body that looks like this:
```json
{
  "error": {
    "message": "Rate limit reached for gpt-5.4 in organization org-xxx on tokens per min. Limit: 500000 / min. Current: 487234 / min.",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}
```

The body tells you which dimension you hit. The headers tell you when to retry. Read these on every response, not just on errors:
- `x-ratelimit-limit-requests` / `x-ratelimit-limit-tokens`: your current budget ceiling.
- `x-ratelimit-remaining-requests` / `x-ratelimit-remaining-tokens`: how much is left in the current window.
- `x-ratelimit-reset-requests` / `x-ratelimit-reset-tokens`: time until the window resets, formatted like `6s` or `1m30s`.
- `retry-after-ms` (on 429 only): the server's recommendation in milliseconds. This is the single most useful header for retry logic.
The `retry-after-ms` value is calibrated. It is not a generic "wait a bit"; it is OpenAI telling you how long until you have enough budget for the request you just sent. Honor it.
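In the official Python SDK, `with_raw_response` exposes the headers alongside the parsed completion; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI()

raw = client.chat.completions.with_raw_response.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "ping"}],
)
# Log these on every call, not just on 429s.
print(raw.headers.get("x-ratelimit-remaining-tokens"))
print(raw.headers.get("x-ratelimit-reset-tokens"))
completion = raw.parse()  # the usual ChatCompletion object
```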
## Exponential backoff with jitter
When you do retry, the standard pattern is exponential backoff with jitter. Pure exponential backoff causes thundering-herd retries when many clients hit the limit at once; jitter spreads them out.
In Python with `tenacity`:

```python
import openai
from openai import OpenAI
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

client = OpenAI()

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
    reraise=True,
)
def chat(messages):
    return client.chat.completions.create(
        model="gpt-5.4",
        messages=messages,
    )
```

`wait_random_exponential(min=1, max=60)` waits a random value in [0, min(60, 2^attempt)] seconds between attempts. After 6 attempts (roughly a minute of backoff in the worst case), it gives up and reraises the original exception. That is the right behavior for a request-response service: surface the failure rather than block indefinitely.
The same pattern in TypeScript with `p-retry`:

```typescript
import OpenAI from "openai";
import pRetry, { AbortError } from "p-retry";

const client = new OpenAI();

async function chat(messages: any[]) {
  return pRetry(
    async () => {
      try {
        return await client.chat.completions.create({
          model: "gpt-5.4",
          messages,
        });
      } catch (err: any) {
        // Only retry 429 and 5xx. Abort on auth, bad request, etc.
        if (err.status && err.status < 500 && err.status !== 429) {
          throw new AbortError(err);
        }
        throw err;
      }
    },
    {
      retries: 5,
      factor: 2,
      minTimeout: 1_000,
      maxTimeout: 60_000,
      randomize: true,
    }
  );
}
```

Two refinements that pay off:
- Honor `retry-after-ms` when present. Instead of jittered exponential, sleep for exactly that duration on the first retry. Fall back to jittered exponential if the header is missing or after the first retry (see the sketch after this list).
- Do not retry idempotently unsafe operations. Chat completions are safe to retry. A fine-tune job submission is not. Check the endpoint.
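A sketch of the first refinement in Python, assuming the SDK's `RateLimitError` (which carries the raw `httpx` response, so the headers are reachable):

```python
import random
import time

import openai
from openai import OpenAI

client = OpenAI()

def chat_with_hint(messages, max_attempts: int = 6):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model="gpt-5.4", messages=messages)
        except openai.RateLimitError as err:
            if attempt == max_attempts - 1:
                raise
            hint = err.response.headers.get("retry-after-ms")
            if attempt == 0 and hint is not None:
                time.sleep(int(hint) / 1000)  # trust the calibrated server hint first
            else:
                time.sleep(random.uniform(0, min(60, 2 ** attempt)))  # jittered exponential
```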
## What not to retry
A 429 is retryable. These are not:
- 401 Unauthorized. Your API key is wrong. Retrying does not fix it.
- 403 Forbidden. Your account or org cannot access this model or endpoint.
- 400 Bad Request. Your payload is malformed. Schema, parameter, or content-policy issue.
- 429 with `insufficient_quota`. This is the "you ran out of credits" 429, not the rate-limit 429. Check the `code` field; do not retry, top up.

The last one trips people up. A 429 from `rate_limit_exceeded` is a "wait and try again" condition. A 429 from `insufficient_quota` is "you are out of money." Backing off and retrying does not help; you need someone to add credits. See OpenAI API Credits for the billing side.
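Both flavors raise `RateLimitError` in the Python SDK, so branch on the error's `code` field; a sketch with a hypothetical `alert_billing` hook:

```python
import openai

messages = [{"role": "user", "content": "hello"}]

try:
    resp = chat(messages)  # the retrying helper from earlier
except openai.RateLimitError as err:
    if err.code == "insufficient_quota":
        # Out of credits: no amount of backoff will fix this.
        alert_billing(err.message)  # hypothetical billing-alert hook
    else:
        # rate_limit_exceeded: transient; the retry policy already did its best.
        raise
```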
## Estimating tokens before you send
The cleanest way to avoid 429s is to track your own usage and slow yourself down before OpenAI does. Before each request, estimate token count (input tokens via `tiktoken` plus your `max_completion_tokens` setting), check against a local token bucket sized to your TPM, and sleep if you are about to exceed it.
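A minimal single-process bucket, building on the `reserved_tokens` estimator sketched earlier (the TPM figure is the indicative Tier 1 number from above):

```python
import time

class TokenBucket:
    """Refills continuously at tpm / 60 tokens per second, capped at tpm."""

    def __init__(self, tpm: int):
        self.capacity = tpm
        self.tokens = float(tpm)
        self.last = time.monotonic()

    def acquire(self, n: int) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.capacity / 60)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((n - self.tokens) * 60 / self.capacity)

bucket = TokenBucket(tpm=500_000)

def guarded_chat(messages, max_completion_tokens: int = 1024):
    bucket.acquire(reserved_tokens(messages, max_completion_tokens))
    return chat(messages)  # the retrying helper from earlier
```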
A single-process bucket like this is a fine first pass. At scale across multiple processes or services, you need a shared bucket: Redis (sketched below), a sidecar, or a gateway. That is where most teams move to a gateway.
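Across processes, the same idea needs shared state. A coarse fixed-window counter in Redis, sketched with plain `redis-py`; simpler than a true token bucket, but good enough to stay under TPM:

```python
import time

import redis

r = redis.Redis()

def try_reserve(tokens: int, tpm: int = 500_000) -> bool:
    # One counter per wall-clock minute, shared by every process.
    key = f"tpm:{int(time.time() // 60)}"
    used = r.incrby(key, tokens)
    r.expire(key, 120)  # let stale windows clean themselves up
    return used <= tpm  # caller sleeps and retries when False
```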
## Advancing through the tiers
Strategies to advance faster:
- Top up in larger increments. A single $1,000 top-up counts as $1,000 paid, same as 200 $5 top-ups, but with much less platform friction.
- Avoid letting your balance hit zero. If your `insufficient_quota` 429s pile up, your account looks unstable. Auto-recharge avoids this.
- Concentrate spend in one org. Tier qualification is per organization. Spreading $1,000 across three orgs leaves every org short of Tier 5.
- Do not ask support to upgrade you manually. OpenAI does not generally process manual tier upgrade requests in 2026; advancement is automatic once you meet the threshold.
## Provider fallback: the real fix at scale
Past about 100 RPS sustained, in-process backoff is not enough. The right architecture is multi-provider: OpenAI as primary, Azure OpenAI as secondary mirror (same models, separate rate-limit pool), Bedrock or Vertex as tertiary for non-OpenAI models. When the primary throttles, you fail over to a mirror serving the same model name, transparently.
A gateway is the natural place to implement this. The app code calls one endpoint. The gateway tries OpenAI first; on 429 (and 5xx and timeout), it retries against Azure within the same request. The client sees a successful response. The 429 becomes a logline instead of a failed call.
```python
import os

from openai import OpenAI

# Point your client at the gateway, not OpenAI directly.
client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Fallback configured on the gateway side:
#   primary:  openai/gpt-5.4
#   fallback: azure/gpt-5.4
#   on: [429, 500, 502, 503, 504, timeout]
client.chat.completions.create(
    model="gpt-5.4",
    messages=[...],
)
```

The fallback is invisible to your code. The gateway emits metrics so you can see how often the secondary is firing; if it is more than 5 percent of traffic, you have a real capacity problem and should advance tiers or split traffic.
See Best LLM Gateways in 2026 for vendor comparisons, and What is an LLM Gateway for the conceptual model.
## Common gotchas
- Retrying on 4xx errors that are not 429. A 400 is permanent. Retrying wastes time and clutters your logs.
- Fixed `time.sleep(5)` between retries. Works for one client; thunders the herd when many clients hit the limit simultaneously. Use jittered exponential.
- Not capping retry count. A retry loop with no `stop_after_attempt` will block your request handler indefinitely on a sustained outage. Always cap.
- Ignoring `retry-after-ms`. It is more accurate than any heuristic. Use it on the first retry.
- Per-process token bucket in a multi-process service. Each process thinks it has the full quota; combined, you blow past the limit. Use a shared bucket or a gateway.
- Conflating `insufficient_quota` with rate limits. Different problem, different fix. Check the error code.
## FAQ
### How do I check my current rate limits?

The platform Limits page lists every model and your current RPM/TPM/RPD/TPD. The response headers (`x-ratelimit-limit-*`) on every successful call also tell you live.

### Can I ask OpenAI for higher limits?

Limits scale automatically with spend and tier. There is no manual override request for standard models. For enterprise contracts, your account manager handles it.

### Does prompt caching count against TPM?

Cached tokens still count as input tokens for TPM accounting (you pay less for them, but they consume budget). Plan capacity accordingly.

### Are batch API calls subject to rate limits?

Batch has its own queue limit, separate from synchronous RPM/TPM. Submitting a batch with 100,000 requests does not block your live traffic.

### What is the difference between `rate_limit_exceeded` and `insufficient_quota`?

Both return HTTP 429. The first is "wait, then retry, you will succeed." The second is "you are out of credits, retrying will not help." Check the `error.code` field.

### Should I retry on 500 errors?

Yes, with backoff. 500s are typically transient. Use the same retry policy as 429.

### How do I shape traffic so no single feature starves the others?

Per-feature budgets in a gateway. OpenAI does not natively offer sub-account limits. The gateway lets you set RPM/TPM caps per user, feature, or workspace.