Anthropic's API enforces three rate-limit dimensions per model class: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Hit any one of them and you get a 429. There is also a second failure mode that is not a rate limit at all: the 529 overloaded_error, which fires when Anthropic's own capacity is saturated regardless of your tier. Both look like the same problem to your application. They are not.
This guide covers what the limits are in May 2026, the difference between 429 and 529, how to retry safely with exponential backoff and jitter, and how a gateway with provider fallback (Bedrock, Vertex) keeps your agent alive when the direct API is having a bad afternoon.
For the broader picture of why every production LLM app eventually adds a proxy layer, see our LLM gateway pillar.
TL;DR
- Anthropic uses four Build Tiers (1, 2, 3, 4) plus Monthly Invoicing. You advance by hitting cumulative credit-purchase thresholds ($5, $40, $200, $400).
- Each tier has separate RPM, ITPM, and OTPM limits per model family (Opus 4.x, Sonnet 4.x, Haiku 4.5).
- A 429 means you exceeded a limit. Respect the `retry-after` header.
- A 529 means Anthropic is overloaded. Retry with exponential backoff, but also fail over to another provider.
- For most current models, only uncached input tokens count toward ITPM. Prompt caching effectively raises your ceiling.
- Production answer: backoff + jitter on transient errors, fallback to Bedrock or Vertex Claude on 529, and put a gateway in front of all of it.
The Build Tier system
Anthropic ships rate limits and monthly spend caps together, indexed by the credit you have purchased. As of May 2026 the tiers are:
| Tier | Cumulative Credit Purchase | Monthly Spend Limit |
|---|---|---|
| 1 | $5 | $100 |
| 2 | $40 | $500 |
| 3 | $200 | $1,000 |
| 4 | $400 | $200,000 |
| Monthly Invoicing | N/A | No limit |
You advance automatically when the cumulative credit threshold is met. Custom limits beyond Tier 4 require a sales conversation.
Tier 1 rate limits (Messages API)
| Model | RPM | ITPM | OTPM |
|---|---|---|---|
| Opus 4.x | 50 | 500,000 | 80,000 |
| Sonnet 4.x | 50 | 30,000 | 8,000 |
| Haiku 4.5 | 50 | 50,000 | 10,000 |
Tier 2
| Model | RPM | ITPM | OTPM |
|---|---|---|---|
| Opus 4.x | 1,000 | 2,000,000 | 200,000 |
| Sonnet 4.x | 1,000 | 450,000 | 90,000 |
| Haiku 4.5 | 1,000 | 450,000 | 90,000 |
Tier 3
| Model | RPM | ITPM | OTPM |
|---|---|---|---|
| Opus 4.x | 2,000 | 5,000,000 | 400,000 |
| Sonnet 4.x | 2,000 | 800,000 | 160,000 |
| Haiku 4.5 | 2,000 | 1,000,000 | 200,000 |
Tier 4
| Model | RPM | ITPM | OTPM |
|---|---|---|---|
| Opus 4.x | 4,000 | 10,000,000 | 800,000 |
| Sonnet 4.x | 4,000 | 2,000,000 | 400,000 |
| Haiku 4.5 | 4,000 | 4,000,000 | 800,000 |
Notes that matter in production:
- Opus 4.x limits apply to combined traffic across all Opus 4.x versions (4.7, 4.6, 4.5, 4.1, 4.0). You do not get a fresh bucket per version.
- Sonnet 4.x is similarly pooled across 4.6, 4.5, and 4.0.
- `cache_read_input_tokens` does not count toward ITPM for current models. That is a real lever (more on that later).
- The token bucket replenishes continuously. A "60 RPM" limit is effectively "1 RPS, plus burst headroom." A client-side throttle sketch follows this list.
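Because the bucket refills continuously, a client-side throttle that spends request budget at the same rate keeps you just under the ceiling instead of bursting into a 429. A minimal sketch, assuming a "60 RPM" limit; the capacity and refill rate here are illustrative knobs, not Anthropic constants:

```python
import threading
import time

class TokenBucket:
    """Client-side throttle that mirrors a continuously refilling RPM bucket."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity              # burst headroom, e.g. 60 requests
        self.refill_per_sec = refill_per_sec  # e.g. 1.0 for a "60 RPM" limit
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last) * self.refill_per_sec,
                )
                self.last = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                needed = (cost - self.tokens) / self.refill_per_sec
            time.sleep(needed)

# A "60 RPM" limit behaves like 1 request/second plus burst headroom:
bucket = TokenBucket(capacity=60, refill_per_sec=1.0)
# Call bucket.acquire() before each client.messages.create(...) call.
```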
429 vs 529: same shape, different problem
Both errors come back as 4xx/5xx with a JSON body. The shape is similar, the cause is not.
HTTP 429 `rate_limit_error` means you exceeded one of your tier's limits. The response includes a `retry-after` header in seconds and a body with `"type": "rate_limit_error"`. You should:
- Read `retry-after` and wait at least that long.
- Reduce concurrency on your side so the next minute does not also fail.
- If you keep hitting this, you have a capacity problem that retries will not fix. Upgrade tier, enable prompt caching, or shed load.
HTTP 529 `overloaded_error` means Anthropic itself is over capacity. The response body has `"type": "overloaded_error"`. The error has nothing to do with your usage, and there is no `retry-after` value you can trust. You should:
- Retry with exponential backoff and jitter.
- After a couple of retries, fail over to another provider hosting the same model (AWS Bedrock or GCP Vertex AI).
- Capture the event so you can correlate with Anthropic's status page after the fact.
A 503 from Anthropic also appears occasionally during incidents. Treat it like a 529 for retry purposes.
Exponential backoff with jitter
The textbook retry pattern: wait time grows exponentially, and each retry adds randomized jitter so retrying clients do not synchronize and stampede the API at the same instant.
Python

```python
import os
import random
import time

import anthropic
from anthropic import RateLimitError, APIStatusError

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def call_claude_with_retry(messages, model="claude-sonnet-4-6", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
            )
        except RateLimitError as e:
            # 429: respect retry-after if present, else fall back to exponential
            retry_after = float(e.response.headers.get("retry-after", 2 ** attempt))
            sleep_s = retry_after + random.uniform(0, 1)
            time.sleep(sleep_s)
        except APIStatusError as e:
            # 529 overloaded, 503, and other transient 5xx
            if e.status_code in (529, 503, 502, 504):
                sleep_s = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(sleep_s)
            else:
                raise
    raise RuntimeError("Exhausted retries against Claude API")
```

The key details: respect `retry-after` for 429s, use pure exponential backoff plus jitter for 529s, and cap the retry count so a single request cannot block forever.
TypeScript

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function callClaudeWithRetry(
  messages: Anthropic.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5,
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 1024,
        messages,
      });
    } catch (err) {
      const e = err as Anthropic.APIError;
      if (e.status === 429) {
        // 429: respect retry-after if present
        const retryAfter = Number(e.headers?.["retry-after"] ?? 2 ** attempt);
        await sleep(retryAfter * 1000 + Math.random() * 1000);
      } else if ([529, 503, 502, 504].includes(e.status ?? 0)) {
        // 529 overloaded and other transient 5xx: exponential backoff + jitter
        await sleep(2 ** attempt * 1000 + Math.random() * 1000);
      } else {
        throw err;
      }
    }
  }
  throw new Error("Exhausted retries against Claude API");
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
```

The Anthropic Python and TypeScript SDKs both ship a built-in retrying client that handles much of this for you (`max_retries` on the Python constructor, `maxRetries` in TypeScript), but in production you usually want the extra control above so you can log retries, emit metrics, and fail over.
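If you do want the SDK to own simple retries, it is one constructor argument. A minimal sketch with the Python client; the SDK retries 429s and transient 5xx with its own backoff, but it will not fail over to another provider:

```python
import anthropic

# The SDK retries rate limits and transient server errors internally
# with exponential backoff; max_retries caps the attempts.
client = anthropic.Anthropic(max_retries=5)
```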
Reading the rate-limit headers
Every successful response carries headers that tell you how much room you have left. Watch these in your gateway or telemetry:
| Header | Meaning |
|---|---|
| `anthropic-ratelimit-requests-remaining` | RPM budget left this window |
| `anthropic-ratelimit-input-tokens-remaining` | ITPM budget left (rounded to the nearest 1k) |
| `anthropic-ratelimit-output-tokens-remaining` | OTPM budget left (rounded to the nearest 1k) |
| `anthropic-ratelimit-input-tokens-reset` | RFC 3339 timestamp when the bucket refills |
| `retry-after` | Only present on 429; seconds to wait |
A simple production guardrail: if `anthropic-ratelimit-input-tokens-remaining` drops below 10% of `anthropic-ratelimit-input-tokens-limit`, your gateway should start shedding low-priority traffic (background classification, eval runs) before the user-facing path starts seeing 429s.
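A sketch of that check using the Python SDK's raw-response accessor; the 10% threshold and the `allow_low_priority` flag are illustrative choices, not SDK features:

```python
import anthropic

client = anthropic.Anthropic()

def itpm_headroom(raw_headers) -> float:
    """Fraction of the ITPM bucket still available, read from response headers."""
    remaining = float(raw_headers.get("anthropic-ratelimit-input-tokens-remaining", 0))
    limit = float(raw_headers.get("anthropic-ratelimit-input-tokens-limit", 1))
    return remaining / limit

# with_raw_response exposes the headers alongside the parsed message
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "ping"}],
)
message = raw.parse()

# Below 10% headroom, pause background classification and eval runs
# so the interactive path keeps its budget.
allow_low_priority = itpm_headroom(raw.headers) > 0.10
```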
Prompt caching as a rate-limit multiplier
For current Claude models (anything without the † marker in Anthropic's docs), `cache_read_input_tokens` do not count toward ITPM. They are also billed at 10% of the base input price.
The implication is large. With a 2,000,000 ITPM ceiling on Sonnet 4.x at Tier 4 and an 80% cache hit rate, you can process roughly 10M input tokens per minute, because 8M of them are cached reads that the rate limiter ignores. Caching is the cheapest rate-limit upgrade you can ship. For the full pattern, see our Claude prompt caching guide.
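Enabling it is one field on the prompt blocks you want reused. A minimal sketch that marks a long system prompt as cacheable; `LONG_SYSTEM_PROMPT` is a placeholder, and the block must meet the model's minimum cacheable length to actually be cached:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # placeholder: your multi-thousand-token system prompt

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the block cacheable; later calls reusing the exact
            # prefix are served as cache reads.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize this report..."}],
)

# usage shows how much of the request the rate limiter ignored
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)
```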
Provider fallback: surviving 529s
When Anthropic is overloaded, retries on Anthropic do not help. The same model is available on AWS Bedrock and GCP Vertex AI, on separate capacity. A production setup runs all three and fails over.
The clean way is to route everything through a gateway. Your application calls one endpoint with one schema; the gateway tries Anthropic first, fails over to Bedrock Claude on 529, then to Vertex Claude if Bedrock is also struggling.
```python
import os

from openai import OpenAI

# Respan gateway, OpenAI-compatible interface
client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Single call; the gateway handles the fallback chain configured server-side:
# anthropic/claude-sonnet-4-6 -> bedrock/claude-sonnet-4-6 -> vertex/claude-sonnet-4-6
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Summarize this report..."}],
)
```

The same call in TypeScript:
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.respan.ai/v1",
  apiKey: process.env.RESPAN_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{ role: "user", content: "Summarize this report..." }],
});
```

Why this matters operationally: Bedrock and Vertex enforce their own rate limits through your AWS and GCP account quotas (per region), on capacity separate from Anthropic's. Running a primary plus two failovers gives you roughly 3x effective headroom for a busy hour, with no application code changes.
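If you are not ready for a gateway, the same chain can live in application code. A sketch using the official SDK's Bedrock and Vertex clients; the model IDs and the Vertex region/project are illustrative, and each client picks up credentials from your AWS or GCP environment:

```python
import anthropic
from anthropic import APIStatusError, AnthropicBedrock, AnthropicVertex

# Each entry: (client, platform-specific model ID). IDs are illustrative.
CHAIN = [
    (anthropic.Anthropic(), "claude-sonnet-4-6"),
    (AnthropicBedrock(), "anthropic.claude-sonnet-4-6-v1:0"),
    (AnthropicVertex(region="us-east5", project_id="my-project"), "claude-sonnet-4-6"),
]

def call_with_fallback(messages):
    last_error = None
    for client, model in CHAIN:
        try:
            return client.messages.create(model=model, max_tokens=1024, messages=messages)
        except APIStatusError as e:
            if e.status_code in (529, 503):
                last_error = e  # this provider is overloaded: try the next one
                continue
            raise  # 4xx and everything else is not a capacity problem
    raise RuntimeError("All providers overloaded") from last_error
```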
For a gateway comparison, see Best LLM Gateways in 2026.
A working production pattern
- Tier up early. Even moderate production traffic needs Tier 3 or 4. Pre-purchase the credit; do not wait for the first 429 outage to do this.
- Enable prompt caching on long system prompts and RAG context. This pushes effective ITPM up by 3-5x.
- Wrap every Claude call in backoff + jitter. Cap retries at 4 or 5 attempts.
- Configure provider fallback to Bedrock or Vertex for the same model.
- Emit metrics for `retry_count`, `final_status`, and `provider_used` per call. Most 529 storms are short, but you only know that if you can graph them. A wrapper sketch follows this list.
- Set a workspace-level rate limit per feature so a runaway batch job cannot starve your interactive UX.
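A sketch of the instrumentation point, wrapping any of the provider calls above; `emit_metric` is a placeholder for your StatsD/Prometheus/OTel client, not a real library call:

```python
import time

def emit_metric(name: str, value: float, tags: dict) -> None:
    """Placeholder for your metrics client; swap in statsd/prometheus/OTel."""
    print(name, value, tags)

def instrumented(call_fn, messages, provider: str):
    """Wrap a provider call, recording latency, outcome, and serving provider.
    retry_count is best incremented inside the retry loop itself (one per sleep)
    and emitted there alongside this call-level metric."""
    start = time.monotonic()
    final_status = "ok"
    try:
        return call_fn(messages)
    except Exception:
        final_status = "error"
        raise
    finally:
        emit_metric("claude.call.latency_s", time.monotonic() - start, {
            "provider_used": provider,
            "final_status": final_status,
        })
```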
FAQ
What is the difference between 429 and 529? 429 means your account exceeded a rate limit. 529 means Anthropic itself is overloaded. Retry both, but 529 also needs provider fallback because more retries against Anthropic will not help.
How do I upgrade my tier? Buy more credits. Tier 2 requires $40 cumulative, Tier 3 requires $200, Tier 4 requires $400. The upgrade is automatic once you cross the threshold.
Do cached tokens count toward rate limits?
Not for current models. `cache_read_input_tokens` do not count toward ITPM and are billed at 10% of the base input price. This is the single biggest lever for stretching your tier.
What is the retry-after header?
On a 429 response, Anthropic returns `retry-after` in seconds. Wait at least that long before retrying; a retry before the window resets will almost certainly hit the same 429.
Can I set per-workspace rate limits? Yes. You can cap individual workspaces below your organization limit. This is the standard pattern for protecting interactive traffic from batch jobs.
Should I use Anthropic's SDK auto-retry or my own? The SDK auto-retries simple cases. For production, write your own so you can log, metric, and fail over. The SDK does not know about your Bedrock fallback.
Is the Batches API rate-limited separately? Yes. The Batches API has its own RPM and processing-queue limits per tier, shared across models. See the Batches API guide.