The Anthropic Message Batches API runs Claude calls asynchronously at half the per-token price. You submit up to 100,000 message requests in a single batch; Anthropic processes them within 24 hours; you poll for completion and download results. For offline evals, document processing, classification jobs, and anything else that does not need a streaming response right now, it cuts the bill in half without changing the model output.
This guide covers what the Batches API is good for, the real cost math (where it does and does not pay off), code patterns in Python and TypeScript, the operational gotchas, and how it compares to OpenAI's equivalent. As of May 2026, the Batches API supports all current models (Opus 4.7, Sonnet 4.6, Haiku 4.5); Bedrock offers an equivalent batch inference capability for Claude, though through a different API surface (see the FAQ).
For the broader cost picture, see How to reduce OpenAI API costs; for evals, see How to evaluate an LLM.
TL;DR
- 50% off all per-token rates for batch jobs that complete asynchronously.
- Up to 24 hours to complete; most batches finish much faster.
- Up to 100,000 requests per batch. Tier-dependent queue caps apply.
- Not supported in batch: streaming responses, tool use loops that require multi-turn interaction within a single batch entry (single-shot tool calls work).
- Right use cases: offline evals, document summarization at scale, classification, content generation pipelines, dataset labeling.
- Wrong use cases: interactive UX, agent loops, anything user-facing in real time.
When the Batches API is a no-brainer
The 50% discount only applies to workloads that can tolerate up to 24 hours of latency. That maps cleanly onto a few categories:
- Eval runs. You have 5,000 test cases and want to score a new prompt against all of them. Batch.
- Document processing. Summarize 50,000 PDFs, extract entities from a quarter's worth of legal documents, generate metadata for a media library. Batch.
- Classification at scale. Label a corpus, run sentiment on a backfill, tag content with categories. Batch.
- Synthetic data generation. Produce training data for a downstream model or eval set. Batch.
- Re-processing on model upgrade. When Sonnet 4.7 ships and you want to re-run all your generated content with the new model. Batch.
If a human is waiting at a screen for the answer, use the synchronous API. Everything else, default to batch.
The cost math
Take Sonnet 4.6 at base rates ($3 input, $15 output per million tokens). A batch job of 50,000 calls averaging 2K input tokens and 500 output tokens:
- Sync price: 50,000 x (2K x $3/MTok + 500 x $15/MTok) = 50,000 x ($0.006 + $0.0075) = $675
- Batch price (50% off): $337.50
Half the cost, same model output. The only thing you spend is wall-clock latency. For a nightly job that runs at 2 AM, you do not care.
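If you want to sanity-check the math against your own workload, here is a minimal sketch of the same arithmetic. The rates and volumes are the illustrative figures from above; swap in your own model's prices and token averages.

# Sketch: cost estimator for the worked example above.
# Rates are Sonnet 4.6 base prices in $/MTok; adjust for your model.
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00
BATCH_DISCOUNT = 0.5

def job_cost(n_requests, avg_in_tokens, avg_out_tokens, batch=False):
    per_call = (avg_in_tokens * INPUT_RATE + avg_out_tokens * OUTPUT_RATE) / 1_000_000
    total = n_requests * per_call
    return total * BATCH_DISCOUNT if batch else total

print(job_cost(50_000, 2_000, 500))              # 675.0
print(job_cost(50_000, 2_000, 500, batch=True))  # 337.5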
Prompt caching also stacks with the Batches API, although caching is most useful when you have many requests sharing a common prefix submitted close together in time. For batches with shared system prompts, both discounts apply.
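A sketch of what a cache-friendly batch entry can look like, assuming a long system prompt shared across entries (`SHARED_SYSTEM` and `documents` are placeholders for your own data):

# Sketch: batch entries sharing a cached system prompt.
# SHARED_SYSTEM is a hypothetical long prompt reused across every entry.
requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "system": [
                {
                    "type": "text",
                    "text": SHARED_SYSTEM,  # identical prefix across entries
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            "messages": [{"role": "user", "content": doc}],
        },
    }
    for i, doc in enumerate(documents)
]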
What is supported and what is not
Supported in batch:
- All Messages API models (Opus 4.x, Sonnet 4.x, Haiku 4.5).
- System prompts, multi-turn message arrays, images.
- Prompt caching (cache hits work across batch entries).
- Single-turn tool use (model emits a tool call, batch returns; you process the result in your application). See the sketch after these lists.
- Extended thinking on supported models.
Not supported in batch:
- Streaming responses. Batch is request/response only.
- Multi-turn tool use within a single batch request. If your agent needs the model to call tool A, see the result, call tool B, you cannot do that inside one batch entry. You can run each turn as a separate batch, but most agent loops do not fit batch's latency budget anyway.
- Real-time response. The whole point of batch is async.
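To make the single-turn tool-use point concrete: a batch entry's params follow the Messages API shape, so it can carry a tools definition, and the result row may contain a tool_use block that your application handles after the batch completes. A sketch, with a made-up get_weather tool:

# Sketch: a batch entry with a tool definition (get_weather is hypothetical).
tool_entry = {
    "custom_id": "weather-lookup-1",
    "params": {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": [
            {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "input_schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    },
}
# The result row may contain a tool_use content block; you execute the tool
# in your own code after the batch ends - there is no second model turn
# inside the batch.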
Python: submitting and polling a batch
import os
import time
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "messages": [
                {"role": "user", "content": f"Summarize: {documents[i]}"}
            ],
        },
    }
    for i in range(len(documents))
]

batch = client.messages.batches.create(requests=requests)
print(f"Submitted batch {batch.id} with {len(requests)} requests")

# Poll for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Status: {batch.processing_status} "
          f"({batch.request_counts.succeeded} succeeded, "
          f"{batch.request_counts.errored} errored)")
    if batch.processing_status == "ended":
        break
    time.sleep(30)

# Stream results (one JSONL row per request)
for result in client.messages.batches.results(batch.id):
    custom_id = result.custom_id
    if result.result.type == "succeeded":
        message = result.result.message
        print(f"{custom_id}: {message.content[0].text[:80]}")
    else:
        print(f"{custom_id}: failed - {result.result}")

Three things worth noting:
- `custom_id` is your handle for matching results back to inputs. Use document IDs, row keys, or anything stable.
- Polling cadence. Every 30-60 seconds is fine. Anthropic does not currently push webhooks for batch completion, so polling is the pattern.
- Results stream as JSONL. For very large batches, iterate; do not load all results into memory.
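For the last point, one approach is to write rows to disk as they stream instead of collecting them in a list. A minimal sketch reusing the `client` and `batch` from the example above; the output path is arbitrary:

import json

# Sketch: persist results row by row instead of holding them in memory.
with open("batch_results.jsonl", "w") as f:
    for result in client.messages.batches.results(batch.id):
        row = {"custom_id": result.custom_id, "status": result.result.type}
        if result.result.type == "succeeded":
            row["text"] = result.result.message.content[0].text
        f.write(json.dumps(row) + "\n")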
TypeScript: same flow
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const requests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  params: {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user" as const, content: `Summarize: ${doc}` }],
  },
}));

const batch = await client.messages.batches.create({ requests });
console.log(`Submitted batch ${batch.id}`);

// Poll
let current = batch;
while (current.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 30_000));
  current = await client.messages.batches.retrieve(batch.id);
  console.log(
    `Status: ${current.processing_status} (${current.request_counts.succeeded} ok, ${current.request_counts.errored} err)`,
  );
}

// Iterate results
for await (const result of await client.messages.batches.results(batch.id)) {
  if (result.result.type === "succeeded") {
    const msg = result.result.message;
    const firstBlock = msg.content[0];
    if (firstBlock.type === "text") {
      console.log(`${result.custom_id}: ${firstBlock.text.slice(0, 80)}`);
    }
  } else {
    console.log(`${result.custom_id}: failed`);
  }
}

Batch rate limits and queue size
The Batches API has its own quotas, separate from the Messages API limits. As of May 2026:
| Tier | Batch RPM | Max requests in processing queue | Max requests per batch |
|---|---|---|---|
| 1 | 50 | 100,000 | 100,000 |
| 2 | 1,000 | 200,000 | 100,000 |
| 3 | 2,000 | 300,000 | 100,000 |
| 4 | 4,000 | 500,000 | 100,000 |
The "in processing queue" cap is the total across all of your batches that have not yet completed. If you are running large overnight workloads, this is the number that constrains you, not the per-batch ceiling.
Gotchas worth knowing before you ship
- No streaming. Obvious but worth restating. If your downstream consumer wants partial tokens, batch is wrong.
- `custom_id` must be unique within a batch. Repeat keys cause the batch to be rejected at submit time.
- Failed individual requests do not fail the batch. Each request lands in the JSONL output as `succeeded` or `errored`. Process errors at the row level.
- Cancellation is best-effort. You can cancel a batch, but requests already in flight may still complete and be billed.
- Result retention is 29 days. Persist results to your own storage soon after completion if you need them long-term.
- Wall-clock SLA is 24 hours. Most batches finish in minutes to hours; budget for the worst case.
- Prompt caching across batch entries. If many batch entries share a system prompt, place the `cache_control` marker as usual; cache writes and reads behave the same. The savings compound with the batch discount.
- You still pay for cache writes. The 50% discount applies to base token rates, including cache write/read pricing.
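Because failures land per row rather than failing the batch, a common follow-up is to collect the errored `custom_id`s and resubmit just those rows. A sketch, reusing the `requests` list and `batch` from the Python example earlier:

# Sketch: resubmit only the rows that errored in the previous batch.
errored_ids = {
    r.custom_id
    for r in client.messages.batches.results(batch.id)
    if r.result.type == "errored"
}
retry_requests = [req for req in requests if req["custom_id"] in errored_ids]
if retry_requests:
    retry_batch = client.messages.batches.create(requests=retry_requests)
    print(f"Retrying {len(retry_requests)} requests in batch {retry_batch.id}")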
Comparison to OpenAI Batch API
OpenAI's Batch API offers the same 50% discount and the same async model with an up-to-24-hour SLA, but caps each batch at 50,000 requests (Anthropic allows 100,000). Differences worth knowing:
| Dimension | Anthropic Batches | OpenAI Batch |
|---|---|---|
| Discount | 50% | 50% |
| SLA | Up to 24h | Up to 24h |
| Max requests per batch | 100,000 | 50,000 |
| Submission format | JSON requests array in API call | JSONL file upload |
| Streaming | No | No |
| Tool use | Single-turn only | Single-turn only |
| Result retention | 29 days | Until you delete |
The Anthropic flow feels more API-native; OpenAI's flow goes through their Files API and resembles a job system. For most use cases, either is fine. If you are running both providers (good idea), a gateway can normalize the surface.
A practical batch pattern for evals
The Batches API was practically built for offline eval runs. The shape:
- Prepare a JSONL of test cases, each with an input and expected output (or a rubric).
- Submit one batch entry per test case with a `custom_id` matching the test case ID.
- Poll until complete.
- Score each row against the expected output using a separate scorer (could be another batch).
- Aggregate into pass/fail metrics.
Doing this synchronously on 5,000 test cases at Sonnet 4.6 sync rates costs roughly $70 (depending on prompt size). On the Batches API it costs roughly $35, and you do not chew through your interactive Messages API rate limits doing it. See How to evaluate an LLM for the broader pattern.
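A compressed sketch of that flow, assuming `test_cases` is a list of dicts with `id`, `input`, and `expected` fields, and `score()` is your own comparison function returning True or False:

# Sketch: offline eval run via the Batches API.
# test_cases and score() are assumptions - shape them to your own eval harness.
eval_requests = [
    {
        "custom_id": case["id"],
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": case["input"]}],
        },
    }
    for case in test_cases
]
batch = client.messages.batches.create(requests=eval_requests)

# ... poll until batch.processing_status == "ended" (see polling loop above) ...

expected = {case["id"]: case["expected"] for case in test_cases}
results = {}
for row in client.messages.batches.results(batch.id):
    if row.result.type == "succeeded":
        output = row.result.message.content[0].text
        results[row.custom_id] = score(output, expected[row.custom_id])
pass_rate = sum(results.values()) / len(results)
print(f"Pass rate: {pass_rate:.1%}")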
Observability matters more in batch
In sync code, you see each error as it happens. In a batch, errors arrive in a flat JSONL hours later and you have to reason about them after the fact. A gateway or an eval platform that captures each batch entry as a trace makes this much easier: you get per-call latency, token counts, errors, and the actual prompt and response stored together, all queryable. See LLM Observability.
FAQ
How much does the Anthropic Batches API cost? 50% off the standard per-token rates for the model you batch against. Cache reads and writes get the same 50% discount applied.
How long does a batch take? SLA is up to 24 hours. Real-world latency is usually minutes to a few hours, depending on batch size and current load.
Can I use tool calling in batch? Single-turn tool calls work (the model returns a tool call; your application handles it later). Multi-turn agentic loops within one batch entry are not supported.
Can I cancel a batch? Yes, but cancellation is best-effort. Requests already in flight may complete and bill.
What models are supported? All current Messages API models: Opus 4.7, Sonnet 4.6, Haiku 4.5, plus older snapshots. Confirm the latest list in the Anthropic docs before launching.
Does prompt caching work in batches? Yes, and the savings stack. If 10,000 batch entries share a 20K-token system prompt, cache writes happen once and reads apply across the rest, at half price thanks to the batch discount.
Can I run a batch on Bedrock? AWS Bedrock supports batch inference for Claude models with a similar shape and discount. The API surface differs (you submit a batch inference job through AWS, results land in S3). See Anthropic API vs Bedrock Claude.