The Anthropic Message Batches API runs Claude calls asynchronously at half the per-token price. You submit up to 100,000 message requests in a single batch; Anthropic processes them within 24 hours; you poll for completion and download results. For offline evals, document processing, classification jobs, and anything else that does not need a streaming response right now, it cuts the bill in half without changing the model output.
This guide covers what the Batches API is good for, the real cost math (where it does and does not pay off), code patterns in Python and TypeScript, the operational gotchas, and how it compares to OpenAI's equivalent. As of May 2026, the Batches API supports all current models (Opus 4.7, Sonnet 4.6, Haiku 4.5); Bedrock offers an equivalent batch inference capability for Claude, though through a different API surface (see the FAQ).
For the broader cost picture, see How to reduce OpenAI API costs; for evals, see How to evaluate an LLM.
TL;DR
- 50% off all per-token rates for batch jobs that complete asynchronously.
- Up to 24 hours to complete; most batches finish much faster.
- Up to 100,000 requests per batch. Tier-dependent queue caps apply.
- Not supported in batch: streaming responses, tool use loops that require multi-turn interaction within a single batch entry (single-shot tool calls work).
- Right use cases: offline evals, document summarization at scale, classification, content generation pipelines, dataset labeling.
- Wrong use cases: interactive UX, agent loops, anything user-facing in real time.
When the Batches API is a no-brainer
The 50% discount only applies to workloads that can tolerate up to 24 hours of latency. That maps cleanly onto a few categories:
- Eval runs. You have 5,000 test cases and want to score a new prompt against all of them. Batch.
- Document processing. Summarize 50,000 PDFs, extract entities from a quarter's worth of legal documents, generate metadata for a media library. Batch.
- Classification at scale. Label a corpus, run sentiment on a backfill, tag content with categories. Batch.
- Synthetic data generation. Produce training data for a downstream model or eval set. Batch.
- Re-processing on model upgrade. When Sonnet 4.7 ships and you want to re-run all your generated content with the new model. Batch.
If a human is waiting at a screen for the answer, use the synchronous API. Everything else, default to batch.
The cost math
Take Sonnet 4.6 at base rates ($3 input, $15 output per million tokens). A batch job of 50,000 calls averaging 2K input tokens and 500 output tokens:
- Sync price: 50,000 x (2K x $3/MTok + 500 x $15/MTok) = 50,000 x ($0.006 + $0.0075) = $675
- Batch price (50% off): $337.50
Half the cost, same model output. The only thing you spend is wall-clock latency. For a nightly job that runs at 2 AM, you do not care.
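If you want to sanity-check the math against your own workload, here is a minimal sketch of the same arithmetic. The rates and volumes are the illustrative figures from above; swap in your own model's prices and token averages.

# Sketch: cost estimator for the worked example above.
# Rates are Sonnet 4.6 base prices in $/MTok; adjust for your model.
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00
BATCH_DISCOUNT = 0.5

def job_cost(n_requests, avg_in_tokens, avg_out_tokens, batch=False):
    per_call = (avg_in_tokens * INPUT_RATE + avg_out_tokens * OUTPUT_RATE) / 1_000_000
    total = n_requests * per_call
    return total * BATCH_DISCOUNT if batch else total

print(job_cost(50_000, 2_000, 500))              # 675.0
print(job_cost(50_000, 2_000, 500, batch=True))  # 337.5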
Prompt caching also stacks with the Batches API, although caching is most useful when you have many requests sharing a common prefix submitted close together in time. For batches with shared system prompts, both discounts apply.
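A sketch of what a cache-friendly batch entry can look like, assuming a long system prompt shared across entries (`SHARED_SYSTEM` and `documents` are placeholders for your own data):

# Sketch: batch entries sharing a cached system prompt.
# SHARED_SYSTEM is a hypothetical long prompt reused across every entry.
requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "system": [
                {
                    "type": "text",
                    "text": SHARED_SYSTEM,  # identical prefix across entries
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            "messages": [{"role": "user", "content": doc}],
        },
    }
    for i, doc in enumerate(documents)
]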
What is supported and what is not
Supported in batch:
- All Messages API models (Opus 4.x, Sonnet 4.x, Haiku 4.5).
- System prompts, multi-turn message arrays, images.
- Prompt caching (cache hits work across batch entries).
- Single-turn tool use (model emits a tool call, batch returns; you process the result in your application). See the sketch after these lists.
- Extended thinking on supported models.
Not supported in batch:
- Streaming responses. Batch is request/response only.
- Multi-turn tool use within a single batch request. If your agent needs the model to call tool A, see the result, call tool B, you cannot do that inside one batch entry. You can run each turn as a separate batch, but most agent loops do not fit batch's latency budget anyway.
- Real-time response. The whole point of batch is async.
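To make the single-turn tool-use point concrete: a batch entry's params follow the Messages API shape, so it can carry a tools definition, and the result row may contain a tool_use block that your application handles after the batch completes. A sketch, with a made-up get_weather tool:

# Sketch: a batch entry with a tool definition (get_weather is hypothetical).
tool_entry = {
    "custom_id": "weather-lookup-1",
    "params": {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": [
            {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "input_schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    },
}
# The result row may contain a tool_use content block; you execute the tool
# in your own code after the batch ends - there is no second model turn
# inside the batch.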
Python: submitting and polling a batch
import os
import time
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "messages": [
                {"role": "user", "content": f"Summarize: {documents[i]}"}
            ],
        },
    }
    for i in range(len(documents))
]

batch = client.messages.batches.create(requests=requests)
print(f"Submitted batch {batch.id} with {len(requests)} requests")

# Poll for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Status: {batch.processing_status} "
          f"({batch.request_counts.succeeded} succeeded, "
          f"{batch.request_counts.errored} errored)")
    if batch.processing_status == "ended":
        break
    time.sleep(30)

# Stream results (one JSONL row per request)
for result in client.messages.batches.results(batch.id):
    custom_id = result.custom_id
    if result.result.type == "succeeded":
        message = result.result.message
        print(f"{custom_id}: {message.content[0].text[:80]}")
    else:
        print(f"{custom_id}: failed - {result.result}")

Three things worth noting:
- `custom_id` is your handle for matching results back to inputs. Use document IDs, row keys, or anything stable.
- Polling cadence. Every 30-60 seconds is fine. Anthropic does not currently push webhooks for batch completion, so polling is the pattern.
- Results stream as JSONL. For very large batches, iterate; do not load all results into memory.
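For the last point, one approach is to write rows to disk as they stream instead of collecting them in a list. A minimal sketch reusing the `client` and `batch` from the example above; the output path is arbitrary:

import json

# Sketch: persist results row by row instead of holding them in memory.
with open("batch_results.jsonl", "w") as f:
    for result in client.messages.batches.results(batch.id):
        row = {"custom_id": result.custom_id, "status": result.result.type}
        if result.result.type == "succeeded":
            row["text"] = result.result.message.content[0].text
        f.write(json.dumps(row) + "\n")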
TypeScript: same flow
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const requests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  params: {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user" as const, content: `Summarize: ${doc}` }],
  },
}));

const batch = await client.messages.batches.create({ requests });
console.log(`Submitted batch ${batch.id}`);

// Poll
let current = batch;
while (current.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 30_000));
  current = await client.messages.batches.retrieve(batch.id);
  console.log(
    `Status: ${current.processing_status} (${current.request_counts.succeeded} ok, ${current.request_counts.errored} err)`,
  );
}

// Iterate results
for await (const result of await client.messages.batches.results(batch.id)) {
  if (result.result.type === "succeeded") {
    const msg = result.result.message;
    const firstBlock = msg.content[0];
    if (firstBlock.type === "text") {
      console.log(`${result.custom_id}: ${firstBlock.text.slice(0, 80)}`);
    }
  } else {
    console.log(`${result.custom_id}: failed`);
  }
}

Batch rate limits and queue size
The Batches API has its own quotas, separate from the Messages API limits. As of May 2026:
| Tier | Batch RPM | Max requests in processing queue | Max requests per batch |
|---|---|---|---|
| 1 | 50 | 100,000 | 100,000 |
| 2 | 1,000 | 200,000 | 100,000 |
| 3 | 2,000 | 300,000 | 100,000 |
| 4 | 4,000 | 500,000 | 100,000 |
The "in processing queue" cap is the total across all of your batches that have not yet completed. If you are running large overnight workloads, this is the number that constrains you, not the per-batch ceiling.
Gotchas worth knowing before you ship
- No streaming. Obvious but worth restating. If your downstream consumer wants partial tokens, batch is wrong.
- `custom_id` must be unique within a batch. Repeat keys cause the batch to be rejected at submit time.
- Failed individual requests do not fail the batch. Each request lands in the JSONL output as `succeeded` or `errored`. Process errors at the row level.
- Cancellation is best-effort. You can cancel a batch, but requests already in flight may still complete and be billed.
- Result retention is 29 days. Persist results to your own storage soon after completion if you need them long-term.
- Wall-clock SLA is 24 hours. Most batches finish in minutes to hours; budget for the worst case.
- Prompt caching across batch entries. If many batch entries share a system prompt, place the `cache_control` marker as usual; cache writes and reads behave the same. The savings compound with the batch discount.
- You still pay for cache writes. The 50% discount applies to base token rates, including cache write/read pricing.
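Because failures land per row rather than failing the batch, a common follow-up is to collect the errored `custom_id`s and resubmit just those rows. A sketch, reusing the `requests` list and `batch` from the Python example earlier:

# Sketch: resubmit only the rows that errored in the previous batch.
errored_ids = {
    r.custom_id
    for r in client.messages.batches.results(batch.id)
    if r.result.type == "errored"
}
retry_requests = [req for req in requests if req["custom_id"] in errored_ids]
if retry_requests:
    retry_batch = client.messages.batches.create(requests=retry_requests)
    print(f"Retrying {len(retry_requests)} requests in batch {retry_batch.id}")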
Comparison to OpenAI Batch API
OpenAI's Batch API offers the same 50% discount and the same async model with an up-to-24-hour SLA, but caps each batch at 50,000 requests (Anthropic allows 100,000). Differences worth knowing:
| Dimension | Anthropic Batches | OpenAI Batch |
|---|---|---|
| Discount | 50% | 50% |
| SLA | Up to 24h | Up to 24h |
| Max requests per batch | 100,000 | 50,000 |
| Submission format | JSON requests array in API call | JSONL file upload |
| Streaming | No | No |
| Tool use | Single-turn only | Single-turn only |
| Result retention | 29 days | Until you delete |
The Anthropic flow feels more API-native; OpenAI's flow goes through their Files API and resembles a job system. For most use cases, either is fine. If you are running both providers (good idea), a gateway can normalize the surface.
A practical batch pattern for evals
The Batches API was practically built for offline eval runs. The shape:
- Prepare a JSONL of test cases, each with an input and expected output (or a rubric).
- Submit one batch entry per test case with a `custom_id` matching the test case ID.
- Poll until complete.
- Score each row against the expected output using a separate scorer (could be another batch).
- Aggregate into pass/fail metrics.
Doing this synchronously on 5,000 test cases at Sonnet 4.6 sync rates costs roughly $70 (depending on prompt size). On the Batches API it costs roughly $35, and you do not chew through your interactive Messages API rate limits doing it. See How to evaluate an LLM for the broader pattern.
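A compressed sketch of that flow, assuming `test_cases` is a list of dicts with `id`, `input`, and `expected` fields, and `score()` is your own comparison function returning True or False:

# Sketch: offline eval run via the Batches API.
# test_cases and score() are assumptions - shape them to your own eval harness.
eval_requests = [
    {
        "custom_id": case["id"],
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": case["input"]}],
        },
    }
    for case in test_cases
]
batch = client.messages.batches.create(requests=eval_requests)

# ... poll until batch.processing_status == "ended" (see polling loop above) ...

expected = {case["id"]: case["expected"] for case in test_cases}
results = {}
for row in client.messages.batches.results(batch.id):
    if row.result.type == "succeeded":
        output = row.result.message.content[0].text
        results[row.custom_id] = score(output, expected[row.custom_id])
pass_rate = sum(results.values()) / len(results)
print(f"Pass rate: {pass_rate:.1%}")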
Observability matters more in batch
In sync code, you see each error as it happens. In a batch, errors arrive in a flat JSONL hours later and you have to reason about them after the fact. A gateway or an eval platform that captures each batch entry as a trace makes this much easier: you get per-call latency, token counts, errors, and the actual prompt and response stored together, all queryable. See LLM Observability.
FAQ
How much does the Anthropic Batches API cost? 50% off the standard per-token rates for the model you batch against. Cache reads and writes get the same 50% discount applied.
How long does a batch take? SLA is up to 24 hours. Real-world latency is usually minutes to a few hours, depending on batch size and current load.
Can I use tool calling in batch? Single-turn tool calls work (the model returns a tool call; your application handles it later). Multi-turn agentic loops within one batch entry are not supported.
Can I cancel a batch? Yes, but cancellation is best-effort. Requests already in flight may complete and bill.
What models are supported? All current Messages API models: Opus 4.7, Sonnet 4.6, Haiku 4.5, plus older snapshots. Confirm the latest list in the Anthropic docs before launching.
Does prompt caching work in batches? Yes, and the savings stack. If 10,000 batch entries share a 20K-token system prompt, cache writes happen once and reads apply across the rest, at half price thanks to the batch discount.
Can I run a batch on Bedrock? AWS Bedrock supports batch inference for Claude models with a similar shape and discount. The API surface differs (you submit a batch inference job through AWS, results land in S3). See Anthropic API vs Bedrock Claude.