Batching is the technique of grouping multiple inference requests together and processing them simultaneously through a model, rather than handling each request individually. It significantly improves hardware utilization and throughput, reducing the per-request cost of running LLMs in production.
When a single request is sent to an LLM, the GPU often has spare computational capacity that goes unused. The model's weights are loaded into memory, but much of the available parallel processing power sits idle because a single sequence cannot fully saturate the hardware. Batching addresses this by combining multiple requests into a single forward pass, allowing the GPU to process them in parallel and achieve much higher throughput.
There are several approaches to batching in LLM serving. Static batching collects a fixed number of requests and processes them together, but this introduces latency because the system must wait for enough requests to accumulate. Dynamic batching groups requests as they arrive within a short time window, balancing throughput with latency. Continuous batching, used by modern serving frameworks like vLLM, is the most sophisticated approach: it allows new requests to join an in-progress batch as existing requests finish, maximizing GPU utilization at all times.
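To make the continuous-batching idea concrete, here is a minimal, framework-agnostic scheduling loop (a sketch of the concept, not vLLM's actual implementation): after every decoding step, finished sequences leave the batch and waiting requests immediately take their slots.

```python
from collections import deque

class ContinuousBatcher:
    """Toy continuous-batching scheduler: finished requests leave the batch
    after any step, and waiting requests immediately take their slots."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # requests currently in the batch

    def submit(self, request):
        self.waiting.append(request)

    def step(self, decode_step_fn):
        # Fill any free slots with waiting requests before the next step.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return []
        # One decoding step over the whole batch (one new token per sequence).
        decode_step_fn(self.running)
        # Requests that hit EOS or their length limit return immediately.
        finished = [r for r in self.running if r.is_finished()]
        self.running = [r for r in self.running if not r.is_finished()]
        return finished
```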
Batching interacts with several other aspects of LLM serving. Larger batches require more GPU memory, creating a trade-off between batch size and the maximum sequence length that can be supported. The KV-cache for each request in a batch must fit in memory simultaneously, which is why memory-efficient techniques like PagedAttention are critical for enabling large batch sizes.
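As a rough sizing rule, the KV-cache grows linearly with both batch size and sequence length. The sketch below estimates it for illustrative 7B-class model dimensions in FP16; the exact numbers depend on the model architecture and precision.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 7B-class model in FP16: 32 layers, 32 KV heads, head_dim 128.
gib = kv_cache_bytes(batch_size=32, seq_len=2048, num_layers=32,
                     num_kv_heads=32, head_dim=128) / 2**30
print(f"~{gib:.0f} GiB of KV-cache")  # ~32 GiB: batch size trades off against context length
```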
For production systems, choosing the right batching strategy depends on the workload pattern. Real-time applications like chatbots need low latency and benefit from continuous batching. Offline tasks like document processing can tolerate higher latency and benefit from large static batches that maximize throughput.
The serving system receives individual inference requests from users or applications. Rather than processing each one immediately, it buffers them in a queue for a short, configurable window.
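A minimal sketch of that buffering step, assuming a thread-safe queue of incoming requests and two tunable knobs, max_batch_size and max_wait_ms (both hypothetical parameter names):

```python
import queue
import time

def collect_batch(request_queue, max_batch_size=8, max_wait_ms=10):
    """Pull requests off the queue until the batch is full or the wait window expires."""
    batch = [request_queue.get()]                # block until at least one request arrives
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```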
Once enough requests have accumulated or the time window expires, the system groups them into a batch. Requests may be padded to a common length so they form a rectangular batch, while more sophisticated systems handle variable-length sequences without padding.
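Padding to a common length is the simplest way to form that rectangular batch; here is a sketch using PyTorch's pad_sequence, with an attention mask marking which positions hold real tokens:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_batch(token_id_lists, pad_token_id=0):
    """Right-pad variable-length token sequences into one (batch, max_len) tensor."""
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in token_id_lists]
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_token_id)
    # Simple mask: real tokens are 1, padding is 0 (assumes pad_token_id is not a real token).
    attention_mask = (input_ids != pad_token_id).long()
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[101, 2023, 2003], [101, 7592]])
# input_ids has shape (2, 3); the shorter sequence is padded with pad_token_id.
```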
The batch is sent through the model in a single forward pass. The GPU processes all sequences in the batch simultaneously, leveraging its massive parallelism to handle multiple requests with only marginally more time than a single request.
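With a padded batch in hand, a single call runs all sequences at once. The sketch below uses Hugging Face Transformers as one concrete stack; the model name and generation settings are illustrative, not prescriptive.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
tokenizer.padding_side = "left"                    # left-pad so generation starts from real tokens
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["Summarize: batching groups requests", "Translate to French: hello"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# One generate call processes the whole batch in parallel on the GPU.
outputs = model.generate(**inputs, max_new_tokens=32,
                         pad_token_id=tokenizer.pad_token_id)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```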
After the forward pass completes, the system separates the batch output into individual responses and returns each result to the appropriate requester. In continuous batching, completed requests are returned immediately while the batch continues processing remaining sequences.
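Routing results back to the right caller is usually done by keeping each request's future alongside its position in the batch. A minimal asyncio sketch, where the Request class and its field names are assumptions for illustration:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    future: asyncio.Future   # resolved when this request's text is ready

def scatter_results(batch, decoded_texts):
    """Match each decoded output to the request that produced it."""
    for request, text in zip(batch, decoded_texts):
        request.future.set_result(text)

async def handle(prompt, submit_fn):
    # Each caller awaits only its own future, even though the model
    # processed many prompts in the same forward pass.
    fut = asyncio.get_running_loop().create_future()
    submit_fn(Request(prompt=prompt, future=fut))
    return await fut
```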
A search company needs to generate embeddings for 10 million documents. Instead of sending one document at a time, they batch 256 documents per request, reducing total processing time from days to hours and cutting API costs significantly.
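The same pattern applies client-side: rather than one call per document, send fixed-size chunks. In the sketch below, embed_batch is a placeholder for whatever embedding API is actually in use.

```python
def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_corpus(documents, embed_batch, batch_size=256):
    """embed_batch(list_of_texts) -> list_of_vectors; placeholder for a real API call."""
    embeddings = []
    for batch in chunks(documents, batch_size):
        embeddings.extend(embed_batch(batch))   # one request covers up to 256 documents
    return embeddings
```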
A customer service platform handles hundreds of concurrent chat sessions. Continuous batching allows the LLM server to process all active conversations simultaneously, maintaining low latency per user while efficiently using GPU resources across all sessions.
A financial services firm generates personalized investment reports for thousands of clients. They submit all report generation requests as a large batch during off-peak hours, taking advantage of maximum batch sizes for the highest possible throughput at the lowest cost per report.
Batching is one of the most impactful optimizations for running LLMs cost-effectively at scale. Without batching, GPU utilization can drop below 30%, meaning organizations pay for expensive hardware that mostly sits idle. Proper batching can improve throughput by 5-10x without adding hardware, directly reducing the cost of serving AI applications in production.
Respan helps you monitor batching efficiency in your LLM infrastructure by tracking throughput, latency percentiles, queue depths, and GPU utilization in real time. Identify bottlenecks, find the optimal batch size for your workload, and set alerts when batching performance degrades.
Try Respan free