LLM inference is what happens when you call an LLM API or run a model locally — the process of producing model output for a given input. Understanding inference matters because it's where cost, latency, and quality are determined. The model and the prompt are knobs you tune; the inference pipeline is what actually runs.
Most application engineers consume inference through a hosted API (OpenAI, Anthropic, etc.), so the implementation is hidden. But the patterns and tradeoffs leak through in pricing, latency, and behavior, which makes them worth understanding.
TL;DR
LLM inference takes:
- An input prompt (text or multimodal)
- A model and parameters (temperature, max_tokens, etc.)
And produces:
- An output completion (token by token)
- Usage data (input tokens, output tokens, cost, latency)
The end-to-end latency has two main components: TTFT (time to first token, perceived responsiveness) and generation time (time to complete the response). Both matter, for different reasons.
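A minimal sketch of that loop using the OpenAI Python SDK; the model id is a placeholder taken from this article's pricing tiers, so substitute whatever you actually run. Any provider's SDK exposes the same shape: prompt and parameters in, completion and usage data out.

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder id from this article's tiers
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
    temperature=0.2,
    max_tokens=200,
)
latency = time.perf_counter() - start

print(response.choices[0].message.content)   # the completion
print(response.usage.prompt_tokens,          # input tokens
      response.usage.completion_tokens,      # output tokens
      f"{latency:.2f}s")                     # end-to-end latency
```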
How inference works (briefly)
Every LLM is a transformer-based neural network that generates output one token at a time. Each new token is predicted from the entire prompt plus everything generated so far; the model keeps that context in an internal state (the KV cache) rather than reprocessing it from scratch on every step.
Two phases:
- Prefill — process the input prompt to build the model's internal state. Cost scales with input length. This is where TTFT is determined.
- Decode — generate output tokens one at a time. Cost per token is roughly constant, so total cost scales linearly with output length.
This split explains pricing: inputs are cheaper than outputs because prefill is a parallelizable bulk operation while decode is sequential and slower.
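A back-of-envelope model makes the split concrete. The throughput numbers below are illustrative assumptions, not benchmarks for any particular model or provider:

```python
def estimate_latency(input_tokens: int, output_tokens: int,
                     prefill_tps: float = 5_000, decode_tps: float = 150) -> tuple[float, float]:
    """Back-of-envelope latency model. Throughput figures are assumptions for illustration."""
    ttft = input_tokens / prefill_tps        # prefill: one parallel pass over the whole prompt
    generation = output_tokens / decode_tps  # decode: one sequential step per output token
    return ttft, ttft + generation

# A 2,000-token prompt with a 500-token response:
ttft, total = estimate_latency(2_000, 500)
print(f"TTFT ~{ttft:.1f}s, total ~{total:.1f}s")  # ~0.4s and ~3.7s with these assumptions
```

Doubling the prompt mostly moves TTFT; doubling the response mostly moves generation time.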
Latency: TTFT vs generation time
For streaming interfaces (chat UI, real-time agents), users care about TTFT — how long until the first character appears. Common values in 2026:
| Model | TTFT P50 | TTFT P95 |
|---|---|---|
| Claude Haiku 4.5 | ~300ms | ~700ms |
| Claude Sonnet 4.6 | ~400ms | ~1.0s |
| GPT-5.4 | ~500ms | ~1.3s |
| GPT-5.5 | ~700ms | ~2.0s |
| Reasoning-heavy paths | 5-30s | 10-60s |
TTFT is dominated by prefill (input processing) and provider load. Larger context windows and reasoning models increase TTFT.
Generation time is the time from first to last token. It scales linearly with output length at the model's per-token throughput. For a 500-token response on a $3/$15 model, you'll see ~3-5 seconds total.
The gap between P95 and P99 generation time is usually where production pain hides. Long-tail generations can be 5-10× the P95 — invisible in averages but felt by some fraction of users.
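Both numbers are easy to measure client-side with a streaming call, which is worth doing against your own traffic rather than trusting published medians. A sketch with the OpenAI SDK; the model id is a placeholder:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None
chunks = []

stream = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder id
    messages=[{"role": "user", "content": "Write a short product update."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT: request start to first content
    chunks.append(delta)
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.2f}s, generation: {end - first_token_at:.2f}s")
```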
Pricing components
Inference cost is mostly:
- Input tokens — usually $0.20 to $5 per million (May 2026 frontier prices)
- Output tokens — usually $1 to $30 per million, often 5-10× input price
Output tokens drive most cost in production because:
- A typical response is much longer than the prompt
- Prefill (input processing) is parallelizable; decode is sequential and slower per token
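The per-request arithmetic is simple enough to keep as a helper and run against your own traffic profile. The $3/$15 prices below match the mid-tier example used elsewhere in this article; the token counts are assumptions:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A $3/$15 model, 1,500 input tokens and 600 output tokens per request:
per_request = request_cost(1_500, 600, input_price=3.00, output_price=15.00)
print(f"${per_request:.4f} per request")                         # $0.0135, output is 2/3 of it
print(f"${per_request * 1_000_000:,.0f} per million requests")   # $13,500
```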
Two pricing optimization patterns:
Prompt caching — providers cache the model's internal state for a stable prefix of your prompts. If your system prompt doesn't change between requests, the cached portion is billed at ~10% of the standard input rate. For applications with consistent system prompts, that works out to a 5-9× reduction in input cost.
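What this looks like varies by provider. As one example, Anthropic's Messages API lets you mark the stable prefix explicitly with `cache_control` (a sketch with a placeholder model id); OpenAI caches long stable prefixes automatically. Check your provider's docs for the exact mechanism.

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a support assistant for Acme. Policies: ..."  # stable, reused text

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id from this article's tiers
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
)
# Usage reports cache activity, so you can verify the discount is actually landing.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

Note that providers generally require the cached prefix to exceed a minimum length (on the order of 1,000 tokens), so very short system prompts see no benefit.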
Batch processing — submit many requests asynchronously, get them back within 24 hours. Both major providers offer 50% off batch processing.
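As a sketch, Anthropic's Message Batches API takes a list of otherwise-normal requests tagged with a `custom_id` (roughly the shape at the time of writing; OpenAI's Batch API is similar but file-based, so check current docs before relying on either):

```python
import anthropic

client = anthropic.Anthropic()
tickets = ["Login page 500s on submit", "Refund for last month's invoice not received"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"ticket-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder id
                "max_tokens": 300,
                "messages": [{"role": "user", "content": f"Classify this ticket: {text}"}],
            },
        }
        for i, text in enumerate(tickets)
    ]
)
print(batch.id, batch.processing_status)  # poll later; results arrive within 24 hours
```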
Inference at scale
If you're running serious volume (1M+ requests/day), three patterns to know:
- Tier the right model to the right task. GPT-5.4 nano at $0.20/$1.25 for high-volume background work. Sonnet 4.6 at $3/$15 for production-quality output. Opus 4.7 at $5/$25 for the hardest tasks. Use a gateway to route automatically (a minimal routing sketch follows this list).
- Cache aggressively. Stable system prompts → 10× input cost reduction. Use semantic caching carefully (false positives ship stale answers).
- Batch when async is acceptable. 50% off both providers' rates.
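A gateway handles the routing for you, but the core of model tiering is just a lookup table with a deliberate default. The model ids below are placeholders matching the tiers above:

```python
# Hypothetical tiering table: map each task type to the cheapest model that handles it well.
MODEL_TIERS = {
    "classification":   "gpt-5.4-nano",      # high-volume background work
    "drafting":         "claude-sonnet-4-6",  # production-quality generation
    "complex_analysis": "claude-opus-4-7",    # hardest tasks only
}

def pick_model(task_type: str) -> str:
    """Route each request to its tier; unknown tasks fall back to the mid tier (a policy choice)."""
    return MODEL_TIERS.get(task_type, "claude-sonnet-4-6")

print(pick_model("classification"))  # gpt-5.4-nano
print(pick_model("legal_review"))    # claude-sonnet-4-6 (unknown task, mid-tier default)
```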
Self-hosted inference
For high-volume workloads or data residency requirements, running open-weight models (through an inference provider or on your own GPUs) can beat frontier API pricing:
- Open-weight models — DeepSeek V3, Llama, Qwen, Mistral models are downloadable
- Inference providers — Together AI, Fireworks, Groq, Replicate offer hosted open-weight inference at competitive prices
- Self-hosted on your own GPUs — best at extreme volume; ops burden is real
The math depends on your scale. Below ~10M tokens/day, hosted APIs are usually cheaper. Above ~100M tokens/day, self-hosted starts winning.
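The break-even calculation is worth redoing with your own numbers. Every figure below (blended token price, GPU rate, ops overhead) is an assumption for illustration, not a benchmark:

```python
def monthly_cost_api(tokens_per_day: float, blended_price_per_mtok: float) -> float:
    """Hosted API: pay per token. Blended price = usage-weighted input/output $/Mtok."""
    return tokens_per_day * 30 / 1_000_000 * blended_price_per_mtok

def monthly_cost_self_hosted(gpu_count: int, gpu_hourly: float, ops_overhead: float = 1.3) -> float:
    """Self-hosted: GPUs cost money whether or not they're busy; add ~30% for ops (assumption)."""
    return gpu_count * gpu_hourly * 24 * 30 * ops_overhead

# Illustrative only: 100M tokens/day at a $6/Mtok blended rate vs. four GPUs at $2.50/hour.
print(f"API:         ${monthly_cost_api(100e6, 6.0):,.0f}/month")        # ~$18,000
print(f"Self-hosted: ${monthly_cost_self_hosted(4, 2.50):,.0f}/month")   # ~$9,360
```

With these assumptions self-hosting wins at that volume, but a cheaper blended API rate or lower utilization flips the answer, which is why the thresholds above are rough.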
Inference observability
What to measure in production:
- TTFT P50/P95/P99 — user-perceived responsiveness
- Generation time P50/P95/P99 — cost and capacity proxy
- Tokens per request — sliced by model, feature, user
- Cost per active user — the real product metric
- Error rate by model — provider reliability tracking
- Eval score over time — quality regression detection
Without these, you're flying blind. See LLM Observability for the full setup.
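A minimal starting point, assuming you already log one record per request. The field names are illustrative, and eval scores come from your eval pipeline rather than request logs:

```python
from dataclasses import dataclass
import statistics

@dataclass
class RequestLog:
    user_id: str
    model: str
    ttft_s: float
    total_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    error: bool

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; fine for a dashboard, swap in numpy at volume."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def summarize(logs: list[RequestLog]) -> dict:
    ok = [r for r in logs if not r.error]
    users = {r.user_id for r in logs}
    return {
        "ttft_p95_s": percentile([r.ttft_s for r in ok], 95),
        "ttft_p99_s": percentile([r.ttft_s for r in ok], 99),
        "generation_p99_s": percentile([r.total_s - r.ttft_s for r in ok], 99),
        "tokens_per_request": statistics.mean(r.input_tokens + r.output_tokens for r in ok),
        "cost_per_active_user": sum(r.cost_usd for r in logs) / max(len(users), 1),
        "error_rate": sum(r.error for r in logs) / len(logs),
    }
```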
Common mistakes
- Optimizing for generation time when users care about TTFT. Streaming interfaces need TTFT first.
- Ignoring the P95-to-P99 gap. Long-tail generations are where customer experience pain lives.
- No prompt caching. A 10× cost reduction left on the table.
- Wrong model tier for the task. Running everything through GPT-5.5 when GPT-5.4 nano would do.
- No multi-provider fallback. First provider outage takes your app down.
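On that last point, a fallback doesn't need to be elaborate. A sketch of the minimal version, with placeholder model ids (a gateway gives you this for free):

```python
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def complete_with_fallback(prompt: str) -> str:
    """Try the primary provider; on an API error, retry once against the secondary."""
    try:
        resp = anthropic_client.messages.create(
            model="claude-sonnet-4-6",  # placeholder id
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.APIError:
        resp = openai_client.chat.completions.create(
            model="gpt-5.4",  # placeholder id
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```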
FAQ
What's the difference between training and inference? Training builds the model (one-time, expensive). Inference uses the model to generate output (per-request, ongoing).
What is TTFT? Time to first token — how long from request to the first character of the response. Most important latency metric for streaming interfaces.
Why are output tokens more expensive than input? Output is generated sequentially (one token at a time), while input is processed in parallel during prefill. The decode phase is slower per token, hence pricier.
What's prompt caching? Provider-side caching of the model's internal state for the prefix of your prompts. Cached input tokens cost ~10% of standard rate. Available on most major providers.
Should I self-host inference? Below ~10M tokens/day, no — hosted APIs are cheaper. Above ~100M tokens/day or for data residency requirements, yes.
How do I reduce inference cost? Tier the right model per task, enable caching, use batching for async workloads, and route via a gateway so model swaps are config, not code.