LLM inference is what happens when you call an LLM API or run a model locally — the process of producing model output for a given input. Understanding inference matters because it's where cost, latency, and quality are determined. The model and the prompt are knobs you tune; the inference pipeline is what actually runs.
Most application engineers consume inference through a hosted API (OpenAI, Anthropic, etc.), so the implementation is hidden. But the patterns and tradeoffs leak through in pricing, latency, and behavior, which makes them worth understanding.
TL;DR
LLM inference takes:
- An input prompt (text or multimodal)
- A model and parameters (temperature, max_tokens, etc.)
And produces:
- An output completion (token by token)
- Usage data (input tokens, output tokens, cost, latency)
The end-to-end latency has two main components: TTFT (time to first token, perceived responsiveness) and generation time (time to complete the response). Both matter, for different reasons.
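A minimal sketch of that loop using the OpenAI Python SDK; the model id is a placeholder taken from this article's pricing tiers, so substitute whatever you actually run. Any provider's SDK exposes the same shape: prompt and parameters in, completion and usage data out.

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder id from this article's tiers
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
    temperature=0.2,
    max_tokens=200,
)
latency = time.perf_counter() - start

print(response.choices[0].message.content)   # the completion
print(response.usage.prompt_tokens,          # input tokens
      response.usage.completion_tokens,      # output tokens
      f"{latency:.2f}s")                     # end-to-end latency
```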
How inference works (briefly)
Every LLM is a transformer-based neural network that generates output one token at a time. Each new token is predicted from the entire prompt plus everything generated so far; the model keeps that context in an internal state (the KV cache) rather than reprocessing it from scratch on every step.
Two phases:
- Prefill — process the input prompt to build the model's internal state. Cost scales with input length. This is where TTFT is determined.
- Decode — generate output tokens one at a time. Cost per token is roughly constant, so total cost scales linearly with output length.
This split explains pricing: inputs are cheaper than outputs because prefill is a parallelizable bulk operation while decode is sequential and slower.
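A back-of-envelope model makes the split concrete. The throughput numbers below are illustrative assumptions, not benchmarks for any particular model or provider:

```python
def estimate_latency(input_tokens: int, output_tokens: int,
                     prefill_tps: float = 5_000, decode_tps: float = 150) -> tuple[float, float]:
    """Back-of-envelope latency model. Throughput figures are assumptions for illustration."""
    ttft = input_tokens / prefill_tps        # prefill: one parallel pass over the whole prompt
    generation = output_tokens / decode_tps  # decode: one sequential step per output token
    return ttft, ttft + generation

# A 2,000-token prompt with a 500-token response:
ttft, total = estimate_latency(2_000, 500)
print(f"TTFT ~{ttft:.1f}s, total ~{total:.1f}s")  # ~0.4s and ~3.7s with these assumptions
```

Doubling the prompt mostly moves TTFT; doubling the response mostly moves generation time.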
Latency: TTFT vs generation time
For streaming interfaces (chat UI, real-time agents), users care about TTFT — how long until the first character appears. Common values in 2026:
| Model | TTFT P50 | TTFT P95 |
|---|---|---|
| Claude Haiku 4.5 | ~300ms | ~700ms |
| Claude Sonnet 4.6 | ~400ms | ~1.0s |
| GPT-5.4 | ~500ms | ~1.3s |
| GPT-5.5 | ~700ms | ~2.0s |
| Reasoning-heavy paths | 5-30s | 10-60s |
TTFT is dominated by prefill (input processing) and provider load. Larger context windows and reasoning models increase TTFT.
Generation time is the time from first to last token. It scales linearly with output length at the model's per-token throughput. For a 500-token response on a $3/$15 model, you'll see ~3-5 seconds total.
The gap between P95 and P99 generation time is usually where production pain hides. Long-tail generations can be 5-10× the P95 — invisible in averages but felt by some fraction of users.
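Both numbers are easy to measure client-side with a streaming call, which is worth doing against your own traffic rather than trusting published medians. A sketch with the OpenAI SDK; the model id is a placeholder:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None
chunks = []

stream = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder id
    messages=[{"role": "user", "content": "Write a short product update."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT: request start to first content
    chunks.append(delta)
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.2f}s, generation: {end - first_token_at:.2f}s")
```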
Pricing components
Inference cost is mostly:
- Input tokens — usually $0.20 to $5 per million (May 2026 frontier prices)
- Output tokens — usually $1 to $30 per million, often 5-10× input price
Output tokens drive most cost in production because:
- A typical response is much longer than the prompt
- Prefill (input processing) is parallelizable; decode is sequential and slower per token
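The per-request arithmetic is simple enough to keep as a helper and run against your own traffic profile. The $3/$15 prices below match the mid-tier example used elsewhere in this article; the token counts are assumptions:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A $3/$15 model, 1,500 input tokens and 600 output tokens per request:
per_request = request_cost(1_500, 600, input_price=3.00, output_price=15.00)
print(f"${per_request:.4f} per request")                         # $0.0135, output is 2/3 of it
print(f"${per_request * 1_000_000:,.0f} per million requests")   # $13,500
```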
Two pricing optimization patterns:
Prompt caching — providers cache the model's internal state for a stable prefix of your prompts. If your system prompt doesn't change between requests, the cached portion is billed at ~10% of the standard input rate. For applications with consistent system prompts, that works out to a 5-9× reduction in input cost.
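What this looks like varies by provider. As one example, Anthropic's Messages API lets you mark the stable prefix explicitly with `cache_control` (a sketch with a placeholder model id); OpenAI caches long stable prefixes automatically. Check your provider's docs for the exact mechanism.

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "You are a support assistant for Acme. Policies: ..."  # stable, reused text

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id from this article's tiers
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
)
# Usage reports cache activity, so you can verify the discount is actually landing.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

Note that providers generally require the cached prefix to exceed a minimum length (on the order of 1,000 tokens), so very short system prompts see no benefit.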
Batch processing — submit many requests asynchronously, get them back within 24 hours. Both major providers offer 50% off batch processing.
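As a sketch, Anthropic's Message Batches API takes a list of otherwise-normal requests tagged with a `custom_id` (roughly the shape at the time of writing; OpenAI's Batch API is similar but file-based, so check current docs before relying on either):

```python
import anthropic

client = anthropic.Anthropic()
tickets = ["Login page 500s on submit", "Refund for last month's invoice not received"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"ticket-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder id
                "max_tokens": 300,
                "messages": [{"role": "user", "content": f"Classify this ticket: {text}"}],
            },
        }
        for i, text in enumerate(tickets)
    ]
)
print(batch.id, batch.processing_status)  # poll later; results arrive within 24 hours
```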
Inference at scale
If you're running serious volume (1M+ requests/day), three patterns to know:
- Tier the right model to the right task. GPT-5.4 nano at $0.20/$1.25 for high-volume background work. Sonnet 4.6 at $3/$15 for production-quality output. Opus 4.7 at $5/$25 for the hardest tasks. Use a gateway to route automatically (a minimal routing sketch follows this list).
- Cache aggressively. Stable system prompts → 10× input cost reduction. Use semantic caching carefully (false positives ship stale answers).
- Batch when async is acceptable. 50% off both providers' rates.
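A gateway handles the routing for you, but the core of model tiering is just a lookup table with a deliberate default. The model ids below are placeholders matching the tiers above:

```python
# Hypothetical tiering table: map each task type to the cheapest model that handles it well.
MODEL_TIERS = {
    "classification":   "gpt-5.4-nano",      # high-volume background work
    "drafting":         "claude-sonnet-4-6",  # production-quality generation
    "complex_analysis": "claude-opus-4-7",    # hardest tasks only
}

def pick_model(task_type: str) -> str:
    """Route each request to its tier; unknown tasks fall back to the mid tier (a policy choice)."""
    return MODEL_TIERS.get(task_type, "claude-sonnet-4-6")

print(pick_model("classification"))  # gpt-5.4-nano
print(pick_model("legal_review"))    # claude-sonnet-4-6 (unknown task, mid-tier default)
```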
Self-hosted inference
For high-volume workloads or data residency requirements, running open-weight models (through an inference provider or on your own GPUs) can beat frontier API pricing:
- Open-weight models — DeepSeek V3, Llama, Qwen, Mistral models are downloadable
- Inference providers — Together AI, Fireworks, Groq, Replicate offer hosted open-weight inference at competitive prices
- Self-hosted on your own GPUs — best at extreme volume; ops burden is real
The math depends on your scale. Below ~10M tokens/day, hosted APIs are usually cheaper. Above ~100M tokens/day, self-hosted starts winning.
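The break-even calculation is worth redoing with your own numbers. Every figure below (blended token price, GPU rate, ops overhead) is an assumption for illustration, not a benchmark:

```python
def monthly_cost_api(tokens_per_day: float, blended_price_per_mtok: float) -> float:
    """Hosted API: pay per token. Blended price = usage-weighted input/output $/Mtok."""
    return tokens_per_day * 30 / 1_000_000 * blended_price_per_mtok

def monthly_cost_self_hosted(gpu_count: int, gpu_hourly: float, ops_overhead: float = 1.3) -> float:
    """Self-hosted: GPUs cost money whether or not they're busy; add ~30% for ops (assumption)."""
    return gpu_count * gpu_hourly * 24 * 30 * ops_overhead

# Illustrative only: 100M tokens/day at a $6/Mtok blended rate vs. four GPUs at $2.50/hour.
print(f"API:         ${monthly_cost_api(100e6, 6.0):,.0f}/month")        # ~$18,000
print(f"Self-hosted: ${monthly_cost_self_hosted(4, 2.50):,.0f}/month")   # ~$9,360
```

With these assumptions self-hosting wins at that volume, but a cheaper blended API rate or lower utilization flips the answer, which is why the thresholds above are rough.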
Inference observability
What to measure in production:
- TTFT P50/P95/P99 — user-perceived responsiveness
- Generation time P50/P95/P99 — cost and capacity proxy
- Tokens per request — sliced by model, feature, user
- Cost per active user — the real product metric
- Error rate by model — provider reliability tracking
- Eval score over time — quality regression detection
Without these, you're flying blind. See LLM Observability for the full setup.
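A minimal starting point, assuming you already log one record per request. The field names are illustrative, and eval scores come from your eval pipeline rather than request logs:

```python
from dataclasses import dataclass
import statistics

@dataclass
class RequestLog:
    user_id: str
    model: str
    ttft_s: float
    total_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    error: bool

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; fine for a dashboard, swap in numpy at volume."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def summarize(logs: list[RequestLog]) -> dict:
    ok = [r for r in logs if not r.error]
    users = {r.user_id for r in logs}
    return {
        "ttft_p95_s": percentile([r.ttft_s for r in ok], 95),
        "ttft_p99_s": percentile([r.ttft_s for r in ok], 99),
        "generation_p99_s": percentile([r.total_s - r.ttft_s for r in ok], 99),
        "tokens_per_request": statistics.mean(r.input_tokens + r.output_tokens for r in ok),
        "cost_per_active_user": sum(r.cost_usd for r in logs) / max(len(users), 1),
        "error_rate": sum(r.error for r in logs) / len(logs),
    }
```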
Common mistakes
- Optimizing for generation time when users care about TTFT. Streaming interfaces need TTFT first.
- Ignoring the P95-to-P99 gap. Long-tail generations are where customer experience pain lives.
- No prompt caching. A 10× cost reduction left on the table.
- Wrong model tier for the task. Running everything through GPT-5.5 when GPT-5.4 nano would do.
- No multi-provider fallback. First provider outage takes your app down.
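On that last point, a fallback doesn't need to be elaborate. A sketch of the minimal version, with placeholder model ids (a gateway gives you this for free):

```python
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def complete_with_fallback(prompt: str) -> str:
    """Try the primary provider; on an API error, retry once against the secondary."""
    try:
        resp = anthropic_client.messages.create(
            model="claude-sonnet-4-6",  # placeholder id
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.APIError:
        resp = openai_client.chat.completions.create(
            model="gpt-5.4",  # placeholder id
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```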
FAQ
What's the difference between training and inference? Training builds the model (one-time, expensive). Inference uses the model to generate output (per-request, ongoing).
What is TTFT? Time to first token — how long from request to the first character of the response. Most important latency metric for streaming interfaces.
Why are output tokens more expensive than input? Output is generated sequentially (one token at a time), while input is processed in parallel during prefill. The decode phase is slower per token, hence pricier.
What's prompt caching? Provider-side caching of the model's internal state for the prefix of your prompts. Cached input tokens cost ~10% of standard rate. Available on most major providers.
Should I self-host inference? Below ~10M tokens/day, no — hosted APIs are cheaper. Above ~100M tokens/day or for data residency requirements, yes.
How do I reduce inference cost? Tier the right model per task, enable caching, use batching for async workloads, and route via a gateway so model swaps are config, not code.