TL;DR
An LLM gateway is a transparent proxy that gives your application one OpenAI-compatible endpoint for many model providers, plus authentication, budgets, caching, fallback, and observability. The eight serious choices in 2026 are LiteLLM, Portkey, OpenRouter, Helicone, TrueFoundry, Kong AI, Braintrust, and Respan. They diverge on self-host story, observability depth, and what governance ships out of the box. Comparison table and decision tree below.
What an LLM gateway actually does
Strip away the marketing and an LLM gateway is doing four jobs at the same time. It accepts a request in one format (the OpenAI chat completions format won that race), routes it to the right model provider, applies a set of policies on the way in and out (auth, rate limits, budgets, PII redaction, caching), and writes a trace so somebody can debug what happened later. That's it.
The interesting variation between gateways is which of those jobs they do well, which they barely do, and what they bundle on top — observability, evals, prompt management, sometimes a full agent runtime. The hard question is not "what is a gateway?"; it's which of those bundled jobs your team actually needs.
Why you start needing one
Teams hit the gateway-shaped problem at roughly the same five milestones. If you recognize three or more of these, you've already outgrown direct provider SDK calls.
- The second provider. The day you add Anthropic next to OpenAI is the day your app starts caring about request schemas, retry behavior, streaming differences, and tool-call formats that don't match. Either your app handles all that, or the gateway does.
- The second team. Marketing wants Claude Sonnet for copy. The backend team uses GPT-5 mini for classification. Both have separate API keys with no shared budget. Cost goes up linearly with team count and nobody can see who is spending what.
- The first outage. OpenAI degrades for 40 minutes. Your customer-support agent stops working. You have no failover plan and 200 angry support tickets. The gateway is where the fallback rule lives.
- The CFO question. "Why did our LLM bill go from $12k to $48k last month?" Without per-key, per-team, per-customer attribution, you cannot answer this. Most provider dashboards show you the bill but not the breakdown.
- The compliance ask. Legal wants every prompt and response that involved a customer to be logged in your system, not a third party. SOC 2, HIPAA, GDPR. The gateway is the choke point where that logging happens — and the only realistic place to enforce PII redaction before data leaves your network.

The thing the "LLM gateway" category gets wrong is treating it as a pure transport layer — yet another API proxy. In 2026 the actual unit of work is no longer the chat completion; it's the agent run with twelve chat completions, eight tool calls, and three sub-agent recursions. The gateway has to be aware of that, because the interesting decisions — fallback, budget enforcement, eval replay — happen at the agent level, not the call level.
Which is why the gateway-as-standalone-product story is dying. Most teams will end up wanting one tool that handles gateway, tracing, evals, and prompt management together. The four-tool stitched stack is how bugs hide.
Teams running LLM apps in production with Respan
Tired of stitching LiteLLM + LangSmith + Promptfoo?
Respan ships gateway, tracing, evals, and prompt management as one product with one auth and one bill. ~35% average cost reduction via routing and caching.
Try Respan free
The seven things a production gateway must do
Most posts on this topic list "features" without telling you which ones are table stakes and which are nice-to-have. Here's the practical cut.
1. Unified OpenAI-compatible endpoint
Every gateway worth evaluating accepts /v1/chat/completions in OpenAI format and translates per provider. This is the price of admission — if a tool doesn't do this, it's not a gateway. Watch for edge cases around streaming, tool calls (function calling), and structured outputs; the translation is rougher than vendors admit.
2. Per-key budgets and rate limits
Virtual API keys that you can issue per developer, per environment, or per customer — each with its own monthly budget and rate limit. Without this, you cannot do shadow-IT cleanup or cost attribution. LiteLLM and Portkey have mature implementations; OpenRouter handles this through credit ledgers; Helicone exposes it via unified billing.
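To make that concrete, here is a sketch of issuing a budget-capped virtual key through a gateway admin API. The endpoint path and field names follow LiteLLM's /key/generate convention, but treat them as illustrative and check your gateway's own admin API docs before copying.

```python
import os
import requests

# Illustrative sketch: mint a scoped virtual key with a monthly budget.
# Endpoint path and field names follow LiteLLM's /key/generate convention;
# your gateway's admin API may differ.
ADMIN_URL = os.environ.get("GATEWAY_ADMIN_URL", "http://localhost:4000")
ADMIN_KEY = os.environ["GATEWAY_ADMIN_KEY"]

resp = requests.post(
    f"{ADMIN_URL}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={
        "key_alias": "marketing-prod",              # who this key belongs to
        "models": ["claude-sonnet", "gpt-5-mini"],  # allow-list of models
        "max_budget": 500.0,                        # USD cap before requests are rejected
        "budget_duration": "30d",                   # budget resets monthly
        "rpm_limit": 60,                            # requests per minute
    },
    timeout=10,
)
resp.raise_for_status()
virtual_key = resp.json()["key"]  # hand this to the team, never a raw provider key
```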
3. Automatic provider fallback
When OpenAI returns a 503 or your request hits a rate limit, the gateway retries on a configured backup model. Most gateways do this; the difference is how granular the rules are. Can you fall back to a different provider entirely? A different model within the same provider? Different region? Different timeout per fallback step? Pay attention to the configuration surface, not the marketing checkbox.
Anthropic 503 → Bedrock takes over. Your app sees a successful 200. Same model, different cloud. The detour adds a small amount of latency only in the failure case.
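For intuition, this is roughly the loop the gateway runs on your behalf; in practice the rule lives in gateway config, not application code. The base URLs below are placeholders and the model names are examples.

```python
from openai import OpenAI, APIStatusError, APITimeoutError, RateLimitError

# Placeholder fallback chain: primary provider first, backup cloud second.
FALLBACK_CHAIN = [
    {"base_url": "https://api.primary.example/v1", "model": "claude-sonnet-4-6", "timeout": 30},
    {"base_url": "https://bedrock-proxy.example/v1", "model": "claude-sonnet-4-6", "timeout": 45},
]

def complete_with_fallback(messages, api_key):
    last_error = None
    for step in FALLBACK_CHAIN:
        client = OpenAI(api_key=api_key, base_url=step["base_url"], timeout=step["timeout"])
        try:
            return client.chat.completions.create(model=step["model"], messages=messages)
        except (RateLimitError, APITimeoutError) as err:
            last_error = err               # 429 or timeout: try the next step
        except APIStatusError as err:
            if err.status_code >= 500:     # provider degraded: fall through
                last_error = err
            else:
                raise                      # other 4xx is your bug; don't mask it
    raise last_error
```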
4. Caching — semantic and exact-match
Exact-match cache is straightforward: same prompt, same params, same response. Easy 10-30% cost savings on workloads with repeat queries. Semantic cache is the harder variant: embed the prompt, find similar past prompts within a similarity threshold, return the cached response. Adds 30-80% savings on chatbot workloads but introduces non-trivial risk (returning a stale answer to a slightly different question). Configurable thresholds and TTL are mandatory.
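Stripped of vendor specifics, the mechanics of both cache types look roughly like this. The 0.95 threshold is a placeholder you would calibrate against your own traffic, and the embedding vectors are assumed to come from whatever model you already use.

```python
import hashlib
import json
import math

def exact_cache_key(model: str, messages: list, params: dict) -> str:
    # Exact-match: identical model + prompt + params produce an identical key.
    payload = json.dumps({"model": model, "messages": messages, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_lookup(query_vec, cache, threshold=0.95):
    # Semantic match: return the closest cached answer, but only above a
    # configurable similarity threshold. Too loose returns wrong answers;
    # too tight and the cache never hits.
    best = max(cache, key=lambda entry: cosine(query_vec, entry["vec"]), default=None)
    if best and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None
```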
5. Request logging and tracing
Every request and response captured, with token counts, latency, and the model that served it. Plus parent/child spans for multi-step agents. This is where the line between "gateway" and "observability platform" blurs — and where the bundle matters. A pure gateway gives you a CSV; a bundled platform gives you a flame graph with eval scores.
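The shape of the data matters more than the vendor. A trace record for an agent run looks roughly like the sketch below; the field names are illustrative rather than any gateway's actual schema, and parent_id is what turns a flat request log into an agent-level trace.

```python
import time
import uuid
from dataclasses import dataclass, field

# Minimal illustrative trace-record shape; not a vendor schema.
@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    model: str | None = None
    prompt_tokens: int = 0
    completion_tokens: int = 0
    started_at: float = field(default_factory=time.time)
    latency_ms: float = 0.0

# One agent run (parent) containing one chat completion (child).
run = Span(name="agent_run", trace_id=uuid.uuid4().hex)
step = Span(name="chat.completion", trace_id=run.trace_id, parent_id=run.span_id,
            model="gpt-5-mini", prompt_tokens=812, completion_tokens=96, latency_ms=640)
```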
6. Self-hosted model support
The gateway must let you point at your own vLLM, Ollama, or any OpenAI-compatible endpoint as if it were a managed provider. Non-negotiable for any team running a fine-tuned model in production. The pattern is universal but the ergonomics vary wildly — some gateways treat self-hosted as a second-class citizen with missing features (no streaming, no caching).
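For reference, "OpenAI-compatible endpoint" means the same SDK call works against a local server, and that is exactly the shape a gateway wraps when you register a self-hosted model. The ports below are the usual vLLM and Ollama defaults; the model name is whatever your server registered, and the API keys are placeholders.

```python
from openai import OpenAI

# Same SDK, self-hosted servers. Adjust ports and keys to your deployment.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = vllm.chat.completions.create(
    model="my-finetuned-llama",  # whatever name your vLLM server registered
    messages=[{"role": "user", "content": "Hello"}],
)
```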
7. PII / policy enforcement at the edge
Redact or block requests containing PII, profanity, or prompt-injection patterns before they reach a third-party provider. Used to be enterprise-only; now table stakes for any team in healthcare, finance, or anything customer-facing in the EU. Kong and TrueFoundry lean on this heavily; Respan ships it in the gateway by default; LiteLLM and OpenRouter expect you to bring your own.
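To give a sense of what "enforcement at the edge" means, here is a deliberately bare-bones redaction pass of the kind a gateway policy runs before a request leaves your network. Production deployments use NER models and provider-grade detectors; regexes only catch the obvious patterns.

```python
import re

# Minimal pre-flight redaction sketch; real policies are far more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    hits = []
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f"[{label}_REDACTED]", text)
    return text, hits

clean, findings = redact("Reach me at jane@example.com, SSN 123-45-6789")
# clean    -> "Reach me at [EMAIL_REDACTED], SSN [SSN_REDACTED]"
# findings -> ["EMAIL", "SSN"]  (block the request, or forward the redacted text)
```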
The 8 LLM gateways compared (May 2026)
Every cell verified against the vendor's own documentation or product page on May 16, 2026. Where a claim is unmeasured ("lightning-fast" with no number) we mark it as not disclosed rather than guessing.
| Gateway | Models supported | Deployment | Latency overhead | Fallback | Self-host | Caching | Per-key budgets | Observability |
|---|---|---|---|---|---|---|---|---|
| Respan | 100+ | Cloud + Enterprise self-host | <50ms median | Yes | Yes | Semantic + exact | Yes | Built-in tracing + evals |
| LiteLLM | 100+ | OSS self-host + paid cloud | Not disclosed | Yes | Yes | Plugin | Virtual keys | Via integrations |
| Portkey | 1,600+ | OSS self-host + cloud | Claimed, no measurement | Yes | Yes | Yes | Yes | Built-in |
| OpenRouter | 300+ | Cloud only | Not disclosed | Yes | No | No | Credit-based | No |
| Helicone | 100+ | Cloud (gateway); OSS app | Not disclosed | Yes | No | Partial | Unified billing | Built-in |
| TrueFoundry | 250+ direct / 1,600+ unified | Cloud + on-prem + air-gapped | Sub-3ms internal (routing only) | Yes | Yes | Partial | Token quota | Yes |
| Kong AI Gateway | Multi-provider (count not published) | OSS Kong + cloud Konnect | Not disclosed | Via plugins | Yes | Semantic | Token quota | Via plugins |
| Braintrust | Not published | Cloud only | Not disclosed | Yes | No | Encrypted cache | Yes | Built-in evals |
Which gateway should you pick? A decision tree
The "best" gateway depends entirely on your team's deployment constraints and what else you need bundled. Pick the path that matches you.
- You want fully open-source + self-hosted with no vendor lock-in. Pick LiteLLM. The proxy is OSS, Docker-deployable, and has the deepest provider list among truly free options. The trade-off: you'll wire up your own observability, evals, and prompt management afterward.
- You want maximum model variety with zero infrastructure. Pick OpenRouter. 300+ models behind one credit-based account, no DevOps. The trade-off: you cannot self-host, you cannot do PII redaction, and you're paying per-token markup on every call.
- You need enterprise governance + on-prem. Pick TrueFoundry or Kong AI. Both ship air-gapped or VPC-deployable, both have mature PII and rate-limiting plugins. Kong is the choice if you already run Kong API Gateway elsewhere; TrueFoundry is the choice if you want LLM features deeper out of the box.
- You want OSS gateway + commercial polish. Pick Portkey. The gateway itself is open-source with 10K+ GitHub stars, and the managed cloud adds dashboards and enterprise auth on top.
- You want zero markup on provider pricing. Pick Helicone. Their 0% markup model passes provider prices straight through. The trade-off: gateway is cloud-only, and self-host is the OSS app, not the gateway.
- You want the full LLM ops loop in one product — gateway, observability, evals, and prompt management with shared auth and a single bill. Pick Respan or Braintrust. Braintrust leads with evals and adds a gateway; Respan leads with gateway + tracing and adds evals. Same fork in the road, opposite starting points.
Code: switch to a gateway in 60 seconds
Because every serious gateway is OpenAI-compatible, you change a base URL and an API key. The example uses Respan's gateway, but the same shape applies to LiteLLM, Portkey, OpenRouter, and Helicone — only the base URL and key change.
```python
from openai import OpenAI
import os

# Before — direct to OpenAI
client = OpenAI(api_key="sk-...")

# After — through Respan's gateway. Same SDK, no other code change.
client = OpenAI(
    api_key=os.environ["RESPAN_API_KEY"],  # virtual key, per-team budget enforced
    base_url="https://api.respan.ai/v1",
)

# Now you can switch model providers without touching this file:
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # or openai/gpt-5, gemini/pro, etc.
    messages=[{"role": "user", "content": "Hello"}],
)
```

That's the entire migration. Every request now passes through the gateway: logged, evaluated against your policies, optionally cached, and counted against your virtual key's budget. Switching models is a string change — no application redeploys.
How much latency does a gateway really add?
Every gateway claims it's fast. Numbers are scarce because the honest answer is "it depends on three things you can't read from a marketing page."
- Region. Cross-region adds 80–200ms. Gateway-in-us-east-1, OpenAI-in-us-east-1 — single-digit ms is realistic. Gateway-in-eu-west, OpenAI-in-us-east — your math is 80ms+ before the LLM even starts thinking.
- Synchronous logging. Some gateways block the response until the trace is written to durable storage. This adds 20-50ms per request. Better gateways write to a queue and return immediately.
- Cache check + policy. Semantic cache requires an embedding lookup (5-20ms). Policy evaluation adds 1-5ms per rule. Both are unavoidable but configurable.
Respan measures sub-50ms median overhead end-to-end against its primary US region (May 2026, internal load test). TrueFoundry claims "sub-3ms internal latency" but that's the routing layer only, not the full round-trip. Most vendors don't publish anything. If your application is TTFT-sensitive (voice, real-time chat), test before you commit.
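If you want your own number rather than a vendor's, a crude harness like the one below gives a usable median in a few minutes. The base URL, key, and model name are placeholders; run it from the region your application actually deploys in, and subtract the direct-to-provider median from the through-the-gateway median.

```python
import statistics
import time
from openai import OpenAI

# Rough overhead measurement on YOUR network path. Placeholders throughout.
def median_latency_ms(client: OpenAI, model: str, n: int = 20) -> float:
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # keep generation time near zero so transport dominates
        )
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

direct = OpenAI()  # provider default base_url, key from OPENAI_API_KEY
gateway = OpenAI(base_url="https://gateway.example/v1", api_key="virtual-key")
overhead_ms = median_latency_ms(gateway, "gpt-5-mini") - median_latency_ms(direct, "gpt-5-mini")
```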
Failure modes nobody mentions
A gateway is a single point of dependency for every LLM call your application makes. That's the deal. Three production failures to plan for before you deploy one.
Cascading retries
The gateway retries on the upstream provider. Your application retries on the gateway. The end user retries by clicking the button again. Three layers of retry compound into a thundering herd that takes down the gateway during a partial provider outage. Mitigation: configure explicit retry budgets on every layer (gateway: 2 retries, app: 0 retries, frontend: exponential backoff).
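The application-layer half of that mitigation is one line if you use the OpenAI Python SDK, which exposes a max_retries option on the client. The gateway-side retry count lives in gateway config and is not shown here; the base URL and key are placeholders.

```python
from openai import OpenAI

# Let the gateway own retries so the layers don't compound.
client = OpenAI(
    base_url="https://gateway.example/v1",
    api_key="virtual-key",
    max_retries=0,  # gateway already retries upstream; don't stack another layer
    timeout=60.0,   # fail fast enough that users don't hammer the button
)
```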
Semantic cache poisoning
Semantic cache returns a stored response when the new prompt's embedding is "close enough" to a past prompt. Set the threshold too loose and you start returning the wrong answer to subtly different questions. Set it too tight and the cache hits zero. Threshold calibration with a held-out eval set is mandatory before you ship this to production.
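Calibration does not need to be fancy. Sweep a handful of thresholds over a held-out set of (new query, cached query, should-it-hit) pairs built from real traffic and look at the false-hit rate. A sketch, assuming unit-normalized embeddings from your own model:

```python
# Held-out pairs: (new_query_vec, cached_vec, should_hit). Build these from
# real production traffic; the threshold grid below is just a starting point.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sweep(pairs, thresholds=(0.90, 0.93, 0.95, 0.97, 0.99)):
    for t in thresholds:
        false_hits = sum(1 for q, c, ok in pairs if dot(q, c) >= t and not ok)
        missed = sum(1 for q, c, ok in pairs if dot(q, c) < t and ok)
        print(f"threshold={t:.2f}  false_hits={false_hits}  missed_hits={missed}")

# pairs = load_heldout_pairs()   # hypothetical loader for your eval set
# sweep(pairs)
```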
Streaming + tool calls
Streaming responses with tool calls are still the messiest part of the OpenAI-compatible spec. Different providers emit different partial-tool-call chunk formats. Gateways translate them imperfectly. If your product uses streamed tool calls, run a 24-hour soak test against your candidate gateway before committing — this is where the weird intermittent bugs live.
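What makes this messy is that tool-call arguments arrive as partial JSON fragments spread across stream chunks, and the accumulation logic is exactly where cross-provider translation bugs surface. The sketch below uses the OpenAI Python SDK's streaming delta format; the gateway URL, key, and model are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example/v1", api_key="virtual-key")

stream = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
        },
    }],
    stream=True,
)

# Accumulate partial tool-call deltas by index; arguments arrive in fragments.
calls: dict[int, dict] = {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments
# calls -> {0: {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
```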
See Respan's gateway in 5 minutes
One OpenAI-compatible endpoint, 100+ models, automatic failover, semantic caching, per-key budgets — and full trace + eval + prompt management built in. Free to try, no credit card.
Migrating between gateways without rewriting your app
Because every gateway exposes OpenAI-compatible endpoints, the SDK side of the migration is trivial — change base URL, change API key. The hard parts:
- Trace history is not portable. Your six months of captured prompts on Gateway A don't move to Gateway B. Plan a one-time export of trace data into your warehouse if you need to retain it.
- Eval datasets curated from production traces are gateway-specific. You'll need to rebuild them — or, if you've been versioning them in a system outside the gateway (you should be), this is painless.
- Prompt versions and experiments need to be exported or recreated. Most gateways expose this via API; budget half a day per ten prompts.
- Dual-run for a week. Send 10% of traffic to the new gateway, watch error rates and eval scores. If they match, flip the ratio incrementally.
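A deterministic way to do that split is to hash a stable user or request ID into a bucket, so the same user always hits the same gateway and the two cohorts stay comparable. The gateway URLs below are placeholders.

```python
import hashlib

# Deterministic 10% split for the dual-run period.
def pick_gateway(user_id: str, new_gateway_pct: int = 10) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return ("https://new-gateway.example/v1" if bucket < new_gateway_pct
            else "https://old-gateway.example/v1")

base_url = pick_gateway("user_42")  # pass to OpenAI(base_url=..., api_key=...)
```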
Production checklist before shipping a gateway
- Configure at least one fallback model per primary, with a different provider where possible.
- Set retry budgets on the gateway AND on the application — never rely on a single layer to handle backoff.
- Issue virtual keys per environment (dev/staging/prod) and per team, each with a monthly budget cap.
- Enable async (non-blocking) request logging so traces don't add latency to your hot path.
- If using semantic cache, run a 1,000-request eval against your actual traffic with the threshold you intend to use, before enabling it for users.
- Test streaming + tool calls against your top two providers for at least 24 hours of real traffic. Streaming bugs hide.
- Wire trace export from the gateway into your existing observability stack (Datadog, Honeycomb, etc.) so the gateway is not a silo.
- Document the on-call procedure for "the gateway itself is down." You need it.
Related comparisons
Want to dig into specific tools? Each of these is a head-to-head with Respan plus integration notes.
- LiteLLM alternatives — when the OSS proxy stops scaling.
- Portkey alternatives — Portkey vs the commercial managed gateways.
- OpenRouter alternatives — when you outgrow credit-based routing.
- Helicone alternatives — Helicone vs the dedicated observability platforms.
- LLM Observability: the complete guide — the next layer of the stack after the gateway.