TL;DR
An LLM gateway is a transparent proxy that gives your application one OpenAI-compatible endpoint for many model providers, plus authentication, budgets, caching, fallback, and observability. The eight serious choices in 2026 are LiteLLM, Portkey, OpenRouter, Helicone, TrueFoundry, Kong AI, Braintrust, and Respan. They diverge on self-host story, observability depth, and what governance ships out of the box. Comparison table and decision tree below.
What an LLM gateway actually does
Strip away the marketing and an LLM gateway is doing four jobs at the same time. It accepts a request in one format (the OpenAI chat completions format won that race), routes it to the right model provider, applies a set of policies on the way in and out (auth, rate limits, budgets, PII redaction, caching), and writes a trace so somebody can debug what happened later. That's it.
The interesting variation between gateways is which of those jobs they do well, which they barely do, and what they bundle on top — observability, evals, prompt management, sometimes a full agent runtime. The hard question is not "what is a gateway?"; it's which of those bundled jobs your team actually needs.
Why you start needing one
Teams hit the gateway-shaped problem at roughly the same five milestones. If you recognize three or more of these, you've already outgrown direct provider SDK calls.
- The second provider. The day you add Anthropic next to OpenAI is the day your app starts caring about request schemas, retry behavior, streaming differences, and tool-call formats that don't match. Either your app handles all that, or the gateway does.
- The second team. Marketing wants Claude Sonnet for copy. The backend team uses GPT-5 mini for classification. Both have separate API keys with no shared budget. Cost goes up linearly with team count and nobody can see who is spending what.
- The first outage. OpenAI degrades for 40 minutes. Your customer-support agent stops working. You have no failover plan and 200 angry support tickets. The gateway is where the fallback rule lives.
- The CFO question. "Why did our LLM bill go from $12k to $48k last month?" Without per-key, per-team, per-customer attribution, you cannot answer this. Most provider dashboards show you the bill but not the breakdown.
- The compliance ask. Legal wants every prompt and response that involved a customer to be logged in your system, not a third party. SOC 2, HIPAA, GDPR. The gateway is the choke point where that logging happens — and the only realistic place to enforce PII redaction before data leaves your network.

The thing the "LLM gateway" category gets wrong is treating it as a pure transport layer — yet another API proxy. In 2026 the actual unit of work is no longer the chat completion; it's the agent run with twelve chat completions, eight tool calls, and three sub-agent recursions. The gateway has to be aware of that, because the interesting decisions — fallback, budget enforcement, eval replay — happen at the agent level, not the call level.
Which is why the gateway-as-standalone-product story is dying. Most teams will end up wanting one tool that handles gateway, tracing, evals, and prompt management together. The four-tool stitched stack is how bugs hide.
Teams running LLM apps in production with Respan
Tired of stitching LiteLLM + LangSmith + Promptfoo?
Respan ships gateway, tracing, evals, and prompt management as one product with one auth and one bill. ~35% average cost reduction via routing and caching.
Try Respan free
The seven things a production gateway must do
Most posts on this topic list "features" without telling you which ones are table stakes and which are nice-to-have. Here's the practical cut.
1. Unified OpenAI-compatible endpoint
Every gateway worth evaluating accepts /v1/chat/completions in OpenAI format and translates per provider. This is the price of admission — if a tool doesn't do this, it's not a gateway. Watch for edge cases around streaming, tool calls (function calling), and structured outputs; the translation is rougher than vendors admit.
2. Per-key budgets and rate limits
Virtual API keys that you can issue per developer, per environment, or per customer — each with its own monthly budget and rate limit. Without this, you cannot do shadow-IT cleanup or cost attribution. LiteLLM and Portkey have mature implementations; OpenRouter handles this through credit ledgers; Helicone exposes it via unified billing.
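To make that concrete, here is a sketch of issuing a budget-capped virtual key through a gateway admin API. The endpoint path and field names follow LiteLLM's /key/generate convention, but treat them as illustrative and check your gateway's own admin API docs before copying.

```python
import os
import requests

# Illustrative sketch: mint a scoped virtual key with a monthly budget.
# Endpoint path and field names follow LiteLLM's /key/generate convention;
# your gateway's admin API may differ.
ADMIN_URL = os.environ.get("GATEWAY_ADMIN_URL", "http://localhost:4000")
ADMIN_KEY = os.environ["GATEWAY_ADMIN_KEY"]

resp = requests.post(
    f"{ADMIN_URL}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={
        "key_alias": "marketing-prod",              # who this key belongs to
        "models": ["claude-sonnet", "gpt-5-mini"],  # allow-list of models
        "max_budget": 500.0,                        # USD cap before requests are rejected
        "budget_duration": "30d",                   # budget resets monthly
        "rpm_limit": 60,                            # requests per minute
    },
    timeout=10,
)
resp.raise_for_status()
virtual_key = resp.json()["key"]  # hand this to the team, never a raw provider key
```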
3. Automatic provider fallback
When OpenAI returns a 503 or your request hits a rate limit, the gateway retries on a configured backup model. Most gateways do this; the difference is how granular the rules are. Can you fall back to a different provider entirely? A different model within the same provider? Different region? Different timeout per fallback step? Pay attention to the configuration surface, not the marketing checkbox.
Anthropic 503 → Bedrock takes over. Your app sees a successful 200. Same model, different cloud. The detour adds a small amount of latency only in the failure case.
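For intuition, this is roughly the loop the gateway runs on your behalf; in practice the rule lives in gateway config, not application code. The base URLs below are placeholders and the model names are examples.

```python
from openai import OpenAI, APIStatusError, APITimeoutError, RateLimitError

# Placeholder fallback chain: primary provider first, backup cloud second.
FALLBACK_CHAIN = [
    {"base_url": "https://api.primary.example/v1", "model": "claude-sonnet-4-6", "timeout": 30},
    {"base_url": "https://bedrock-proxy.example/v1", "model": "claude-sonnet-4-6", "timeout": 45},
]

def complete_with_fallback(messages, api_key):
    last_error = None
    for step in FALLBACK_CHAIN:
        client = OpenAI(api_key=api_key, base_url=step["base_url"], timeout=step["timeout"])
        try:
            return client.chat.completions.create(model=step["model"], messages=messages)
        except (RateLimitError, APITimeoutError) as err:
            last_error = err               # 429 or timeout: try the next step
        except APIStatusError as err:
            if err.status_code >= 500:     # provider degraded: fall through
                last_error = err
            else:
                raise                      # other 4xx is your bug; don't mask it
    raise last_error
```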
4. Caching — semantic and exact-match
Exact-match cache is straightforward: same prompt, same params, same response. Easy 10-30% cost savings on workloads with repeat queries. Semantic cache is the harder variant: embed the prompt, find similar past prompts within a similarity threshold, return the cached response. Adds 30-80% savings on chatbot workloads but introduces non-trivial risk (returning a stale answer to a slightly different question). Configurable thresholds and TTL are mandatory.
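Stripped of vendor specifics, the mechanics of both cache types look roughly like this. The 0.95 threshold is a placeholder you would calibrate against your own traffic, and the embedding vectors are assumed to come from whatever model you already use.

```python
import hashlib
import json
import math

def exact_cache_key(model: str, messages: list, params: dict) -> str:
    # Exact-match: identical model + prompt + params produce an identical key.
    payload = json.dumps({"model": model, "messages": messages, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_lookup(query_vec, cache, threshold=0.95):
    # Semantic match: return the closest cached answer, but only above a
    # configurable similarity threshold. Too loose returns wrong answers;
    # too tight and the cache never hits.
    best = max(cache, key=lambda entry: cosine(query_vec, entry["vec"]), default=None)
    if best and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None
```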
5. Request logging and tracing
Every request and response captured, with token counts, latency, and the model that served it. Plus parent/child spans for multi-step agents. This is where the line between "gateway" and "observability platform" blurs — and where the bundle matters. A pure gateway gives you a CSV; a bundled platform gives you a flame graph with eval scores.
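The shape of the data matters more than the vendor. A trace record for an agent run looks roughly like the sketch below; the field names are illustrative rather than any gateway's actual schema, and parent_id is what turns a flat request log into an agent-level trace.

```python
import time
import uuid
from dataclasses import dataclass, field

# Minimal illustrative trace-record shape; not a vendor schema.
@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    model: str | None = None
    prompt_tokens: int = 0
    completion_tokens: int = 0
    started_at: float = field(default_factory=time.time)
    latency_ms: float = 0.0

# One agent run (parent) containing one chat completion (child).
run = Span(name="agent_run", trace_id=uuid.uuid4().hex)
step = Span(name="chat.completion", trace_id=run.trace_id, parent_id=run.span_id,
            model="gpt-5-mini", prompt_tokens=812, completion_tokens=96, latency_ms=640)
```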
6. Self-hosted model support
The gateway must let you point at your own vLLM, Ollama, or any OpenAI-compatible endpoint as if it were a managed provider. Non-negotiable for any team running a fine-tuned model in production. The pattern is universal but the ergonomics vary wildly — some gateways treat self-hosted as a second-class citizen with missing features (no streaming, no caching).
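For reference, "OpenAI-compatible endpoint" means the same SDK call works against a local server, and that is exactly the shape a gateway wraps when you register a self-hosted model. The ports below are the usual vLLM and Ollama defaults; the model name is whatever your server registered, and the API keys are placeholders.

```python
from openai import OpenAI

# Same SDK, self-hosted servers. Adjust ports and keys to your deployment.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = vllm.chat.completions.create(
    model="my-finetuned-llama",  # whatever name your vLLM server registered
    messages=[{"role": "user", "content": "Hello"}],
)
```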
7. PII / policy enforcement at the edge
Redact or block requests containing PII, profanity, or prompt-injection patterns before they reach a third-party provider. Used to be enterprise-only; now table stakes for any team in healthcare, finance, or anything customer-facing in the EU. Kong and TrueFoundry lean on this heavily; Respan ships it in the gateway by default; LiteLLM and OpenRouter expect you to bring your own.
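To give a sense of what "enforcement at the edge" means, here is a deliberately bare-bones redaction pass of the kind a gateway policy runs before a request leaves your network. Production deployments use NER models and provider-grade detectors; regexes only catch the obvious patterns.

```python
import re

# Minimal pre-flight redaction sketch; real policies are far more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    hits = []
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f"[{label}_REDACTED]", text)
    return text, hits

clean, findings = redact("Reach me at jane@example.com, SSN 123-45-6789")
# clean    -> "Reach me at [EMAIL_REDACTED], SSN [SSN_REDACTED]"
# findings -> ["EMAIL", "SSN"]  (block the request, or forward the redacted text)
```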
The 8 LLM gateways compared (May 2026)
Every cell verified against the vendor's own documentation or product page on May 16, 2026. Where a claim is unmeasured ("lightning-fast" with no number) we mark it as not disclosed rather than guessing.
| Gateway | Models supported | Deployment | Latency overhead | Fallback | Self-host | Caching | Per-key budgets | Observability |
|---|---|---|---|---|---|---|---|---|
| Respan | 100+ | Cloud + Enterprise self-host | <50ms median | Yes | Yes | Semantic + exact | Yes | Built-in tracing + evals |
| LiteLLM | 100+ | OSS self-host + paid cloud | Not disclosed | Yes | Yes | Plugin | Virtual keys | Via integrations |
| Portkey | 1,600+ | OSS self-host + cloud | Claimed, no measurement | Yes | Yes | Yes | Yes | Built-in |
| OpenRouter | 300+ | Cloud only | Not disclosed | Yes | No | No | Credit-based | No |
| Helicone | 100+ | Cloud (gateway); OSS app | Not disclosed | Yes | No | Partial | Unified billing | Built-in |
| TrueFoundry | 250+ direct / 1,600+ unified | Cloud + on-prem + air-gapped | Sub-3ms internal (routing only) | Yes | Yes | Partial | Token quota | Yes |
| Kong AI Gateway | Multi-provider (count not published) | OSS Kong + cloud Konnect | Not disclosed | Via plugins | Yes | Semantic | Token quota | Via plugins |
| Braintrust | Not published | Cloud only | Not disclosed | Yes | No | Encrypted cache | Yes | Built-in evals |
Which gateway should you pick? A decision tree
The "best" gateway depends entirely on your team's deployment constraints and what else you need bundled. Pick the path that matches you.
- You want fully open-source + self-hosted with no vendor lock-in. Pick LiteLLM. The proxy is OSS, Docker-deployable, and has the deepest provider list among truly free options. The trade-off: you'll wire up your own observability, evals, and prompt management afterward.
- You want maximum model variety with zero infrastructure. Pick OpenRouter. 300+ models behind one credit-based account, no DevOps. The trade-off: you cannot self-host, you cannot do PII redaction, and you're paying per-token markup on every call.
- You need enterprise governance + on-prem. Pick TrueFoundry or Kong AI. Both ship air-gapped or VPC-deployable, both have mature PII and rate-limiting plugins. Kong is the choice if you already run Kong API Gateway elsewhere; TrueFoundry is the choice if you want LLM features deeper out of the box.
- You want OSS gateway + commercial polish. Pick Portkey. The gateway itself is open-source with 10K+ GitHub stars, and the managed cloud adds dashboards and enterprise auth on top.
- You want zero markup on provider pricing. Pick Helicone. Their 0% markup model passes provider prices straight through. The trade-off: gateway is cloud-only, and self-host is the OSS app, not the gateway.
- You want the full LLM ops loop in one product — gateway, observability, evals, and prompt management with shared auth and a single bill. Pick Respan or Braintrust. Braintrust leads with evals and adds a gateway; Respan leads with gateway + tracing and adds evals. Same fork in the road, opposite starting points.
Code: switch to a gateway in 60 seconds
Because every serious gateway is OpenAI-compatible, you change a base URL and an API key. The example uses Respan's gateway, but the same shape applies to LiteLLM, Portkey, OpenRouter, and Helicone — only the base URL and key change.
```python
from openai import OpenAI
import os

# Before — direct to OpenAI
client = OpenAI(api_key="sk-...")

# After — through Respan's gateway. Same SDK, no other code change.
client = OpenAI(
    api_key=os.environ["RESPAN_API_KEY"],  # virtual key, per-team budget enforced
    base_url="https://api.respan.ai/v1",
)

# Now you can switch model providers without touching this file:
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # or openai/gpt-5, gemini/pro, etc.
    messages=[{"role": "user", "content": "Hello"}],
)
```

That's the entire migration. Every request now passes through the gateway: logged, evaluated against your policies, optionally cached, and counted against your virtual key's budget. Switching models is a string change — no application redeploys.
How much latency does a gateway really add?
Every gateway claims it's fast. Numbers are scarce because the honest answer is "it depends on three things you can't read from a marketing page."
- Region. Cross-region adds 80–200ms. Gateway-in-us-east-1, OpenAI-in-us-east-1 — single-digit ms is realistic. Gateway-in-eu-west, OpenAI-in-us-east — your math is 80ms+ before the LLM even starts thinking.
- Synchronous logging. Some gateways block the response until the trace is written to durable storage. This adds 20-50ms per request. Better gateways write to a queue and return immediately.
- Cache check + policy. Semantic cache requires an embedding lookup (5-20ms). Policy evaluation adds 1-5ms per rule. Both are unavoidable but configurable.
Respan measures sub-50ms median overhead end-to-end against its primary US region (May 2026, internal load test). TrueFoundry claims "sub-3ms internal latency" but that's the routing layer only, not the full round-trip. Most vendors don't publish anything. If your application is TTFT-sensitive (voice, real-time chat), test before you commit.
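If you want your own number rather than a vendor's, a crude harness like the one below gives a usable median in a few minutes. The base URL, key, and model name are placeholders; run it from the region your application actually deploys in, and subtract the direct-to-provider median from the through-the-gateway median.

```python
import statistics
import time
from openai import OpenAI

# Rough overhead measurement on YOUR network path. Placeholders throughout.
def median_latency_ms(client: OpenAI, model: str, n: int = 20) -> float:
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # keep generation time near zero so transport dominates
        )
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

direct = OpenAI()  # provider default base_url, key from OPENAI_API_KEY
gateway = OpenAI(base_url="https://gateway.example/v1", api_key="virtual-key")
overhead_ms = median_latency_ms(gateway, "gpt-5-mini") - median_latency_ms(direct, "gpt-5-mini")
```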
Failure modes nobody mentions
A gateway is a single point of dependency for every LLM call your application makes. That's the deal. Three production failures to plan for before you deploy one.
Cascading retries
The gateway retries on the upstream provider. Your application retries on the gateway. The end user retries by clicking the button again. Three layers of retry compound into a thundering herd that takes down the gateway during a partial provider outage. Mitigation: configure explicit retry budgets on every layer (gateway: 2 retries, app: 0 retries, frontend: exponential backoff).
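The application-layer half of that mitigation is one line if you use the OpenAI Python SDK, which exposes a max_retries option on the client. The gateway-side retry count lives in gateway config and is not shown here; the base URL and key are placeholders.

```python
from openai import OpenAI

# Let the gateway own retries so the layers don't compound.
client = OpenAI(
    base_url="https://gateway.example/v1",
    api_key="virtual-key",
    max_retries=0,  # gateway already retries upstream; don't stack another layer
    timeout=60.0,   # fail fast enough that users don't hammer the button
)
```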
Semantic cache poisoning
Semantic cache returns a stored response when the new prompt's embedding is "close enough" to a past prompt. Set the threshold too loose and you start returning the wrong answer to subtly different questions. Set it too tight and the cache hits zero. Threshold calibration with a held-out eval set is mandatory before you ship this to production.
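Calibration does not need to be fancy. Sweep a handful of thresholds over a held-out set of (new query, cached query, should-it-hit) pairs built from real traffic and look at the false-hit rate. A sketch, assuming unit-normalized embeddings from your own model:

```python
# Held-out pairs: (new_query_vec, cached_vec, should_hit). Build these from
# real production traffic; the threshold grid below is just a starting point.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sweep(pairs, thresholds=(0.90, 0.93, 0.95, 0.97, 0.99)):
    for t in thresholds:
        false_hits = sum(1 for q, c, ok in pairs if dot(q, c) >= t and not ok)
        missed = sum(1 for q, c, ok in pairs if dot(q, c) < t and ok)
        print(f"threshold={t:.2f}  false_hits={false_hits}  missed_hits={missed}")

# pairs = load_heldout_pairs()   # hypothetical loader for your eval set
# sweep(pairs)
```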
Streaming + tool calls
Streaming responses with tool calls are still the messiest part of the OpenAI-compatible spec. Different providers emit different partial-tool-call chunk formats. Gateways translate them imperfectly. If your product uses streamed tool calls, run a 24-hour soak test against your candidate gateway before committing — this is where the weird intermittent bugs live.
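What makes this messy is that tool-call arguments arrive as partial JSON fragments spread across stream chunks, and the accumulation logic is exactly where cross-provider translation bugs surface. The sketch below uses the OpenAI Python SDK's streaming delta format; the gateway URL, key, and model are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example/v1", api_key="virtual-key")

stream = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
        },
    }],
    stream=True,
)

# Accumulate partial tool-call deltas by index; arguments arrive in fragments.
calls: dict[int, dict] = {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments
# calls -> {0: {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
```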
See Respan's gateway in 5 minutes
One OpenAI-compatible endpoint, 100+ models, automatic failover, semantic caching, per-key budgets — and full trace + eval + prompt management built in. Free to try, no credit card.
Migrating between gateways without rewriting your app
Because every gateway exposes OpenAI-compatible endpoints, the SDK side of the migration is trivial — change base URL, change API key. The hard parts:
- Trace history is not portable. Your six months of captured prompts on Gateway A don't move to Gateway B. Plan a one-time export of trace data into your warehouse if you need to retain it.
- Eval datasets curated from production traces are gateway-specific. You'll need to rebuild them — or, if you've been versioning them in a system outside the gateway (you should be), this is painless.
- Prompt versions and experiments need to be exported or recreated. Most gateways expose this via API; budget half a day per ten prompts.
- Dual-run for a week. Send 10% of traffic to the new gateway, watch error rates and eval scores. If they match, flip the ratio incrementally.
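A deterministic way to do that split is to hash a stable user or request ID into a bucket, so the same user always hits the same gateway and the two cohorts stay comparable. The gateway URLs below are placeholders.

```python
import hashlib

# Deterministic 10% split for the dual-run period.
def pick_gateway(user_id: str, new_gateway_pct: int = 10) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return ("https://new-gateway.example/v1" if bucket < new_gateway_pct
            else "https://old-gateway.example/v1")

base_url = pick_gateway("user_42")  # pass to OpenAI(base_url=..., api_key=...)
```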
Production checklist before shipping a gateway
- Configure at least one fallback model per primary, with a different provider where possible.
- Set retry budgets on the gateway AND on the application — never rely on a single layer to handle backoff.
- Issue virtual keys per environment (dev/staging/prod) and per team, each with a monthly budget cap.
- Enable async (non-blocking) request logging so traces don't add latency to your hot path.
- If using semantic cache, run a 1,000-request eval against your actual traffic with the threshold you intend to use, before enabling it for users.
- Test streaming + tool calls against your top two providers for at least 24 hours of real traffic. Streaming bugs hide.
- Wire trace export from the gateway into your existing observability stack (Datadog, Honeycomb, etc.) so the gateway is not a silo.
- Document the on-call procedure for "the gateway itself is down." You need it.
Related comparisons
Want to dig into specific tools? Each of these is a head-to-head with Respan plus integration notes.
- LiteLLM alternatives — when the OSS proxy stops scaling.
- Portkey alternatives — Portkey vs the commercial managed gateways.
- OpenRouter alternatives — when you outgrow credit-based routing.
- Helicone alternatives — Helicone vs the dedicated observability platforms.
- LLM Observability: the complete guide — the next layer of the stack after the gateway.