TL;DR
An LLM gateway is a unified, OpenAI-compatible proxy in front of every model provider — OpenAI, Anthropic, Google, Bedrock, Azure, open-source endpoints. It handles routing, provider fallback, caching, rate limiting, cost guardrails, and observability so your application code stays simple. Most teams realize they need one around month 6 of production; the smart ones adopt it on day one.
What is an LLM gateway?
An LLM gateway is a single API that sits between your application and every LLM provider you might call — OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, Mistral, fine-tuned and open-source endpoints. Your app code makes one request in OpenAI-compatible format; the gateway dispatches to the chosen provider, handles retries and fallback, applies caching and rate limits, attaches observability, and returns the response.
The pattern borrows from API gateways in microservices (Kong, Apigee, Tyk) but specializes for LLM workloads: streaming, token accounting, model-aware routing, and the eccentricities of each provider's API.
Why teams adopt a gateway
Five forcing functions, in roughly the order teams hit them:
- A provider has an outage. Anthropic returns 503s for 20 minutes. Without fallback, your customer support agent is dead. With a gateway, you fail over to Bedrock Sonnet and users notice nothing.
- Cost spikes mysteriously. Some feature or some user is consuming 10× the average. Without per-call attribution and budget caps, you find out at month-end. Gateways enforce per-feature, per-user budgets in real time.
- You want to test a new model. A new Claude variant ships. Without abstraction, comparing it to your current model means changing app code. With a gateway, it's a config change and a 5% traffic split.
- Rate limits become a tax. Provider rate limits are per-account, not per-feature. The gateway lets you shape traffic so no single feature starves the others.
- Compliance asks where prompts are going. A gateway is the natural choke point for redaction, audit logs, and regional routing.

The case for a gateway is risk management more than abstraction. Provider downtime is the silent killer — Anthropic 503s in a customer support agent destroy user experience without even paging on-call, because your app technically returns 200s through retries until it finally gives up. The right time to add a gateway is before the outage, not after.
The teams I've watched ship without one all hit the same wall: a 30-minute provider degradation that costs them more than three years of gateway licensing. If you're past month 6 in production and still calling providers directly, you're shipping single-provider lock-in to your customers, and they pay for it the next time the model burps.
Core capabilities
Unified API (OpenAI-compatible)
Every modern gateway exposes the OpenAI Chat Completions API as the lingua franca. Your app code uses openai-python or openai-node pointed at the gateway base URL; the gateway translates to each provider's native API on the back end. This is the abstraction that pays back every time you swap models.
Provider fallback
On a primary-provider error (5xx, timeout, rate limit), the gateway retries the same request against a configured secondary. The flow:
Anthropic returns a 503 → Bedrock takes over → your app still sees a 200. The retry adds latency only on the failed request; cache hits on the happy path still return in single-digit milliseconds.
Common pairings:
- Anthropic Sonnet → Bedrock Sonnet (same model, different cloud)
- OpenAI GPT-4o → Azure OpenAI GPT-4o
- Google Gemini Pro → Vertex AI Gemini Pro
The fallback should be transparent to your app — same response shape, same streaming behavior. If your gateway requires app code changes for fallback, it's not really a gateway.
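To make the contrast concrete, here is a minimal sketch of the manual fallback loop a gateway absorbs into config. The endpoints, model names, environment variable, and error handling are illustrative assumptions, not how Respan implements it.
# Illustrative sketch only: the manual fallback loop a gateway replaces.
# Endpoints, model names, and env var are hypothetical placeholders.
import os
from openai import OpenAI, APIConnectionError, APIStatusError

PROVIDERS = [
    {"base_url": "https://primary-provider.example/v1", "model": "claude-3-5-sonnet"},
    {"base_url": "https://fallback-provider.example/v1", "model": "anthropic.claude-3-5-sonnet"},
]

def complete_with_fallback(messages):
    last_error = None
    for provider in PROVIDERS:
        client = OpenAI(base_url=provider["base_url"], api_key=os.environ["PROVIDER_API_KEY"])
        try:
            return client.chat.completions.create(model=provider["model"], messages=messages)
        except (APIStatusError, APIConnectionError) as exc:
            last_error = exc  # 5xx, rate limit, or timeout: move on to the next provider
    raise last_error
With a gateway, that whole loop collapses to a fallback entry in config (or a header, as shown in the instrumentation example below).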
Caching
Two flavors:
- Exact-match cache: hash the request body, return the prior response (sketched after this list). Safe and high-hit for deterministic settings (temperature 0, fixed seed).
- Semantic cache: embed the prompt, match by similarity threshold. Higher hit rate, but the false-positive case is shipping a stale answer to a slightly-different question. Enable carefully.
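For the exact-match flavor, the key is just a hash of the canonicalized request body. A rough sketch, assuming an in-memory dict where a real gateway would use a shared store:
import hashlib
import json

_cache = {}  # in-memory stand-in for the gateway's shared cache store

def cache_key(request_body: dict) -> str:
    # Canonicalize so identical requests hash identically regardless of key order.
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_completion(client, request_body: dict):
    key = cache_key(request_body)
    if key in _cache:
        return _cache[key]                    # exact-match hit: no provider call
    response = client.chat.completions.create(**request_body)
    if request_body.get("temperature") == 0:  # only cache deterministic settings
        _cache[key] = response
    return response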
Rate limiting and budgets
Provider rate limits are blunt. A gateway lets you set finer-grained budgets: per user, per feature, per workspace, per minute, per day, per dollar. Shaped well, a free-tier feature can throttle when it exceeds its budget without starving your premium feature.
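Conceptually, a per-feature dollar budget is just a counter checked before every call. A minimal sketch with made-up feature names and limits (not Respan's config schema):
from collections import defaultdict

# Hypothetical daily budgets in dollars per feature; a real gateway keeps these in config.
DAILY_BUDGET_USD = {"support_agent": 50.0, "free_tier_summaries": 5.0}
_spend_today = defaultdict(float)

def check_and_record(feature: str, estimated_cost_usd: float) -> bool:
    """Return True if the call is allowed under the feature's daily budget."""
    budget = DAILY_BUDGET_USD.get(feature, 0.0)
    if _spend_today[feature] + estimated_cost_usd > budget:
        return False  # over budget: the free-tier feature throttles, premium keeps working
    _spend_today[feature] += estimated_cost_usd
    return True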
Observability + evals integration
Standalone gateways send you logs, but the value compounds when the gateway is part of a platform that also handles tracing, observability, and evals. Every model call becomes a span in the trace, scored by your evaluators, attributed to its prompt version. Respan's gateway is built around this.
How to instrument
OpenAI-compatible drop-in. The total app-code change to route through Respan's gateway:
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Now call any of 500+ models with the same client
client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "..."}],
    extra_headers={
        "x-respan-feature": "support_agent",
        "x-respan-fallback": "bedrock/anthropic.claude-3-5-sonnet",
    },
)

Switching to GPT-4o is a one-line change to the model string. Adding fallback to Azure is a header. Routing 5% of traffic to a fine-tune for an A/B test is a config change.
What fallback actually saves you
Provider outages happen on a regular cadence — every major LLM API has had multi-minute regional capacity events in the last twelve months. With a gateway and configured fallback, your customer-support agent stays on Bedrock Sonnet while Anthropic Sonnet is returning 5xx, your app sees zero failed requests, and the incident shows up in your post-mortem only because someone noticed the spike in fallback rate on a dashboard.
The user impact in those windows is usually a small bump in median latency from the retry hop and nothing else. Total engineer time spent reacting: zero. Total customer impact: invisible. That's the case for the gateway in one sentence.
The teams that don't have a gateway live the inverse story every quarter. The math doesn't get better as you add providers; the entire industry's effective uptime becomes whichever provider you've coupled to.
Common gateway mistakes
- No fallback configured. The gateway is half a gateway. Fix on day one.
- Semantic cache enabled by default. The wrong cache hit ships a stale or wrong answer. Start with exact-match only; turn on semantic only after you understand your false-positive tolerance.
- Treating the gateway as the place to do model routing logic. Routing rules ("if query is short, use cheap model") belong in your app or a separate router service — not buried in gateway config nobody reads.
- Forgetting cost guardrails. The first runaway agent that makes a recursive tool call drains the monthly budget in 2 hours.
- Single-tenant gateway shared across teams. Each team's bursty traffic affects the others. Use workspaces or rate-limit per logical tenant.
LLM gateways compared
Six options most teams evaluate; details current as of May 2026.
| Tool | Models supported | OpenAI-compatible | Provider fallback | Caching | Rate limiting | Cost guardrails | Self-host | Observability |
|---|---|---|---|---|---|---|---|---|
| Respan Gateway | 500+ | Yes | Yes | Yes | Yes | Yes | Yes (Enterprise) | Built-in |
| OpenRouter | 300+ | Yes | No | No | Yes | No | No | Standalone |
| LiteLLM | 100+ | Yes | Yes | Yes | Yes | Partial | Yes (OSS) | Standalone |
| Portkey | 250+ | Yes | Yes | Yes | Yes | Yes | Enterprise only | Built-in |
| Cloudflare AI Gateway | 50+ | Partial | Yes | Yes | Yes | Yes | No | Basic |
| Helicone | 100+ | Yes | Yes | Yes | Yes | No | Yes (OSS) | Built-in |