An LLM gateway is a unified, OpenAI-compatible proxy in front of every model provider — OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, open-source endpoints. Your application makes one API call; the gateway handles routing, provider fallback, caching, rate limiting, cost guardrails, and observability so your application code stays simple.
For the comprehensive treatment, see our LLM Gateway pillar guide. This is the short version.
TL;DR
A gateway sits between your application and LLM providers. It does five things:
- Unified API — one OpenAI-compatible interface across hundreds of models
- Provider fallback — when one provider has an outage, automatically retry on another
- Caching — exact-match and/or semantic, dramatically cuts cost on repeated queries
- Rate limiting and budgets — per user, per feature, per dollar
- Observability — every call captured with cost, latency, model, attribution
If you're running production LLM workloads past month 6, you need one.
Why teams adopt a gateway
Five forcing functions, in order:
- Provider outage. Anthropic returns 503s for 25 minutes. Without fallback, your support agent is dead. With a gateway, you fail over to Bedrock and users notice nothing.
- Cost spike. Some feature or user is consuming 10× the average. Without per-call attribution, you find out at month-end.
- Model evaluation. You want to test Claude vs GPT for a feature. Without a gateway, comparing means changing app code. With a gateway, it's a config change and a 5% traffic split.
- Rate limits become a tax. Provider limits are per-account. The gateway shapes traffic so no single feature starves the others.
- Compliance asks where prompts go. A gateway is the natural choke point for redaction, audit logs, regional routing.
Core capabilities
Unified API
Every modern gateway exposes the OpenAI Chat Completions API as the lingua franca. Your app uses openai-python or openai-node pointed at the gateway base URL. The gateway translates to each provider's native API on the back end.
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Now call any of 500+ models with the same client
client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[...],
)
```
Switching to GPT-5.5 is a one-line change to the model string. Adding fallback to Azure is a header.
Provider fallback
On 5xx, timeout, or rate limit from the primary, the gateway retries against a configured secondary — usually the same model on a different cloud:
- Anthropic Sonnet → Bedrock Sonnet (same model, different cloud)
- OpenAI GPT-5.5 → Azure OpenAI GPT-5.5
- Google Gemini → Vertex AI Gemini
The fallback is transparent to your app. Same response shape, same streaming behavior.
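To make the mechanics concrete, here's a minimal client-side sketch of the retry logic a gateway runs on your behalf. The model IDs in the chain are illustrative assumptions, not confirmed gateway slugs, and a real gateway configures this server-side rather than in your app:

```python
import os

from openai import APIConnectionError, APIStatusError, APITimeoutError, OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Hypothetical chain: same model, two clouds
FALLBACK_CHAIN = [
    "anthropic/claude-sonnet-4.6",  # primary: Anthropic direct
    "bedrock/claude-sonnet-4.6",    # secondary: same model on AWS Bedrock
]

def chat_with_fallback(messages):
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (APIStatusError, APITimeoutError, APIConnectionError) as err:
            last_err = err  # 5xx, 429, or timeout: move to the next provider
    raise last_err
```

A production gateway is smarter about this (it won't retry 4xx client errors, and it preserves streaming), but the shape is the same.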
Caching
Two flavors:
- Exact-match cache — hash the request body, return the prior response. Safe, with a high hit rate for deterministic settings (temperature 0).
- Semantic cache — embed the prompt, match by similarity threshold. Higher hit rate, but the false-positive case ships stale answers. Enable carefully.
Caching can drop median latency to single-digit ms (cache hit returns immediately) and cut cost dramatically for repeated content.
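Here's a minimal sketch of the exact-match flavor, assuming an in-memory store (a real gateway adds TTLs, eviction, and shared storage like Redis):

```python
import hashlib
import json

_cache = {}  # in-memory store; production gateways use a shared cache

def cache_key(request_body: dict) -> str:
    # Canonical JSON so key ordering doesn't change the hash
    canonical = json.dumps(request_body, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_completion(client, **request_body):
    key = cache_key(request_body)
    if key not in _cache:  # miss: pay for one provider call, then store it
        _cache[key] = client.chat.completions.create(**request_body)
    return _cache[key]     # hit: returns immediately, costs nothing
```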
Rate limiting and budgets
Provider rate limits are blunt — per-account, not per-feature. A gateway lets you set finer-grained budgets per user, per feature, per workspace, per minute, per day, per dollar. The right setup throttles a "free tier" feature when it exceeds its budget without touching your "premium" feature.
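As a rough sketch of that budget shape (the feature names and limits are hypothetical):

```python
from collections import defaultdict

# Hypothetical per-feature daily budgets in dollars
DAILY_BUDGET_USD = {"free_chat": 50.0, "premium_agent": 500.0}
_spend_today = defaultdict(float)

def record_cost(feature: str, cost_usd: float) -> None:
    _spend_today[feature] += cost_usd

def allow_request(feature: str) -> bool:
    # free_chat throttles when over budget; premium_agent is unaffected
    return _spend_today[feature] < DAILY_BUDGET_USD.get(feature, 0.0)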
Observability integration
Standalone gateways send you logs. The value compounds when the gateway is part of a platform that also handles tracing, evals, and prompt management — every model call becomes a span in the trace, scored by your evaluators, attributed to its prompt version. See LLM Observability.
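In practice, attribution rides on request metadata. A hedged sketch using custom headers — the header names here are assumptions, not a documented Respan API, so check your gateway's docs for the real ones:

```python
# Reusing the gateway-pointed client from the Unified API example
client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[...],
    # Hypothetical attribution headers for per-user, per-feature cost tracking
    extra_headers={"X-Feature": "support-agent", "X-User-Id": "user_123"},
)
```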
LLM gateway vs AI gateway
The two terms overlap heavily. "AI gateway" is the broader marketing term, often used by API-management vendors covering image, audio, embeddings beyond text LLMs. "LLM gateway" is narrower and product-team-focused. The capabilities overlap; pick whichever matches your stack.
Common gateway mistakes
- No fallback configured. It's half a gateway. Configure on day one.
- Semantic cache enabled by default. Wrong cache hit ships stale answers. Start with exact-match only.
- Treating the gateway as model-routing logic. Routing rules ("if query is short, use cheap model") belong in your app or a separate router service, not buried in gateway config.
- Forgetting cost guardrails. First runaway agent drains the monthly budget in 2 hours.
- One gateway shared across teams with no tenant isolation. Each team's bursty traffic affects the others. Use workspaces or per-tenant rate limits.
How to choose a gateway
See our Best LLM Gateways in 2026 listicle for a detailed comparison. Quick decision framework:
- Want gateway + observability + evals + prompts in one? → Respan
- Need maximum model variety? → OpenRouter
- Need open-source self-host? → LiteLLM
- Need enterprise governance? → Portkey
- Already on Cloudflare? → Cloudflare AI Gateway
- Already on Vercel? → Vercel AI Gateway
FAQ
Do I need a gateway from day one? For toy projects, no — call providers directly. For production, yes — you'll need one within 6 months and migrating production traffic later is painful.
Does a gateway add latency? A well-designed gateway adds 5-15ms P95 overhead. With caching, the gateway often reduces median latency.
Can a gateway handle streaming? Yes. Any production-grade gateway supports streaming SSE end-to-end.
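For reference, streaming through a gateway looks identical to streaming against a provider directly:

```python
# Same gateway-pointed client as before; chunks arrive as SSE deltas
stream = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```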
What's the difference between a gateway and a router? A gateway is the proxy + control plane. A router is one policy ("send small queries to cheap model") that runs inside the gateway.
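A routing policy can be as small as a function like this (the model IDs are hypothetical):

```python
def pick_model(prompt: str) -> str:
    # Toy router policy: short queries go to a cheaper model
    return "openai/gpt-5.5-mini" if len(prompt) < 200 else "openai/gpt-5.5"
```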
Should I build my own gateway? For most teams, no. Off-the-shelf gateways handle fallback, caching, rate limiting, and observability — building this in-house is months of work for a feature you can buy.