TL;DR
An LLM gateway is a unified, OpenAI-compatible proxy in front of every model provider — OpenAI, Anthropic, Google, Bedrock, Azure, open-source endpoints. It handles routing, provider fallback, caching, rate limiting, cost guardrails, and observability so your application code stays simple. Most teams realize they need one around month 6 of production; the smart ones adopt it on day one.
What is an LLM gateway?
An LLM gateway is a single API that sits between your application and every LLM provider you might call — OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, Mistral, fine-tuned and open-source endpoints. Your app code makes one request in OpenAI-compatible format; the gateway dispatches to the chosen provider, handles retries and fallback, applies caching and rate limits, attaches observability, and returns the response.
The pattern borrows from API gateways in microservices (Kong, Apigee, Tyk) but specializes for LLM workloads: streaming, token accounting, model-aware routing, and the eccentricities of each provider's API.
Why teams adopt a gateway
Five forcing functions, in roughly the order teams hit them:
- A provider has an outage. Anthropic returns 503s for 20 minutes. Without fallback, your customer support agent is dead. With a gateway, you fail over to Bedrock Sonnet and users notice nothing.
- Cost spikes mysteriously. Some feature or some user is consuming 10× the average. Without per-call attribution and budget caps, you find out at month-end. Gateways enforce per-feature, per-user budgets in real time.
- You want to test a new model. A new Claude variant ships. Without abstraction, comparing it to your current model means changing app code. With a gateway, it's a config change and a 5% traffic split.
- Rate limits become a tax. Provider rate limits are per-account, not per-feature. The gateway lets you shape traffic so no single feature starves the others.
- Compliance asks where prompts are going. A gateway is the natural choke point for redaction, audit logs, and regional routing.

The case for a gateway is risk management more than abstraction. Provider downtime is the silent killer — Anthropic 503s in a customer support agent destroy user experience without even paging on-call, because your app technically returns 200s through retries until it finally gives up. The right time to add a gateway is before the outage, not after.
The teams I've watched ship without one all hit the same wall: a 30-minute provider degradation that costs them more than three years of gateway licensing. If you're past month 6 in production and still calling providers directly, you're shipping single-provider lock-in to your customers, and they pay for it the next time the model burps.
Core capabilities
Unified API (OpenAI-compatible)
Every modern gateway exposes the OpenAI Chat Completions API as the lingua franca. Your app code uses openai-python or openai-node pointed at the gateway base URL; the gateway translates to each provider's native API on the back end. This is the abstraction that pays back every time you swap models.
Provider fallback
On a primary-provider error (5xx, timeout, rate limit), the gateway retries the same request against a configured secondary. The flow:
Anthropic returns a 503 → Bedrock takes over → your app still sees a 200. The retry adds latency only on the failed request; cache hits on the happy path still return in single-digit milliseconds.
Common pairings:
- Anthropic Sonnet → Bedrock Sonnet (same model, different cloud)
- OpenAI GPT-4o → Azure OpenAI GPT-4o
- Google Gemini Pro → Vertex AI Gemini Pro
The fallback should be transparent to your app — same response shape, same streaming behavior. If your gateway requires app code changes for fallback, it's not really a gateway.
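To make the contrast concrete, here is a minimal sketch of the manual fallback loop a gateway absorbs into config. The endpoints, model names, environment variable, and error handling are illustrative assumptions, not how Respan implements it.
# Illustrative sketch only: the manual fallback loop a gateway replaces.
# Endpoints, model names, and env var are hypothetical placeholders.
import os
from openai import OpenAI, APIConnectionError, APIStatusError

PROVIDERS = [
    {"base_url": "https://primary-provider.example/v1", "model": "claude-3-5-sonnet"},
    {"base_url": "https://fallback-provider.example/v1", "model": "anthropic.claude-3-5-sonnet"},
]

def complete_with_fallback(messages):
    last_error = None
    for provider in PROVIDERS:
        client = OpenAI(base_url=provider["base_url"], api_key=os.environ["PROVIDER_API_KEY"])
        try:
            return client.chat.completions.create(model=provider["model"], messages=messages)
        except (APIStatusError, APIConnectionError) as exc:
            last_error = exc  # 5xx, rate limit, or timeout: move on to the next provider
    raise last_error
With a gateway, that whole loop collapses to a fallback entry in config (or a header, as shown in the instrumentation example below).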
Caching
Two flavors:
- Exact-match cache: hash the request body, return the prior response (sketched after this list). Safe and high-hit for deterministic settings (temperature 0, fixed seed).
- Semantic cache: embed the prompt, match by similarity threshold. Higher hit rate, but the false-positive case is shipping a stale answer to a slightly-different question. Enable carefully.
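For the exact-match flavor, the key is just a hash of the canonicalized request body. A rough sketch, assuming an in-memory dict where a real gateway would use a shared store:
import hashlib
import json

_cache = {}  # in-memory stand-in for the gateway's shared cache store

def cache_key(request_body: dict) -> str:
    # Canonicalize so identical requests hash identically regardless of key order.
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_completion(client, request_body: dict):
    key = cache_key(request_body)
    if key in _cache:
        return _cache[key]                    # exact-match hit: no provider call
    response = client.chat.completions.create(**request_body)
    if request_body.get("temperature") == 0:  # only cache deterministic settings
        _cache[key] = response
    return response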
Rate limiting and budgets
Provider rate limits are blunt. A gateway lets you set finer-grained budgets: per user, per feature, per workspace, per minute, per day, per dollar. Shaped well, a free-tier feature can throttle when it exceeds its budget without starving your premium feature.
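Conceptually, a per-feature dollar budget is just a counter checked before every call. A minimal sketch with made-up feature names and limits (not Respan's config schema):
from collections import defaultdict

# Hypothetical daily budgets in dollars per feature; a real gateway keeps these in config.
DAILY_BUDGET_USD = {"support_agent": 50.0, "free_tier_summaries": 5.0}
_spend_today = defaultdict(float)

def check_and_record(feature: str, estimated_cost_usd: float) -> bool:
    """Return True if the call is allowed under the feature's daily budget."""
    budget = DAILY_BUDGET_USD.get(feature, 0.0)
    if _spend_today[feature] + estimated_cost_usd > budget:
        return False  # over budget: the free-tier feature throttles, premium keeps working
    _spend_today[feature] += estimated_cost_usd
    return True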
Observability + evals integration
Standalone gateways send you logs, but the value compounds when the gateway is part of a platform that also handles tracing, observability, and evals. Every model call becomes a span in the trace, scored by your evaluators, attributed to its prompt version. Respan's gateway is built around this.
How to instrument
OpenAI-compatible drop-in. The total app-code change to route through Respan's gateway:
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Now call any of 500+ models with the same client
client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "..."}],
    extra_headers={
        "x-respan-feature": "support_agent",
        "x-respan-fallback": "bedrock/anthropic.claude-3-5-sonnet",
    },
)

Switching to GPT-4o is a one-line change to the model string. Adding fallback to Azure is a header. Routing 5% of traffic to a fine-tune for an A/B test is a config change.
What fallback actually saves you
Provider outages happen on a regular cadence — every major LLM API has had multi-minute regional capacity events in the last twelve months. With a gateway and configured fallback, your customer-support agent stays on Bedrock Sonnet while Anthropic Sonnet is returning 5xx, your app sees zero failed requests, and the incident shows up in your post-mortem only because someone noticed the spike in fallback rate on a dashboard.
The user impact in those windows is usually a small bump in median latency from the retry hop and nothing else. Total engineer time spent reacting: zero. Total customer impact: invisible. That's the case for the gateway in one sentence.
The teams that don't have a gateway live the inverse story every quarter. The math doesn't get better as you add providers; the entire industry's effective uptime becomes whichever provider you've coupled to.
Common gateway mistakes
- No fallback configured. The gateway is half a gateway. Fix on day one.
- Semantic cache enabled by default. The wrong cache hit ships a stale or wrong answer. Start with exact-match only; turn on semantic only after you understand your false-positive tolerance.
- Treating the gateway as the place to do model routing logic. Routing rules ("if query is short, use cheap model") belong in your app or a separate router service — not buried in gateway config nobody reads.
- Forgetting cost guardrails. The first runaway agent that makes a recursive tool call drains the monthly budget in 2 hours.
- Single-tenant gateway shared across teams. Each team's bursty traffic affects the others. Use workspaces or rate-limit per logical tenant.
LLM gateways compared
Six options most teams evaluate; details current as of May 2026.
| Tool | Models supported | OpenAI-compatible | Provider fallback | Caching | Rate limiting | Cost guardrails | Self-host | Observability |
|---|---|---|---|---|---|---|---|---|
| Respan Gateway | 500+ | Yes | Yes | Yes | Yes | Yes | Yes (Enterprise) | Built-in |
| OpenRouter | 300+ | Yes | No | No | Yes | No | No | Standalone |
| LiteLLM | 100+ | Yes | Yes | Yes | Yes | Partial | Yes (OSS) | Standalone |
| Portkey | 250+ | Yes | Yes | Yes | Yes | Yes | Enterprise only | Built-in |
| Cloudflare AI Gateway | 50+ | Partial | Yes | Yes | Yes | Yes | No | Basic |
| Helicone | 100+ | Yes | Yes | Yes | Yes | No | Yes (OSS) | Built-in |