An LLM gateway is a unified, OpenAI-compatible proxy in front of every model provider — OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, open-source endpoints. Your application makes one API call; the gateway handles routing, provider fallback, caching, rate limiting, cost guardrails, and observability so your application code stays simple.
For the comprehensive treatment, see our LLM Gateway pillar guide. This is the short version.
TL;DR
A gateway sits between your application and LLM providers. It does five things:
- Unified API — one OpenAI-compatible interface across hundreds of models
- Provider fallback — when one provider has an outage, automatically retry on another
- Caching — exact-match and/or semantic, dramatically cuts cost on repeated queries
- Rate limiting and budgets — per user, per feature, per dollar
- Observability — every call captured with cost, latency, model, attribution
If you're running production LLM workloads past month 6, you need one.
Why teams adopt a gateway
Five forcing functions, in order:
- Provider outage. Anthropic returns 503s for 25 minutes. Without fallback, your support agent is dead. With a gateway, you fail over to Bedrock and users notice nothing.
- Cost spike. Some feature or user is consuming 10× the average. Without per-call attribution, you find out at month-end.
- Model evaluation. You want to test Claude vs GPT for a feature. Without a gateway, comparing means changing app code. With a gateway, it's a config change and a 5% traffic split.
- Rate limits become a tax. Provider limits are per-account. The gateway shapes traffic so no single feature starves the others.
- Compliance asks where prompts go. A gateway is the natural choke point for redaction, audit logs, regional routing.
Core capabilities
Unified API
Every modern gateway exposes the OpenAI Chat Completions API as the lingua franca. Your app uses openai-python or openai-node pointed at the gateway base URL. The gateway translates to each provider's native API on the back end.
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Now call any of 500+ models with the same client
client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[...],
)
```
Switching to GPT-5.5 is a one-line change to the model string. Adding fallback to Azure is a header.
Provider fallback
On 5xx, timeout, or rate limit from the primary, the gateway retries against a configured secondary — usually the same model on a different cloud:
- Anthropic Sonnet → Bedrock Sonnet (same model, different cloud)
- OpenAI GPT-5.5 → Azure OpenAI GPT-5.5
- Google Gemini → Vertex AI Gemini
The fallback is transparent to your app. Same response shape, same streaming behavior.
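To make the mechanics concrete, here's a minimal client-side sketch of the retry logic a gateway runs on your behalf. The model IDs in the chain are illustrative assumptions, not confirmed gateway slugs, and a real gateway configures this server-side rather than in your app:

```python
import os

from openai import APIConnectionError, APIStatusError, APITimeoutError, OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Hypothetical chain: same model, two clouds
FALLBACK_CHAIN = [
    "anthropic/claude-sonnet-4.6",  # primary: Anthropic direct
    "bedrock/claude-sonnet-4.6",    # secondary: same model on AWS Bedrock
]

def chat_with_fallback(messages):
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (APIStatusError, APITimeoutError, APIConnectionError) as err:
            last_err = err  # 5xx, 429, or timeout: move to the next provider
    raise last_err
```

A production gateway is smarter about this (it won't retry 4xx client errors, and it preserves streaming), but the shape is the same.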
Caching
Two flavors:
- Exact-match cache — hash the request body, return the prior response. Safe, with a high hit rate for deterministic settings (temperature 0).
- Semantic cache — embed the prompt, match by similarity threshold. Higher hit rate, but the false-positive case ships stale answers. Enable carefully.
Caching can drop median latency to single-digit ms (cache hit returns immediately) and cut cost dramatically for repeated content.
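Here's a minimal sketch of the exact-match flavor, assuming an in-memory store (a real gateway adds TTLs, eviction, and shared storage like Redis):

```python
import hashlib
import json

_cache = {}  # in-memory store; production gateways use a shared cache

def cache_key(request_body: dict) -> str:
    # Canonical JSON so key ordering doesn't change the hash
    canonical = json.dumps(request_body, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_completion(client, **request_body):
    key = cache_key(request_body)
    if key not in _cache:  # miss: pay for one provider call, then store it
        _cache[key] = client.chat.completions.create(**request_body)
    return _cache[key]     # hit: returns immediately, costs nothing
```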
Rate limiting and budgets
Provider rate limits are blunt — per-account, not per-feature. A gateway lets you set finer-grained budgets per user, per feature, per workspace, per minute, per day, per dollar. The right setup throttles a "free tier" feature when it exceeds its budget without touching your "premium" feature.
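As a rough sketch of that budget shape (the feature names and limits are hypothetical):

```python
from collections import defaultdict

# Hypothetical per-feature daily budgets in dollars
DAILY_BUDGET_USD = {"free_chat": 50.0, "premium_agent": 500.0}
_spend_today = defaultdict(float)

def record_cost(feature: str, cost_usd: float) -> None:
    _spend_today[feature] += cost_usd

def allow_request(feature: str) -> bool:
    # free_chat throttles when over budget; premium_agent is unaffected
    return _spend_today[feature] < DAILY_BUDGET_USD.get(feature, 0.0)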
Observability integration
Standalone gateways send you logs. The value compounds when the gateway is part of a platform that also handles tracing, evals, and prompt management — every model call becomes a span in the trace, scored by your evaluators, attributed to its prompt version. See LLM Observability.
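In practice, attribution rides on request metadata. A hedged sketch using custom headers — the header names here are assumptions, not a documented Respan API, so check your gateway's docs for the real ones:

```python
# Reusing the gateway-pointed client from the Unified API example
client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[...],
    # Hypothetical attribution headers for per-user, per-feature cost tracking
    extra_headers={"X-Feature": "support-agent", "X-User-Id": "user_123"},
)
```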
LLM gateway vs AI gateway
The two terms overlap heavily. "AI gateway" is the broader marketing term, often used by API-management vendors covering image, audio, embeddings beyond text LLMs. "LLM gateway" is narrower and product-team-focused. The capabilities overlap; pick whichever matches your stack.
Common gateway mistakes
- No fallback configured. It's half a gateway. Configure on day one.
- Semantic cache enabled by default. Wrong cache hit ships stale answers. Start with exact-match only.
- Treating the gateway as model-routing logic. Routing rules ("if query is short, use cheap model") belong in your app or a separate router service, not buried in gateway config.
- Forgetting cost guardrails. First runaway agent drains the monthly budget in 2 hours.
- One gateway shared across teams with no tenant isolation. Each team's bursty traffic affects the others. Use workspaces or per-tenant rate limits.
How to choose a gateway
See our Best LLM Gateways in 2026 listicle for a detailed comparison. Quick decision framework:
- Want gateway + observability + evals + prompts in one? → Respan
- Need maximum model variety? → OpenRouter
- Need open-source self-host? → LiteLLM
- Need enterprise governance? → Portkey
- Already on Cloudflare? → Cloudflare AI Gateway
- Already on Vercel? → Vercel AI Gateway
FAQ
Do I need a gateway from day one? For toy projects, no — call providers directly. For production, yes — you'll need one within 6 months and migrating production traffic later is painful.
Does a gateway add latency? A well-designed gateway adds 5-15ms P95 overhead. With caching, the gateway often reduces median latency.
Can a gateway handle streaming? Yes. Any production-grade gateway supports streaming SSE end-to-end.
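For reference, streaming through a gateway looks identical to streaming against a provider directly:

```python
# Same gateway-pointed client as before; chunks arrive as SSE deltas
stream = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```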
What's the difference between a gateway and a router? A gateway is the proxy + control plane. A router is one policy ("send small queries to cheap model") that runs inside the gateway.
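A routing policy can be as small as a function like this (the model IDs are hypothetical):

```python
def pick_model(prompt: str) -> str:
    # Toy router policy: short queries go to a cheaper model
    return "openai/gpt-5.5-mini" if len(prompt) < 200 else "openai/gpt-5.5"
```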
Should I build my own gateway? For most teams, no. Off-the-shelf gateways handle fallback, caching, rate limiting, and observability — building this in-house is months of work for a feature you can buy.