Reliable LLM access for teams running models in production - with automatic failover, load balancing, semantic caching, and cost controls built in.
<50ms
median gateway overhead
100+
LLM models supported
99.9%
uptime with failover
~35%
avg. cost reduction via routing + caching
Every team that runs LLMs in production eventually writes ad hoc versions of the same infrastructure. Here's what you're solving without a gateway.
✗ Provider outages become user-facing downtime
Without automatic failover, a 429 or 5xx from your LLM provider surfaces directly as a failed request. There's no fallback - users see errors.
✗ Rate limits spike at exactly the wrong time
Traffic spikes saturate a single API key. Without load balancing across keys or providers, your error rate climbs with your user count.
✗ Every request hits your most expensive model
Without routing rules, simple classification tasks go to GPT-4o, and you have no way to direct traffic by complexity, cost, or capability.
✗ You pay to generate the same response thousands of times
High-traffic apps re-generate near-identical answers on every request. Without semantic caching, each one hits the model and costs tokens.
✗ Switching providers requires rewriting integrations
Each provider has different APIs, auth schemes, and error formats. Without an abstraction layer, you write adapter code for each and maintain all of it.
Every capability is active from the moment you change your base URL. No infrastructure to deploy.
The gateway sits as a transparent proxy between your application and LLM providers. Your app sends requests to a single Respan endpoint using the same client you already have. Respan evaluates routing rules, checks the cache, applies policies, and forwards the request to the selected provider - all within the request lifecycle.
Send request
Your app sends a request to api.keywordsai.co using the OpenAI SDK. Only base_url changes.
Gateway evaluates
Routing rules, cache check, rate limits, and budget caps are evaluated in under 10ms.
Provider call
Request is forwarded to the selected provider. On failure, the fallback chain activates automatically.
Logged and returned
Response is returned to your app. Every routing decision, token count, and cost is logged.
import openai
# Before: calling OpenAI directly
client = openai.OpenAI(api_key="sk-...")
# After: route through Respan gateway
client = openai.OpenAI(
api_key="YOUR_RESPAN_KEY",
base_url="https://api.keywordsai.co/api/"
)
# Everything else stays the same.
# Routing, fallbacks, caching, and logging are now active.
response = client.chat.completions.create(
model="gpt-4o", # or "claude-3-5-sonnet", "gemini-1.5-pro", etc.
messages=[{"role": "user", "content": "..."}]
)

[Architecture diagram: app → Respan gateway → multiple LLM providers with fallback arrows]
If you're building a product with multiple customers: enforce per-customer token budgets so one noisy account doesn't drain your entire LLM budget. Block requests that exceed a monthly cap.
If you're running AI agents that can't afford to fail mid-task: define fallback chains per provider so a rate limit or outage doesn't interrupt an in-progress agent run.
If you're handling thousands of similar queries: semantic caching serves stored responses for near-identical inputs, cutting token spend by 30–50% on repetitive workloads.
If you're evaluating whether to switch models: split production traffic between providers, collect live latency and cost data, and compare performance before committing.
If you're operating under data policies: apply content filters and request policies at the gateway layer - consistently, before data reaches any provider.
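In practice, the use cases above mostly reduce to per-request gateway options passed alongside a normal chat-completion payload. A minimal sketch of what that could look like - note that `customer_identifier`, `fallback_models`, and `cache_enabled` are illustrative field names assumed for this sketch, not confirmed Respan API parameters:

```python
# Hypothetical per-request gateway options. Field names are illustrative
# assumptions for this sketch, not confirmed Respan API parameters.
def build_gateway_options(customer_id: str) -> dict:
    return {
        # Attribute spend to a customer so per-customer budget caps apply
        "customer_identifier": customer_id,
        # Models to try in order if the primary model errors out
        "fallback_models": ["claude-3-5-sonnet", "gemini-1.5-pro"],
        # Serve semantically near-identical prompts from cache
        "cache_enabled": True,
    }

options = build_gateway_options("acct_1234")
```

With the OpenAI SDK, options like these would typically ride along in `extra_body` on the `chat.completions.create` call, so the rest of your integration code stays untouched.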
Model providers
Frameworks
Languages
A load balancer distributes traffic without understanding LLM semantics. It doesn't know what a 429 means vs a 5xx, doesn't understand model capabilities, and can't make decisions based on cost, context length, or request content. You end up writing that logic yourself - and maintaining it per provider. Respan is that logic, already built and operated, plus observability on every decision it makes.
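To make the distinction concrete, here is a minimal sketch of the status-code triage the paragraph describes - the decision a generic load balancer can't make, and the logic teams otherwise write and maintain per provider. Function and action names are illustrative, not part of any real API:

```python
# A plain load balancer treats all upstream errors alike; an LLM gateway
# has to tell them apart. Sketch of the triage you would otherwise write
# yourself (function and action names are illustrative).
def triage_provider_error(status_code: int) -> str:
    if status_code == 429:
        # Rate limited: rotate to another API key or back off - the same
        # provider will likely succeed shortly
        return "rotate_key_or_backoff"
    if 500 <= status_code < 600:
        # Provider-side failure: fail over to the next provider in the chain
        return "failover_next_provider"
    if status_code in (400, 401, 403, 404):
        # Client-side errors won't succeed anywhere else; surface them
        return "raise_to_caller"
    return "retry_once"
```

The point of the sketch is the branch structure itself: a 429 and a 503 demand different recovery strategies, and only a layer that understands provider semantics can choose between them.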