Reliable LLM access for teams running models in production - with automatic failover, load balancing, semantic caching, and cost controls built in.
<50ms
median gateway overhead
100+
LLM models supported
99.9%
uptime with failover
~35%
avg. cost reduction via routing + caching
Every team that runs LLMs in production eventually writes ad hoc versions of the same infrastructure. Here's what you're solving without a gateway.
✗ Provider outages become user-facing downtime
Without automatic failover, a 429 or 5xx from your LLM provider surfaces directly as a failed request. There's no fallback - users see errors.
✗ Rate limits spike at exactly the wrong time
Traffic spikes saturate a single API key. Without load balancing across keys or providers, your error rate climbs with your user count.
✗ Every request hits your most expensive model
Without routing rules, simple classification tasks go to GPT-4o, and you have no way to direct traffic by complexity, cost, or capability.
✗ You pay to generate the same response thousands of times
High-traffic apps re-generate near-identical answers on every request. Without semantic caching, each one hits the model and costs tokens.
✗ Switching providers requires rewriting integrations
Each provider has different APIs, auth schemes, and error formats. Without an abstraction layer, you write adapter code for each and maintain all of it.
Every capability is active from the moment you change your base URL. No infrastructure to deploy.
The gateway sits as a transparent proxy between your application and LLM providers. Your app sends requests to a single Respan endpoint using the same client you already have. Respan evaluates routing rules, checks the cache, applies policies, and forwards the request to the selected provider - all within the request lifecycle.
Send request
Your app sends a request to api.keywordsai.co using the OpenAI SDK. Only base_url changes.
Gateway evaluates
Routing rules, cache check, rate limits, and budget caps are evaluated in under 10ms.
Provider call
Request is forwarded to the selected provider. On failure, the fallback chain activates automatically.
Logged and returned
Response is returned to your app. Every routing decision, token count, and cost is logged.
import openai
# Before: calling OpenAI directly
client = openai.OpenAI(api_key="sk-...")
# After: route through Respan gateway
client = openai.OpenAI(
api_key="YOUR_RESPAN_KEY",
base_url="https://api.keywordsai.co/api/"
)
# Everything else stays the same.
# Routing, fallbacks, caching, and logging are now active.
response = client.chat.completions.create(
model="gpt-4o", # or "claude-3-5-sonnet", "gemini-1.5-pro", etc.
messages=[{"role": "user", "content": "..."}]
)

[Architecture diagram: app → Respan gateway → multiple LLM providers with fallback arrows]
If you're building a product with multiple customers: enforce per-customer token budgets so one noisy account doesn't drain your entire LLM budget. Block requests that exceed a monthly cap.
If you're running AI agents that can't afford to fail mid-task: define fallback chains per provider so a rate limit or outage doesn't interrupt an in-progress agent run.
If you're handling thousands of similar queries: semantic caching serves stored responses for near-identical inputs, cutting token spend by 30–50% on repetitive workloads.
If you're evaluating whether to switch models: split production traffic between providers, collect live latency and cost data, and compare performance before committing.
If you're operating under data policies: apply content filters and request policies at the gateway layer - consistently, before data reaches any provider.
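In practice, the use cases above mostly reduce to per-request gateway options passed alongside a normal chat-completion payload. A minimal sketch of what that could look like - note that `customer_identifier`, `fallback_models`, and `cache_enabled` are illustrative field names assumed for this sketch, not confirmed Respan API parameters:

```python
# Hypothetical per-request gateway options. Field names are illustrative
# assumptions for this sketch, not confirmed Respan API parameters.
def build_gateway_options(customer_id: str) -> dict:
    return {
        # Attribute spend to a customer so per-customer budget caps apply
        "customer_identifier": customer_id,
        # Models to try in order if the primary model errors out
        "fallback_models": ["claude-3-5-sonnet", "gemini-1.5-pro"],
        # Serve semantically near-identical prompts from cache
        "cache_enabled": True,
    }

options = build_gateway_options("acct_1234")
```

With the OpenAI SDK, options like these would typically ride along in `extra_body` on the `chat.completions.create` call, so the rest of your integration code stays untouched.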
Model providers
Frameworks
Languages
A load balancer distributes traffic without understanding LLM semantics. It doesn't know what a 429 means vs a 5xx, doesn't understand model capabilities, and can't make decisions based on cost, context length, or request content. You end up writing that logic yourself - and maintaining it per provider. Respan is that logic, already built and operated, plus observability on every decision it makes.
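To make the distinction concrete, here is a minimal sketch of the status-code triage the paragraph describes - the decision a generic load balancer can't make, and the logic teams otherwise write and maintain per provider. Function and action names are illustrative, not part of any real API:

```python
# A plain load balancer treats all upstream errors alike; an LLM gateway
# has to tell them apart. Sketch of the triage you would otherwise write
# yourself (function and action names are illustrative).
def triage_provider_error(status_code: int) -> str:
    if status_code == 429:
        # Rate limited: rotate to another API key or back off - the same
        # provider will likely succeed shortly
        return "rotate_key_or_backoff"
    if 500 <= status_code < 600:
        # Provider-side failure: fail over to the next provider in the chain
        return "failover_next_provider"
    if status_code in (400, 401, 403, 404):
        # Client-side errors won't succeed anywhere else; surface them
        return "raise_to_caller"
    return "retry_once"
```

The point of the sketch is the branch structure itself: a 429 and a 503 demand different recovery strategies, and only a layer that understands provider semantics can choose between them.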