Gateway quickstart

Call 1000+ LLMs through one unified API with automatic tracing, prompts, fallbacks, and caching.

Respan’s AI Gateway is a single OpenAI-compatible endpoint that routes to 1000+ models across every major provider (OpenAI, Anthropic, Google, Bedrock, Azure, and more). Point any LLM SDK at Respan and you get automatic tracing, prompt management, fallbacks, load balancing, and caching, without changing how you call the model.

Respan AI Gateway schema: input flows through the LLM gateway to 1000+ LLMs, model fallback, load balancing, and prompt caching, producing optimized LLM output.

Considerations:

  • The gateway adds 50 to 150 ms of latency per request, so it may not suit products with strict latency requirements.
  • It places a third-party service in the core of your application, which may not suit teams that want to avoid that dependency.

Setup

1. Get your Respan API key

Create an account on Respan, then generate a key on the API keys page.

Environment management: create separate API keys for test and production instead of toggling an env parameter. This gives cleaner separation and better security.
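One way to apply this is to pick the key from the deployment environment at startup. The sketch below is illustrative only: the variable names RESPAN_API_KEY_TEST and RESPAN_API_KEY_PROD and the helper respan_key_for are hypothetical, not names that Respan defines.

```python
import os

# Hypothetical variable names; Respan does not prescribe these.
KEY_VARS = {
    "test": "RESPAN_API_KEY_TEST",
    "production": "RESPAN_API_KEY_PROD",
}


def respan_key_for(environment: str, env=None) -> str:
    """Return the Respan API key configured for the given environment."""
    env = os.environ if env is None else env
    if environment not in KEY_VARS:
        raise ValueError(f"unknown environment: {environment!r}")
    var_name = KEY_VARS[environment]
    key = env.get(var_name)
    if not key:
        # Fail loudly rather than silently falling back to another environment's key.
        raise RuntimeError(f"{var_name} is not set")
    return key
```

Failing on a missing key, instead of falling back to the other environment's key, keeps test traffic from reaching production by accident.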


2. Add credits to your Respan account

Top up on the Credits page. Respan uses your credits to call LLMs on your behalf, so you can reach any supported provider without managing individual provider keys.

Alternatively, connect your own provider credentials instead of using credits. Respan uses them only to call LLMs on your behalf. Add keys on the Providers page, and see Model providers for per-provider setup.


3. Point your app at the Respan endpoint

Route any LLM SDK to https://api.respan.ai/api/ using your Respan API key. Pick the path that fits your workflow.

The fastest way is the Respan CLI. It hands off to your coding agent (Claude Code, Cursor, Codex, and others), and the agent detects your project’s LLM provider and rewrites your existing calls to point at the gateway.

$ npx @respan/cli setup

It also creates a Respan API key for you and saves it to .env if one isn’t already set.
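If you would rather wire things up by hand and without an SDK, a minimal sketch using only the Python standard library might look like the following. The chat/completions path and Bearer header match the curl example in this guide; the helper names build_chat_request and send are hypothetical, and the response is assumed to follow the OpenAI chat completion shape.

```python
import json
import urllib.request

RESPAN_CHAT_URL = "https://api.respan.ai/api/chat/completions"


def build_chat_request(api_key: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        RESPAN_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling send(build_chat_request(key, "gpt-5.5", "Hello")) with a valid key performs the actual request. Building and sending are kept separate so the request can be inspected or logged first.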


4. Use prompts (Optional)

Manage prompt templates centrally instead of hardcoding them. Create and version a prompt in Respan, then reference it by prompt_id in your gateway calls to ship new versions without changing code.

Step 1: Create a prompt

Go to the Prompts page and create a new prompt. Write your system message and user template with {{variables}}:

System: You are a helpful assistant that speaks {{language}}.
User: {{user_message}}

Save and deploy. Copy the prompt ID.

Step 2: Use the prompt

Call the gateway with your prompt ID. Respan injects the prompt and fills in the variables:

$ curl -X POST "https://api.respan.ai/api/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_RESPAN_API_KEY" \
    -d '{
      "model": "gpt-5.5",
      "prompt_id": "YOUR_PROMPT_ID",
      "variables": {
        "language": "Spanish",
        "user_message": "What is machine learning?"
      }
    }'
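To make the variable substitution concrete, here is a small local illustration of filling {{name}} placeholders. It is not Respan's implementation: the render helper is hypothetical, and how the gateway treats missing variables or whitespace inside the braces is not stated in this guide.

```python
import re

# Matches {{name}} with optional whitespace inside the braces (an assumption).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from variables."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


system = render(
    "You are a helpful assistant that speaks {{language}}.",
    {"language": "Spanish"},
)
print(system)  # You are a helpful assistant that speaks Spanish.
```

Rendering a template locally like this can be a quick check that every placeholder in a prompt has a matching key in the variables object before you send the request.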

For versioning, deployment, and testing see Prompt management.


Next steps

You’re routing traffic through the gateway. Here’s what else it gives you.

What the gateway provides

  • One endpoint, 1000+ models: A single OpenAI-compatible endpoint routes to every major provider: OpenAI, Anthropic, Google, Bedrock, Azure, and more.
  • A platform, not just a proxy: Observability, evals, prompt management, monitors, and spend limits share the gateway’s data plane, with no second SDK or vendor.
  • Reliability: Ordered model fallback, load balancing across deployments and providers, and configurable auto-retries with backoff.
  • Custom attributes: Tag requests with customer_identifier and metadata, then break down latency, errors, and eval scores by them.
  • Caching: Response caching with configurable TTL and per-customer scoping.
  • Observability: Every call is traced end-to-end: model attempted, fallback fired, cache hit/miss, tokens, latency.
  • Evals & prompt management: Run online and offline eval pipelines against production traffic, and manage versioned prompts referenced by ID.
  • Limits & monitors: Soft and hard caps on requests or tokens, scoped per key, model, or customer, with Slack, email, or webhook alerts.
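As an illustration of the custom attributes item, the sketch below adds customer_identifier and metadata to a chat request body. The field names come from the list above, but their placement as top-level keys of the JSON body is an assumption, and tag_request is a hypothetical helper; check the API reference before relying on it.

```python
def tag_request(body: dict, customer_identifier: str, metadata: dict) -> dict:
    """Return a copy of a chat request body with attribution fields added.

    Assumption: both fields are top-level keys of the JSON body.
    """
    tagged = dict(body)  # shallow copy so the caller's dict is not mutated
    tagged["customer_identifier"] = customer_identifier
    tagged["metadata"] = dict(metadata)
    return tagged


base = {"model": "gpt-5.5", "messages": [{"role": "user", "content": "Hello"}]}
tagged = tag_request(base, "customer_123", {"feature": "onboarding"})
```

With the OpenAI Python SDK, extra fields like these are typically passed through the extra_body argument of chat.completions.create rather than as named parameters.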

Examples

Stream tokens as they’re generated by passing stream=True.

from openai import OpenAI

# Point the OpenAI SDK at the Respan gateway.
client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# Each chunk is a partial completion; print them as they arrive.
for chunk in response:
    print(chunk)
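Printing raw chunks is useful for debugging, but most applications want the text itself. The sketch below joins the text deltas of a stream, assuming the gateway returns OpenAI-shaped chunks (choices[0].delta.content). The collect_text and fake_chunk helpers are hypothetical, and the fake chunks stand in for a real stream so the example runs offline.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Concatenate the text deltas from an iterable of streaming chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:  # role-only and final chunks have no content
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


print(collect_text([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]))  # Hello
```

With a live stream, pass the response object from the streaming call straight to collect_text.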

Let the model call your functions by passing a tools array.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

# JSON Schema description of the function the model may call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
completion = client.chat.completions.create(
    model="gpt-5.5",
    messages=messages,
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call a tool
)
print(completion)
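The model does not run the function itself; it returns tool calls that your code must execute. A minimal sketch of that step follows, assuming the response uses the OpenAI tool_calls shape. The get_current_weather body is a stand-in, run_tool_calls is a hypothetical helper, and the fake message replaces a live response so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_current_weather(location: str, unit: str = "fahrenheit") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


LOCAL_TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(message) -> list:
    """Execute each requested tool and return 'tool' role messages with the results."""
    results = []
    for call in message.tool_calls or []:
        handler = LOCAL_TOOLS[call.function.name]
        arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(handler(**arguments)),
        })
    return results


# Fake assistant message shaped like completion.choices[0].message.
fake_message = SimpleNamespace(tool_calls=[
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(
            name="get_current_weather",
            arguments='{"location": "Boston, MA"}',
        ),
    )
])
tool_messages = run_tool_calls(fake_message)
```

In a real exchange you would append the assistant message and these tool messages to messages, then call the gateway again so the model can compose its final answer.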

Have supported models return their reasoning before the final answer by passing a thinking config.

payload = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 10000
    },
    "messages": [
        {
            "role": "user",
            "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
        }
    ]
}

Choose a model that supports thinking, such as gpt-5 or claude-sonnet-4-5-20250929. See Log content types for details on the response structure.

Send PDFs as file content blocks for the model to read.

import os

import requests
from openai import OpenAI

# Upload the PDF through the OpenAI Files API (reads OPENAI_API_KEY from the environment).
openai_client = OpenAI()
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
pdf_response = requests.get(pdf_url)
pdf_response.raise_for_status()
# Pass (filename, bytes) so the upload carries a .pdf file name.
file = openai_client.files.create(
    file=("dummy.pdf", pdf_response.content),
    purpose="user_data",
)

# Send the chat request through the Respan gateway.
client = OpenAI(
    base_url="https://api.respan.ai/api",
    api_key=os.getenv("RESPAN_API_KEY"),
)

file_content = [
    {"type": "text", "text": "What's this file about?"},
    {
        "type": "file",
        "file": {
            "file_id": file.id,
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {
            "role": "user",
            "content": file_content,
        }
    ],
)
print(response.choices[0].message.content)

Pass images using image_url content blocks or via prompt variables.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What do you see?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
    ]}],
)
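For images that are not publicly hosted, the OpenAI chat format also accepts a base64 data URL in the same image_url field. The sketch below builds such a content block. The to_data_url and image_block helpers are hypothetical, and whether the gateway forwards data URLs to every provider is an assumption worth verifying for your model.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build an image_url content block from raw image bytes."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, mime_type)},
    }
```

To use it, read a local file with open(path, "rb").read() and put the resulting block in the content list in place of the hosted URL block.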