Gateway quickstart | Respan Docs

Respan’s AI Gateway is a single OpenAI-compatible endpoint that routes to 1000+ models across every major provider (OpenAI, Anthropic, Google, Bedrock, Azure, and more). Point any LLM SDK at Respan and you get automatic tracing, prompt management, fallbacks, load balancing, and caching, without changing how you call the model.

Respan AI Gateway schema: input flows through the LLM gateway to 1000+ LLMs, model fallback, load balancing, and prompt caching, producing optimized LLM output.

Considerations:

May not suit products with strict latency requirements: routing through the gateway adds 50 to 150 ms per request.
May not suit teams that don’t want a third-party service in the core request path of their application.

Setup

1. Get your Respan API key

Create an account on Respan, then generate a key on the API keys page.

Environment management: create separate API keys for test and production instead of toggling an env parameter. This gives you cleaner separation and better security.

2. Add credits to your Respan account

Top up on the Credits page. Respan uses your credits to call LLMs on your behalf, so you can reach any supported provider without managing individual provider keys.

Alternative: bring your own provider key

Connect your own provider credentials instead of using credits. We use them to call LLMs on your behalf and never for anything else. Add keys on the Providers page. See Model providers for setup per provider.

3. Point your app at the Respan endpoint

Route any LLM SDK to https://api.respan.ai/api/ using your Respan API key. Pick the path that fits your workflow.

AI (CLI)

The fastest way. The Respan CLI hands off to your coding agent (Claude Code, Cursor, Codex, and others), which detects your project’s LLM provider and rewrites your existing calls to point at the gateway.

$ npx @respan/cli setup

It also creates a Respan API key for you and saves it to .env if one isn’t already set.
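
For the manual path, the change usually amounts to overriding the SDK's base URL and API key. The sketch below assumes the OpenAI Python SDK and a RESPAN_API_KEY environment variable; `make_gateway_client` is a hypothetical helper name, not part of any Respan package.

```python
import os

# Base URL taken from this quickstart.
RESPAN_BASE_URL = "https://api.respan.ai/api/"


def make_gateway_client(api_key=None):
    """Return an OpenAI SDK client that sends requests to the Respan gateway."""
    key = api_key or os.getenv("RESPAN_API_KEY")
    if not key:
        raise RuntimeError("Set RESPAN_API_KEY or pass api_key explicitly")
    # Imported lazily so this module still loads where the SDK is not installed.
    from openai import OpenAI

    return OpenAI(base_url=RESPAN_BASE_URL, api_key=key)


# Only call the gateway when a key is configured.
if os.getenv("RESPAN_API_KEY"):
    client = make_gateway_client()
    reply = client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(reply.choices[0].message.content)
```

Everything else about the call stays as it was; only the client construction changes.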

4. Use prompts (Optional)

Manage prompt templates centrally instead of hardcoding them. Create and version a prompt in Respan, then reference it by prompt_id in your gateway calls to ship new versions without changing code.

Create a prompt

Go to the Prompts page and create a new prompt. Write your system message and user template with {{variables}}:

System: You are a helpful assistant that speaks {{language}}.
User: {{user_message}}

Save and deploy. Copy the prompt ID.
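
To make the {{variables}} syntax concrete, here is a small local illustration of placeholder substitution. It is only a sketch: Respan performs the substitution on its side, and details such as whitespace handling or the behavior on missing variables are assumptions here. `render_template` is a hypothetical name.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, variables):
    """Replace each {{name}} with variables[name]; raise if a name is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


system = render_template(
    "You are a helpful assistant that speaks {{language}}.",
    {"language": "Spanish"},
)
print(system)  # You are a helpful assistant that speaks Spanish.
```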

Use the prompt

Call the gateway with your prompt ID. Respan injects the prompt and fills in the variables:

$ curl -X POST "https://api.respan.ai/api/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_RESPAN_API_KEY" \
  -d '{
    "model": "gpt-5.5",
    "prompt_id": "YOUR_PROMPT_ID",
    "variables": {
      "language": "Spanish",
      "user_message": "What is machine learning?"
    }
  }'
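
The same request can be sketched in Python with only the standard library. The URL, headers, and body mirror the curl call; the helper names `build_prompt_request` and `call_gateway` and the RESPAN_API_KEY environment variable are assumptions for illustration.

```python
import json
import os
import urllib.request

GATEWAY_URL = "https://api.respan.ai/api/chat/completions"


def build_prompt_request(prompt_id, variables, model="gpt-5.5"):
    """Build the same JSON body as the curl example."""
    return {"model": model, "prompt_id": prompt_id, "variables": dict(variables)}


def call_gateway(body, api_key):
    """POST the body to the gateway and return the decoded JSON response."""
    request = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


body = build_prompt_request(
    "YOUR_PROMPT_ID",
    {"language": "Spanish", "user_message": "What is machine learning?"},
)

# Only hit the network when a key is configured.
api_key = os.getenv("RESPAN_API_KEY")
if api_key:
    print(call_gateway(body, api_key))
```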

For versioning, deployment, and testing see Prompt management.

Next steps

You’re routing traffic through the gateway. Here’s what else it gives you.

What the gateway provides

One endpoint, 1000+ models: A single OpenAI-compatible endpoint routes to every major provider: OpenAI, Anthropic, Google, Bedrock, Azure, and more.
A platform, not just a proxy: Observability, evals, prompt management, monitors, and spend limits share the gateway’s data plane, with no second SDK or vendor.
Reliability: Ordered model fallback, load balancing across deployments and providers, and configurable auto-retries with backoff.
Custom attributes: Tag requests with customer_identifier and metadata, then break down latency, errors, and eval scores by them.
Caching: Response caching with configurable TTL and per-customer scoping.
Observability: Every call is traced end-to-end: model attempted, fallback fired, cache hit/miss, tokens, latency.
Evals & prompt management: Run online and offline eval pipelines against production traffic, and manage versioned prompts referenced by ID.
Limits & monitors: Soft and hard caps on requests or tokens, scoped per key, model, or customer, with Slack, email, or webhook alerts.
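
As a rough sketch of the custom attributes item above, the snippet below attaches customer_identifier and metadata to a request body. The field names come from this page, but placing them at the top level of the body (like prompt_id and variables) is an assumption, and `tag_request` is a hypothetical helper; check the API reference for the exact placement.

```python
def tag_request(body, customer_identifier, metadata=None):
    """Return a copy of a request body with attribution fields attached."""
    tagged = dict(body)
    # Assumption: these are sent as top-level fields of the request body.
    tagged["customer_identifier"] = customer_identifier
    if metadata:
        tagged["metadata"] = dict(metadata)
    return tagged


base = {"model": "gpt-5.5", "messages": [{"role": "user", "content": "Hello"}]}
tagged = tag_request(base, "customer_123", {"feature": "onboarding"})
print(tagged)
```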

Examples

Streaming

Stream tokens as they’re generated by passing stream=True.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in response:
    print(chunk)

Function calling

Let the model call your functions by passing a tools array.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
completion = client.chat.completions.create(
    model="gpt-5.5",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
print(completion)
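
The completion only tells you which function the model wants to call; running it is up to your code. The sketch below shows one way to dispatch tool calls, assuming the OpenAI SDK response shape (`message.tool_calls[i].function.name` and `.arguments`). The `get_current_weather` body, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical, and the fake message stands in for `completion.choices[0].message` so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical placeholder; a real version would query a weather service."""
    return {"location": location, "unit": unit, "temperature": "unknown"}


# Maps tool names declared in `tools` to the Python callables that back them.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(message):
    """Execute each requested tool and return `tool` role messages with results."""
    results = []
    for call in message.tool_calls or []:
        handler = TOOL_REGISTRY[call.function.name]
        arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(handler(**arguments)),
        })
    return results


# Stand-in shaped like completion.choices[0].message from the OpenAI SDK.
fake_message = SimpleNamespace(tool_calls=[SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location": "Boston, MA", "unit": "celsius"}',
    ),
)])
tool_messages = run_tool_calls(fake_message)
print(tool_messages)
```

Appending the assistant message and these tool messages to `messages`, then calling the model again, lets it compose a final answer from the results.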

Enable thinking

Have supported models return their reasoning before the final answer by passing a thinking config.

payload = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 10000
    },
    "messages": [
        {
            "role": "user",
            "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
        }
    ]
}

Choose a model that supports thinking, such as gpt-5 or claude-sonnet-4-5-20250929. See Log content types for details on the response structure.
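
A quick pre-flight check can catch a misconfigured budget before the request is sent. This sketch encodes Anthropic's documented constraints for Claude models (budget_tokens of at least 1024 and smaller than max_tokens); whether the gateway forwards the config unchanged or applies its own limits is an assumption, and `check_thinking_budget` is a hypothetical helper.

```python
MIN_THINKING_BUDGET = 1024  # Anthropic's documented minimum for budget_tokens


def check_thinking_budget(payload):
    """Raise ValueError if the thinking budget cannot fit inside max_tokens."""
    thinking = payload.get("thinking")
    if not thinking or thinking.get("type") != "enabled":
        return  # nothing to validate
    budget = thinking["budget_tokens"]
    if budget < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget >= payload["max_tokens"]:
        raise ValueError("budget_tokens must be smaller than max_tokens")


# Same values as the payload above; passes without raising.
check_thinking_budget({
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [],
})
```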

Upload PDF

Send PDFs as file content blocks for the model to read.

import os

import requests
from openai import OpenAI

# Upload the PDF with a direct OpenAI client (reads OPENAI_API_KEY from the environment).
openai_client = OpenAI()
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
pdf_response = requests.get(pdf_url, timeout=30)
pdf_response.raise_for_status()
# Pass a (filename, bytes) tuple so the upload carries a .pdf filename.
file = openai_client.files.create(
    file=("dummy.pdf", pdf_response.content),
    purpose="user_data",
)

# Send the chat request through the Respan gateway, referencing the uploaded file.
client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key=os.getenv("RESPAN_API_KEY"),
)

file_content = [
    {"type": "text", "text": "What's this file about?"},
    {
        "type": "file",
        "file": {
            "file_id": file.id,
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {
            "role": "user",
            "content": file_content,
        }
    ],
)
print(response.choices[0].message.content)

Upload image

Pass images using image_url content blocks or via prompt variables.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What do you see?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
    ]}],
)
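
For images that are not reachable by URL, a common pattern is to inline them as a base64 data URL. OpenAI's chat completions API accepts this form in image_url blocks; whether every model behind the gateway does is an assumption. `image_to_data_url` and `image_message` are hypothetical helpers.

```python
import base64


def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data URL usable in an image_url content block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt, image_bytes, mime_type="image/png"):
    """Build a user message pairing a text prompt with an inline image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_bytes, mime_type)}},
        ],
    }


# Placeholder bytes; in practice read them with open("image.png", "rb").read().
message = image_message("What do you see?", b"abc")
print(message["content"][1]["image_url"]["url"])  # data:image/png;base64,YWJj
```

Pass `[message]` as the `messages` argument of the call above.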