Reliability | Respan Docs

For the complete list of request parameters, see the API reference.

Fallback models

Respan catches errors that occur during a request and falls back to the models listed in the fallback_models field. This helps avoid downtime and keeps your application available.

Via UI

Go to Settings -> Fallback -> Click on Add fallback models -> Select the models you want to add as fallbacks.

You can drag and drop the models to reorder them. The order of the models in the list is the order in which they will be tried.
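
The same ordered list can also be sent per request through the fallback_models field. The sketch below only builds the request body as a Python dict. The helper name build_fallback_body and the model names are illustrative assumptions; only the fallback_models field name comes from the text above, and it is assumed that the request-level list is tried in order just like the UI list.

```python
import json

def build_fallback_body(model, messages, fallback_models):
    """Build a chat request body with an ordered fallback list.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    return {
        "model": model,
        "messages": messages,
        # Assumption: fallbacks are tried in the order listed here.
        "fallback_models": list(fallback_models),
    }

body = build_fallback_body(
    "gpt-4o",
    [{"role": "user", "content": "Hi, how are you?"}],
    ["claude-sonnet-4-20250514", "gemini-2.5-flash"],
)
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK, non-standard fields like this one are typically passed through extra_body rather than as top-level keyword arguments.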

Load balancing

Load balancing distributes request load across different deployments. You can assign a weight to each deployment based on its rate limit and your preference.

Load balancing between models

Go to the Load balancing page

Go to the Load balancing page and click Create new load balancer.

Add models

Click Add model to add each model, then set its weight and add your own credentials.

Copy group ID to your codebase

After you have added the models, copy the group ID (the blue text) to your codebase and use it in your requests.

The model parameter overrides load_balance_group, so leave model out of the request when you want the load balancer to choose.

{
    "messages": [
        {
            "role": "user",
            "content": "Hi, how are you?"
        }
    ],
    "load_balance_group": {
        "group_id": "THE_GROUP_ID"
    }
}

Add load balancing group in code (Optional)

You can also define the load balancing group directly in your request. The models field overrides the models configured in the UI for that load_balance_group.

Example code

{
    "load_balance_group": {
        "group_id": "THE_GROUP_ID",
        "models": [
            {
                "model": "azure/gpt-35-turbo",
                "weight": 1
            },
            {
                "model": "azure/gpt-4",
                "credentials": {
                    "api_base": "Your own Azure api_base",
                    "api_version": "Your own Azure api_version",
                    "api_key": "Your own Azure api_key"
                },
                "weight": 1
            }
        ]
    }
}
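
The text does not say how the weights are turned into routing decisions. A common interpretation, assumed here, is weighted random selection: each model is chosen with probability proportional to its weight. The sketch below simulates that locally; pick_model is a hypothetical helper and the weight of 3 is chosen only to make the split visible.

```python
import random
from collections import Counter

def pick_model(models, rng=random):
    """Pick one model name with probability proportional to its weight."""
    names = [m["model"] for m in models]
    weights = [m["weight"] for m in models]
    return rng.choices(names, weights=weights, k=1)[0]

models = [
    {"model": "azure/gpt-35-turbo", "weight": 1},
    {"model": "azure/gpt-4", "weight": 3},
]

rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_model(models, rng) for _ in range(10_000))
print(counts)  # roughly a 1:3 split between the two models
```

Under this assumption only the ratio between weights matters, so weights of 1 and 3 behave the same as 10 and 30.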

Load balancing between deployments

A deployment corresponds to one credential. If you add one OpenAI API key, you have one deployment; if you add two OpenAI API keys, you have two deployments.

You can go to the platform and add multiple deployments for the same provider, specifying load balancing weights for each deployment.

You can also load balance between deployments in your codebase using the customer_credentials field:

{
    "customer_credentials": [
        {
            "credentials": {
                "openai": {
                    "api_key": "YOUR_OPENAI_API_KEY"
                }
            },
            "weight": 1.0
        },
        {
            "credentials": {
                "openai": {
                    "api_key": "YOUR_OPENAI_API_KEY"
                }
            },
            "weight": 1.0
        }
    ]
}
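
JSON does not allow trailing commas, so building this payload in code and serializing it can be less error-prone than editing it by hand. A minimal sketch, assuming the customer_credentials shape shown above; build_customer_credentials is a hypothetical helper, and the key placeholders and weights are illustrative.

```python
import json

def build_customer_credentials(provider, keys_with_weights):
    """Return a customer_credentials list with one entry per deployment."""
    return [
        {"credentials": {provider: {"api_key": key}}, "weight": float(weight)}
        for key, weight in keys_with_weights
    ]

body = {
    "customer_credentials": build_customer_credentials(
        "openai",
        [
            ("YOUR_FIRST_OPENAI_API_KEY", 3),   # placeholder, higher weight
            ("YOUR_SECOND_OPENAI_API_KEY", 1),  # placeholder, lower weight
        ],
    )
}
print(json.dumps(body, indent=2))  # always emits strictly valid JSON
```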

Specify available models

You can specify the available models for load balancing. For example, if you only want to use gpt-3.5-turbo in an OpenAI deployment, specify it in the available_models field or do it in the platform.

Learn more about how to specify available models in the platform here.

{
    "customer_credentials": [
        {
            "credentials": {
                "openai": {
                    "api_key": "YOUR_OPENAI_API_KEY"
                }
            },
            "weight": 1.0,
            "available_models": ["gpt-3.5-turbo"],
            "exclude_models": ["gpt-4"]
        },
        {
            "credentials": {
                "openai": {
                    "api_key": "YOUR_OPENAI_API_KEY"
                }
            },
            "weight": 1.0
        }
    ]
}
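
One way to read the example above is as a per-deployment eligibility filter. The sketch below encodes that reading locally. The semantics are assumptions, not confirmed behavior: exclude_models always wins, and a missing available_models means the deployment is unrestricted. The name field is a hypothetical label added only to make the output readable.

```python
def deployment_can_serve(deployment, model):
    """Return True if this deployment may be used for `model`.

    Assumed semantics: exclude_models takes priority, and a missing or
    empty available_models list means "no restriction".
    """
    if model in deployment.get("exclude_models", []):
        return False
    allowed = deployment.get("available_models")
    return not allowed or model in allowed

deployments = [
    {
        "name": "key-a",
        "weight": 1.0,
        "available_models": ["gpt-3.5-turbo"],
        "exclude_models": ["gpt-4"],
    },
    {"name": "key-b", "weight": 1.0},
]

def eligible(model):
    """Names of the deployments that could serve `model`."""
    return [d["name"] for d in deployments if deployment_can_serve(d, model)]

print(eligible("gpt-3.5-turbo"))  # ['key-a', 'key-b']
print(eligible("gpt-4"))          # ['key-b']
```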

Retries

When an LLM call fails, the system detects the error and retries the request to prevent failovers.

Via UI

Go to the Retries page, enable retries, and set the number of retries and the initial retry time.

Automatic retry logic

Respan will automatically retry failed requests if the failure is a rate limit issue from the upstream provider:

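# Illustrative pseudocode: respan_models_data, respan_response_with_load_balance,
# and RateLimitError stand for Respan internals and are not defined here.
from time import sleep

def respond_with_retries(model, fallback_retries):
    # model is the user-requested model
    model_params = respan_models_data[model]
    # Exponential backoff retry logic
    for i in range(fallback_retries):
        try:
            return respan_response_with_load_balance(model)
        except RateLimitError:
            # Try the configured fallback models in order
            for fallback_model in model_params["fallback_models"] or []:
                try:
                    return respan_response_with_load_balance(fallback_model)
                except RateLimitError:
                    continue  # this fallback is rate limited too; try the next
            # Nothing succeeded: wait 1s, 2s, 4s, ... before the next attempt
            sleep(2 ** i)
    # Any other exception propagates to the caller unchanged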
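
The sleep(2 ** i) call in the logic above doubles the wait after each failed attempt. A small sketch of the resulting schedule; backoff_delays is a hypothetical helper, and the retry count of 4 is only an example value.

```python
def backoff_delays(retries, base=2):
    """Seconds waited after each failed attempt: base**0, base**1, ..."""
    return [base ** i for i in range(retries)]

delays = backoff_delays(4)
print(delays)       # [1, 2, 4, 8]
print(sum(delays))  # 15
```

The total wait grows quickly with the retry count, which is worth keeping in mind when choosing the number of retries against a request timeout.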

Inline router (`models`)

Pass an inline list of candidate models per request and let the LLM router pick one. This is the request-time alternative to a pre-configured load_balance_group.

# client is an OpenAI Python SDK client; Respan-specific fields go in extra_body
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "models": ["gpt-4o", "claude-sonnet-4-20250514", "gemini-2.5-flash"],
        "exclude_providers": ["azure"],
    },
)

Field	Type	Description
`models`	array	Candidate model names. Router selects one per request.
`exclude_models`	array	Models to exclude from selection.
`exclude_providers`	array	Providers to exclude from selection.

The selected model is recorded on the log under model.

Per-model credential override (`credential_override`)

Override credentials for a specific model on a single request. This is more granular than customer_credentials, which applies per provider, and is useful when one model in a fallback chain needs different credentials than the rest:

response = client.chat.completions.create(
    model="azure/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "credential_override": {
            "azure/gpt-4o": {
                "api_key": "...",
                "api_base": "https://your-azure-resource.openai.azure.com/",
                "api_version": "2024-08-01-preview",
            },
        },
    },
)

The key is the full model slug (azure/gpt-4o, vertex_ai/claude-sonnet-4-5@20250929, etc.). Each fallback attempt resolves credentials per-model.
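
Per-model resolution can be pictured as a lookup keyed by the model slug. The sketch below is a local illustration only, under the assumption that an exact slug match selects the override and every other model keeps its usual credentials. resolve_credentials and default_credentials are hypothetical names, not part of the API.

```python
def resolve_credentials(model, credential_override, default_credentials):
    """Pick the credentials for one attempt in a fallback chain."""
    # Assumption: only an exact match on the full model slug applies.
    return credential_override.get(model, default_credentials)

default_credentials = {"source": "account default"}  # placeholder value
credential_override = {
    "azure/gpt-4o": {
        "api_key": "...",
        "api_base": "https://your-azure-resource.openai.azure.com/",
        "api_version": "2024-08-01-preview",
    },
}

chain = ["azure/gpt-4o", "gpt-4o"]  # primary model, then one fallback
for model in chain:
    creds = resolve_credentials(model, credential_override, default_credentials)
    source = "default" if creds is default_credentials else "override"
    print(model, "->", source)
```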
