In section 1 you saw a direct call from your application to OpenAI. That works for a prototype. For anything you ship to real users, every LLM call should go through a gateway.
What is an LLM gateway
A gateway is a single URL that sits between your application and every LLM provider. Instead of your code calling OpenAI directly, your code calls the gateway, and the gateway calls OpenAI on your behalf.
Your app → Gateway → OpenAI / Anthropic / Gemini / 250+ models
               ↓
     logs, cache, fallback, redaction
Why this is worth doing:
- One place to log every call. Every prompt, response, model used, latency, and token count is recorded in one searchable list, not spread across provider dashboards.
- One place to switch models. When the next frontier model ships, you change one string in your config and every call picks up the new model.
- Fallbacks: if OpenAI is down, the gateway automatically tries Anthropic. If Claude rate-limits you, the gateway routes to Gemini. Your customers see no outage.
- Caching: if the same question (or same context window) was asked recently, the gateway returns the previous response instead of paying for a new generation. Latency goes from seconds to milliseconds; cost drops.
- Per-customer cost caps: you can hard-cap how many tokens any single customer can consume per day, at the gateway, not in every call site in your code (the sketch after this list shows what the by-hand version looks like).
- PII redaction: customer data can be redacted before it ever reaches the model provider. Important for HIPAA, SOC 2, and similar compliance regimes.
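To make the cost-cap point concrete, here is a minimal sketch of the by-hand version you would otherwise write without a gateway: an in-memory counter checked at every call site. The limit, the counter store, and the helper name are illustrative assumptions, not part of any gateway API.

from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # direct OpenAI client, just for this illustration

DAILY_TOKEN_LIMIT = 50_000            # illustrative per-customer budget
tokens_used_today = defaultdict(int)  # customer_id -> tokens (reset daily in real code)

def capped_completion(customer_id: str, messages: list[dict]) -> str:
    # Every call site has to remember to run this check itself.
    if tokens_used_today[customer_id] >= DAILY_TOKEN_LIMIT:
        raise RuntimeError(f"{customer_id} has exhausted today's token budget")
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    tokens_used_today[customer_id] += response.usage.total_tokens
    return response.choices[0].message.content

A gateway-level cap removes this bookkeeping from your code entirely and makes it impossible to forget at a new call site.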
The gateway is one of the highest-leverage architectural decisions in this whole guide. Most production teams add it on day one.
A gateway implementation
Respan's gateway is a drop-in replacement for the OpenAI client. Two lines change:
from openai import OpenAI

# Point the client at the Respan gateway instead of api.openai.com.
client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

What changed: base_url now points at the Respan gateway, and api_key is your Respan API key (Respan forwards to OpenAI using a provider key you configured separately on the Providers page).
What you got: every call gets logged automatically. You can see every request and response in the dashboard.
Switching providers
Same client, different model name. The gateway routes by model:
# OpenAI
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Anthropic
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Gemini
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Hello!"}],
)

When a new model ships, you change one string. No SDK swap, no refactor.
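Because the model is just a string, a common pattern is to read it from config or an environment variable so a model swap never touches the call sites. The variable name below is an assumption for illustration, not anything Respan requires:

import os

# client is the gateway-pointed OpenAI client from above.
DEFAULT_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")  # hypothetical config knob

response = client.chat.completions.create(
    model=DEFAULT_MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
)

Redeploying with LLM_MODEL=claude-sonnet-4-5 moves every routed call to Anthropic without touching the code.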
Caching
Per-call cache control lets you mark which calls are safe to serve from cache:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me about your return policy"}],
    extra_body={
        "cache_enabled": True,
        "cache_ttl": 600,
        "cache_options": {"cache_by_customer": True},
    },
)

If the same prompt was asked in the last 600 seconds, the gateway returns the cached response. cache_by_customer: True ensures the cache is keyed per customer, so customer A's response never leaks to customer B even if they happened to ask the same question.
For an FAQ-shaped chatbot, caching alone often cuts cost 30-50%.
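If only some calls are cache-safe, a thin wrapper keeps the opt-in in one place. The helper below is an illustrative sketch built from the same extra_body fields shown above, not a Respan API:

def faq_completion(question: str, ttl: int = 600) -> str:
    # Illustrative wrapper: FAQ-style questions go through the cache,
    # while personalised or time-sensitive calls keep using the client directly.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        extra_body={
            "cache_enabled": True,
            "cache_ttl": ttl,
            "cache_options": {"cache_by_customer": True},
        },
    )
    return response.choices[0].message.content

print(faq_completion("Tell me about your return policy"))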
Fallbacks and redaction
These typically live in your gateway configuration (not in every call site). On Respan, you configure them in the provider settings and they apply to every call. The point of putting them at the gateway is that they are uniform across your application: you cannot accidentally bypass them by calling OpenAI directly.
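For a sense of what that configuration replaces, the gateway's fallback behaviour is roughly a loop like the one below wrapped around every call, except it runs inside the gateway and is configured once. The model order and error classes here are illustrative assumptions, not Respan's implementation:

import openai

FALLBACK_MODELS = ["gpt-4o-mini", "claude-sonnet-4-5", "gemini-2.5-flash"]  # illustrative order

def complete_with_fallback(messages: list[dict]) -> str:
    last_error = None
    for model in FALLBACK_MODELS:
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except (openai.RateLimitError, openai.APIConnectionError, openai.APIStatusError) as exc:
            last_error = exc  # provider down or rate-limited: move to the next model
    raise RuntimeError("All providers failed") from last_error

Doing this by hand works, but it has to be repeated and kept consistent at every call site; gateway-level fallbacks are configured once and apply everywhere.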
What about latency
A gateway adds 50-150ms to every call (network hop + processing). For most products that is invisible. For real-time voice agents where every millisecond matters, you may want to direct-call OpenAI for the latency-critical path and route everything else through the gateway. Most teams accept the latency.
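If you do split traffic that way, the split can be as simple as two clients chosen per code path; the names and the selection function below are illustrative:

from openai import OpenAI

# Direct client for the latency-critical path: no gateway hop, but also no logging,
# caching, fallbacks, or cost caps.
direct_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Gateway client for everything else.
gateway_client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

def pick_client(latency_critical: bool) -> OpenAI:
    return direct_client if latency_critical else gateway_client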
What you have at the end of section 2
- Every LLM call routes through a single URL with logging, fallbacks, caching, and cost caps.
- Switching models or providers is a one-string change.
- You have an audit trail per call without writing any logging code.
Next: prompts as artifacts
The next section, Designing and managing prompts, covers the next failure mode: prompts in source files. They look fine until you have eight of them and a non-engineer asks why the agent's tone changed last Tuesday.
Or back to the Chapter 1 hub.
