OpenAI API bills compound faster than any other line item in a modern AI product. A retrieval pipeline that costs $400/month in week one regularly hits $40k/month by quarter four. The token volume doesn't grow linearly with users; it grows with prompt length, retry rate, agent recursion depth, and how aggressively features ship.
Good news: the levers for cutting cost in 2026 are well understood, they mostly stack, and they're mostly free to enable. This post is the practical playbook. Ten tactics, ordered by impact-to-effort, with code where it matters.
If you take one thing from this post: the biggest cost wins come from a) prompt caching, b) model right-sizing, and c) putting a gateway in front of OpenAI so you can measure and route. Everything else compounds on top.
TL;DR
- Turn on prompt caching. Identical prompt prefixes get billed at roughly 10% of the input rate. Biggest single lever for chat workloads.
- Use the Batch API. Async workloads pay roughly 50% of the live rate.
- Right-size the model. gpt-5.4-nano handles a surprising share of production traffic at a fraction of gpt-5.5 cost.
- Cap output tokens. Set max_tokens. Always.
- Minify prompts. Strip duplicate instructions, redundant whitespace, and noise from your system prompt.
- Semantic cache at the gateway. Repeat user questions don't need a fresh model call.
- Use structured outputs. Schema-bound responses cut retry rates from 5 to 10% down to under 1%.
- Stream selectively. Streaming is great UX, but cancel early when the user stops reading.
- Fine-tune only when the math says so. gpt-5.4-nano is usually cheaper than fine-tuning unless volumes are huge.
- Alert on cost spikes per feature. What you can't attribute, you can't control.
Tactic 1: Turn on prompt caching
OpenAI's prompt caching automatically discounts the cached portion of an input prompt. The discount is steep: cached input tokens are billed at roughly 10% of the regular input rate (a 90% reduction on the cached portion). It applies automatically when a prompt prefix exceeds the minimum length and the prefix has been seen recently.
The catch: caching is prefix-based. Your prompt has to be byte-for-byte identical from position zero up to the cached length. That means:
- Put your stable system prompt at the very top.
- Put dynamic content (user message, retrieved context) at the bottom.
- Avoid timestamps, randomized IDs, or anything that perturbs the prefix.
```python
# Wrong: dynamic content first ruins caching
messages = [
    {"role": "system", "content": f"Current time: {now}\n{stable_instructions}"},
    {"role": "user", "content": user_input},
]

# Right: stable prefix, then dynamic content
messages = [
    {"role": "system", "content": stable_instructions},
    {"role": "system", "content": f"Current time: {now}"},
    {"role": "user", "content": user_input},
]
```

For long stable prompts (RAG with a static knowledge base, agentic system prompts with tool descriptions), prompt caching alone cuts chat-completions cost by 50 to 70%.
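To confirm the cache is actually being hit, check the usage object on each response; in the current OpenAI Python SDK the cached count is reported under prompt_tokens_details (adjust the field access if the usage schema changes):

```python
response = client.chat.completions.create(model="gpt-5.4-mini", messages=messages)

details = response.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # 0 means no prefix was cached
print(f"{cached}/{response.usage.prompt_tokens} input tokens billed at the cached rate")
```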
Tactic 2: Use the Batch API for async workloads
Anything that doesn't need a sub-second response should run on the Batch API. The discount is roughly 50% off the live rate, and the SLA is "complete within 24 hours" (usually much faster in practice). Fits: nightly evals, bulk classification, embedding backfills, batch content generation.
```python
import json

# Build a JSONL file of requests and submit:
with open("requests.jsonl", "w") as f:
    for item in items:
        f.write(json.dumps({
            "custom_id": item["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-5.4-mini", "messages": [...], "max_tokens": 500},
        }) + "\n")

file_obj = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=file_obj.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```

Most teams discover after their first eval cycle that 30 to 40% of their token spend was on workloads that could have been batch jobs. Cutting that in half is a 15 to 20% bill reduction with one code change.
Tactic 3: Right-size the model
This is the largest cost lever after caching. The default instinct is to ship on the flagship (gpt-5.5 in 2026), but most production traffic doesn't need flagship capability.
The 2026 model ladder, cheapest to most expensive:
- gpt-5.4-nano: routing, classification, simple Q&A, structured extraction. Often single-digit cents per million input tokens.
- gpt-5.4-mini: general chat, RAG with quality requirements, light reasoning.
- gpt-5.4: harder reasoning, agentic workflows, code generation.
- gpt-5.5: the hardest reasoning tasks, complex tool use, anything where quality justifies the cost premium.
The rule of thumb: start every new feature on gpt-5.4-nano. Measure quality with evals. Step up only when evals demonstrate the cheaper model is insufficient. This inverts the usual flow (start at flagship, downgrade later) and routinely produces 10x cost differences.
A simple router pattern:
```python
def select_model(query):
    if is_simple_classification(query):
        return "gpt-5.4-nano"
    if is_general_chat(query):
        return "gpt-5.4-mini"
    if requires_reasoning(query):
        return "gpt-5.4"
    return "gpt-5.5"  # default for hard cases
```

Push this routing into your gateway and you can change ratios via config instead of deploys. See LLM Gateway: The Complete Guide.
Tactic 4: Cap output tokens
max_tokens is the most underused parameter in the OpenAI SDK. Without it:
- Verbose models ramble.
- Buggy prompts produce 4,000-token responses where 200 would do.
- Cost is unbounded on a per-request basis.
Set max_tokens on every call. Pick a value 2x your expected output length, not 10x. For most classification, summarization, and structured-output tasks, that's 100 to 500 tokens.
```python
client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=messages,
    max_tokens=400,  # always set this
    temperature=0.3,
)
```

Bonus: Azure PTU deployments use max_tokens to estimate utilization. Tight values let more concurrent requests through the same PTU allocation.
Tactic 5: Minify prompts
System prompts grow like git histories: every bug fix adds three lines, nobody removes any. By month nine the prompt is 3,000 tokens of overlapping instructions and the model is paying less attention to all of them.
A quarterly prompt audit: generate a representative test set (200 to 500 real inputs), score the current prompt with your eval harness, cut the prompt by 50%, re-score, and keep the cut if quality holds. Most teams cut by 40 to 60% on first audit with no quality regression. That's a permanent 40 to 60% reduction on the uncached portion of every call. Apply the same logic to RAG context: rerank, truncate to top-k, accept that the model attends best to the start and end of the context window anyway.
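A minimal sketch of that audit loop; eval_score() is a placeholder for your own eval harness (it should return a mean quality score over the test set), and count_tokens() is a rough stand-in for your tokenizer:

```python
def count_tokens(text: str) -> int:
    # Rough proxy; swap in your real tokenizer (e.g. tiktoken) for exact counts.
    return len(text.split())

def audit_prompt(current_prompt, minified_prompt, test_set, quality_floor=0.98):
    baseline = eval_score(current_prompt, test_set)    # placeholder eval harness
    candidate = eval_score(minified_prompt, test_set)

    saved = 1 - count_tokens(minified_prompt) / count_tokens(current_prompt)
    print(f"baseline={baseline:.3f}  candidate={candidate:.3f}  tokens saved={saved:.0%}")

    # Keep the cut only if quality holds within the agreed floor.
    return minified_prompt if candidate >= baseline * quality_floor else current_prompt
```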
Tactic 6: Semantic cache at the gateway
Prompt caching handles identical prefixes. Semantic caching handles near-duplicate queries.
Pattern: when a request comes in, embed the user message, search a small vector store of recent queries, and if the cosine similarity is above a threshold, return the cached response. No model call. Latency drops to single-digit ms, cost goes to zero on the hit.
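Here's a minimal sketch of that lookup using an in-memory cache and OpenAI embeddings; the helper names and data structures are illustrative, not any particular gateway's implementation:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response text)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def semantic_lookup(user_message: str, threshold: float = 0.95) -> str | None:
    q = embed(user_message)
    for vec, cached_response in cache:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:
            return cached_response  # cache hit: no model call
    return None

def remember(user_message: str, response_text: str) -> None:
    cache.append((embed(user_message), response_text))
```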
This pattern lives at the gateway layer, not the application layer. Reasons:
- Hit rate is highest when you cache across all features, not per-feature.
- The cache becomes part of the observability story (every cache hit is a logged event).
- Stale-cache management is a platform problem, not an app problem.
```python
import os
from openai import OpenAI

# Client code stays clean. Semantic caching is gateway config.
client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)

# Headers control cache behavior per request
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=messages,
    extra_headers={
        "x-respan-cache": "semantic",
        "x-respan-cache-threshold": "0.95",
    },
)
```

Warning: semantic caching can return stale or wrong answers if the similarity threshold is too loose. Start at 0.95+ cosine similarity, log every cache hit, and review false-positive cases weekly. Don't enable it on conversational threads where context matters.
For more on the gateway layer, see Best LLM Gateways in 2026.
Tactic 7: Use structured outputs to kill retries
JSON parse failures are pure waste. The model produces 800 tokens of "Sure, here's the JSON..." prose, you try to parse it, it fails, you retry, you pay again.
Structured outputs (the response_format parameter with a Pydantic schema or JSON Schema) eliminates this. The model is constrained at the sampler level to produce schema-conforming output. Retry rates on the affected workloads typically drop from 5 to 10% to under 1%.
```python
from pydantic import BaseModel

class Classification(BaseModel):
    category: str
    confidence: float
    reasoning: str

response = client.beta.chat.completions.parse(
    model="gpt-5.4-mini",
    messages=messages,
    response_format=Classification,
    max_tokens=300,
)
result = response.choices[0].message.parsed
```

Two compounding wins: fewer retries (less spend), and no schema-validation logic to maintain in your app.
Tactic 8: Stream selectively
Streaming is great chat UX. It's expensive when the user closes the tab. Default behavior in most SDKs: even if the client disconnects, the model finishes generating. You pay for tokens nobody reads.
Fix: track client connection state on the server, and cancel the upstream streaming request when the client disconnects. For long generations, also set finite max_tokens so worst-case cost is bounded.
```python
async def stream_to_client(request):
    stream = await openai_client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=request.messages,
        max_tokens=2000,
        stream=True,
    )
    try:
        async for chunk in stream:
            # is_disconnected() is the Starlette/FastAPI request API
            if await request.is_disconnected():
                return  # stop generating; the finally block closes the upstream stream
            yield chunk
    finally:
        await stream.close()
```

On long-form generation, this saves 10 to 30% of output token cost.
Tactic 9: Fine-tune only when the math says so
Fine-tuning got cheaper in 2026, but it's not automatically the cheapest path. Inference on a fine-tuned model typically costs 50 to 100% more per token than the base, forever. If gpt-5.4-nano with good prompts hits your quality bar, fine-tuning is more expensive on inference and on team time. Fine-tuning nano usually wins only when the alternative is gpt-5.5 and volume is high (above ~50M tokens/month on the workload).
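A back-of-the-envelope check makes the trade-off concrete. All prices and volumes below are placeholders for the arithmetic, not real rates; plug in the current numbers from OpenAI's pricing page:

```python
# Placeholder prices in $ per million tokens; check OpenAI's pricing page for real rates.
PRICES = {
    "gpt-5.5":         {"in": 10.00, "out": 30.00},
    "gpt-5.4-nano":    {"in": 0.10,  "out": 0.40},
    "fine-tuned nano": {"in": 0.20,  "out": 0.80},   # assume ~2x base nano per token
}

volume_in, volume_out = 60, 15   # hypothetical millions of tokens per month
ft_overhead = 500                # amortized training + eval + maintenance, $/month

for name, p in PRICES.items():
    cost = volume_in * p["in"] + volume_out * p["out"]
    if name == "fine-tuned nano":
        cost += ft_overhead
    print(f"{name:<16} ${cost:,.0f}/month")

# If base nano hits the quality bar, it wins outright. Fine-tuned nano only beats
# gpt-5.5 by enough to cover the overhead once volume is high.
```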
Always try few-shot prompting and prompt tuning before fine-tuning. See how to evaluate an LLM for the eval methodology.
Tactic 10: Alert on cost spikes per feature
The single biggest cost incident pattern: one feature, one bug, one all-night runaway. An agent loops on itself, a retry policy has no exponential backoff, an internal user accidentally pastes a 200k-token document into a chat box.
You don't catch these from the OpenAI dashboard (the granularity is too coarse and the data arrives too late). You catch them at the gateway / observability layer.
Minimum viable cost alerting:
- Tag every API call with feature, user_tier, environment.
- Track tokens-per-minute and dollars-per-hour per tag.
- Alert when any tag exceeds 3x its 7-day rolling average (a minimal check is sketched below).
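A minimal sketch of that last check, assuming your gateway or observability layer exposes hourly spend per tag (only the threshold logic is shown; the data source and paging hook are up to you):

```python
from statistics import mean

def spend_spike(tag: str, hourly_spend: list[float], multiplier: float = 3.0) -> bool:
    # hourly_spend: dollars per hour for this tag over the last 7 days, most recent last.
    baseline = mean(hourly_spend[:-1])  # rolling average, excluding the current hour
    current = hourly_spend[-1]
    if current > multiplier * baseline:
        # Wire this into whatever pages you: Slack webhook, PagerDuty, etc.
        print(f"COST ALERT {tag}: ${current:.2f}/hr vs ${baseline:.2f}/hr 7-day average")
        return True
    return False
```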
This belongs in your LLM observability layer. See LLM Observability: The Complete Guide and Best LLM Observability Tools.
Stacking the tactics
These compound, not add. A realistic stack: prompt caching (50% off uncached input), Batch API on offline workloads (50% off the batch-eligible slice), right-sizing 60% of traffic to gpt-5.4-nano (80%+ savings on that slice), semantic cache at the gateway (10 to 15% absolute reduction), output token caps (10 to 20% off output spend), stream cancellation (5 to 10% off long-form output).
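To see why these multiply rather than add, here is an illustrative walk-through with assumed traffic splits and per-tactic savings (the figures are examples for the arithmetic, not benchmarks):

```python
# Assumed baseline: everything on the flagship, no caching.
baseline = 10_000  # $/month

# Each entry is the fraction of the *remaining* bill that the tactic removes.
# The splits (60% of traffic to nano, ~35% batch-eligible, etc.) are assumptions.
reductions = [
    ("right-size 60% of traffic to nano (~80% cheaper)", 0.60 * 0.80),
    ("prompt caching on chat workloads",                 0.35),
    ("Batch API on the offline slice (~35% at 50% off)", 0.35 * 0.50),
    ("semantic cache hits",                              0.12),
    ("output caps + stream cancellation",                0.10),
]

bill = baseline
for tactic, r in reductions:
    bill *= 1 - r
    print(f"{tactic:<52} ${bill:>8,.0f}/month")

print(f"total reduction: {1 - bill / baseline:.0%}")  # lands around 75-80% here
```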
A team starting from "everything on gpt-5.5 with no caching" can realistically hit 70 to 85% cost reduction in a quarter without quality regression. The hard part isn't the tactics; it's measuring per-feature cost well enough to know which lever to pull when.
FAQ
Does prompt caching require a special API parameter?
No. It applies automatically when your prompt prefix exceeds the minimum cached length (typically around 1024 tokens) and has been seen recently. You'll see cached_tokens in the usage object on responses.
How long does the prompt cache persist?
A short window (minutes, not hours), refreshing on use. Fine for stable high-traffic prompts; less effective on low-traffic features.
Is the Batch API really 50% off?
The discount has been roughly 50% off live pricing through 2025 and 2026. Confirm the current rate on OpenAI's pricing page before sizing a workload.
How do I pick between gpt-5.4-nano and gpt-5.4-mini?
Run an eval. Score both against your quality bar on a representative test set, compare cost. If nano passes, use nano.
Does semantic caching work for chat conversations?
Carefully. Multi-turn context breaks naive semantic caching because turn N depends on turn N-1. Either cache only the first turn or use full-conversation embeddings as the cache key. Default off for chat.
Should I build my own gateway for these features?
For most teams, no. Off-the-shelf gateways (Respan, OpenRouter, Portkey, LiteLLM) handle caching, fallback, cost attribution, and routing.