Almost every v0 product you build should call a frontier API. Self-hosting feels like the cheaper, more controllable path, but the spreadsheet that produces that conclusion almost never includes the engineering time required to keep a GPU healthy. By the time you add it back, GPT-4o, Claude Sonnet, or Gemini wins on total cost of ownership for the vast majority of teams.
This chapter is about knowing when that default flips, and what the middle path looks like.
Frontier vs local: the two sides
Frontier APIs are the hosted models from OpenAI, Anthropic, Google, and Mistral. You send tokens, you get tokens back, you pay per million. There is no GPU to manage, no quantization to tune, no driver to update.
Local or self-hosted means you run the model yourself. That can mean a literal GPU rack on-prem, or it can mean renting GPUs from Together AI, Fireworks, Modal, or RunPod. Either way, you choose the model weights (Llama 3.3, Mistral Large, Qwen, DeepSeek), and you are responsible for serving them.
The line between them is real but narrower than it used to be. Managed open-source providers blur it. We will get to that.
The math people get wrong
The sentence that always starts the wrong analysis is "Llama 3.3 70B is free, so it has to be cheaper." Free weights, real costs.
A real self-hosted deployment includes:
- GPU costs. An H100 runs $1.80 to $4 per hour depending on cloud and commitment. An H200 is higher. Two of them sitting idle overnight is real money.
- Engineering time to keep it running. Inference servers (vLLM, TensorRT-LLM, SGLang) need attention. Someone owns it.
- Auto-scaling, queuing, and retries. Frontier APIs absorb burst traffic for you. Your own GPU does not.
- Observability. Latency, throughput, cache hit rate, token-per-second. You build or buy this.
- Maintenance. OS updates, CUDA driver updates, framework updates, weight updates when a better model drops next month.
- Opportunity cost. The hours your team spends on infra are hours not spent on the product.
The honest comparison weighs the frontier per-token bill against GPU rental plus the fully loaded cost of at least one engineer-week per month. For most teams that pushes the break-even point much further out than it looks on paper.
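To make that concrete, here is a back-of-envelope version of the comparison. Every number below is an illustrative assumption, not a quote; the point is the shape of the math, so substitute your real bills.

```python
# Back-of-envelope TCO: frontier API vs self-hosted.
# Every number here is an illustrative assumption -- plug in real bills.

HOURS_PER_MONTH = 730

# Self-hosted fixed costs (assumed)
gpu_hourly = 2.50        # $/hr per H100, mid-range on-demand
num_gpus = 2
engineer_week = 4_000    # fully loaded cost of one engineer-week per month

self_hosted_monthly = gpu_hourly * num_gpus * HOURS_PER_MONTH + engineer_week

# Frontier side: assumed blended price for a mid-tier model
frontier_per_million = 2.00  # $/M tokens

breakeven_millions = self_hosted_monthly / frontier_per_million
print(f"self-hosted fixed cost: ${self_hosted_monthly:,.0f}/month")
print(f"break-even: {breakeven_millions:,.0f}M tokens/month "
      f"(~{breakeven_millions / 30:,.0f}M/day)")
```

Under these assumptions the self-hosted side costs about $7,650 a month before serving a single token, which puts break-even near 128M tokens per day. Move any input and the answer moves, but rarely in self-hosting's favor.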
The three situations where local actually wins
Skip the philosophy. There are three concrete reasons to self-host.
- Compliance or on-prem requirement. HIPAA workflows handling PHI without a BAA path, classified data, contractual on-prem clauses, certain EU data residency cases. The frontier API is not on the table at all. Self-hosting is not a choice, it is the requirement.
- Cost at extreme scale. You are pushing tens of millions of tokens per day, and the workload is narrow enough that a small fine-tuned model handles it. The math starts working seriously around 50M+ daily input tokens, and only when you can keep GPU utilization high. Below that, frontier wins.
- Latency under 200ms. Your GPU sits in the same datacenter as your app. Network round-trip to OpenAI plus their queue cannot match a colocated 7B model that fits in one H100. Voice agents, real-time gaming NPCs, and high-frequency trading-style use cases live here.
If none of these three describe your product, use a frontier API.
The frontier defaults
For everything else, here is a sane default lookup.
| Use case | Pick |
|---|---|
| Default for everything | GPT-4o or Claude Sonnet 4.5 |
| Cost-sensitive, fast | GPT-4o-mini, Claude Haiku, Gemini Flash |
| Long-context, high-quality reasoning | Claude Opus 4.5 |
| Open-source equivalent (managed) | Llama 3.3 70B via Together or Fireworks |
The table will shift as the leaderboard moves, roughly every six months. The point is to start with one pick and only revisit it when something measurable forces the change.
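One lightweight way to enforce that discipline is to keep the pick in a single config object instead of scattering model names through the codebase. A sketch, using shorthand names from the table rather than exact provider model IDs:

```python
# The defaults table as a config. Names are shorthand, not exact API model
# IDs -- map them to real identifiers in your gateway or client layer.
DEFAULT_MODELS = {
    "default":      "gpt-4o",          # or claude-sonnet-4.5
    "cheap_fast":   "gpt-4o-mini",     # or claude-haiku, gemini-flash
    "long_context": "claude-opus-4.5",
    "open_managed": "llama-3.3-70b",   # via Together or Fireworks
}

def pick_model(use_case: str = "default") -> str:
    # Changing a default is a one-line diff here, not a codebase-wide hunt.
    return DEFAULT_MODELS.get(use_case, DEFAULT_MODELS["default"])
```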
The middle path: managed open-source
Together AI, Fireworks, Modal, Anyscale, and Baseten will serve open-source models behind an OpenAI-compatible API. You get Llama 3.3 70B for roughly $0.50 per million input tokens versus $5 to $15 per million for frontier-quality closed models, without owning a single GPU.
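Because the API is OpenAI-compatible, switching usually amounts to pointing the official SDK at a different base URL. A minimal sketch, assuming Together AI's documented endpoint and model naming; verify the exact values against your provider's current docs:

```python
import os
from openai import OpenAI

# Same SDK, different base_url. Endpoint and model id follow Together AI's
# published scheme, but treat them as assumptions and check current docs.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(resp.choices[0].message.content)
```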
This is the right answer surprisingly often:
- You want to fine-tune on your data and own the weights.
- You want lower per-token cost without the operational burden.
- You want the option to bring it on-prem later without rewriting your app.
The tradeoff is that managed open-source still has gaps versus the frontier on the hardest reasoning tasks. Test before you commit.
The 90/10 routing pattern
Most production stacks end up here: 90% of traffic goes to a frontier API for quality, and 10% goes to a self-hosted or managed open-source model for cost-sensitive or latency-critical paths. Bulk classification, embeddings, content moderation, summarization of long documents, anything where the marginal token quality does not matter much.
The LLM gateway from Chapter 1.2 is what makes this transparent. Your application code calls one endpoint, and routing rules pick the model. When the cost or latency picture changes, you change the rule and not the code.
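In code, the rule can be as small as a task-to-model lookup sitting in front of that single endpoint. A sketch with hypothetical task names; real gateways such as LiteLLM express the same idea in config:

```python
# Cost-insensitive paths (the ~10%) go to the cheap model; everything else
# (the ~90%) stays on the frontier default. Task names are made up.
CHEAP_TASKS = {"classification", "moderation", "bulk_summarization"}

def route(task: str) -> str:
    if task in CHEAP_TASKS:
        return "llama-3.3-70b"  # managed open-source, ~$0.50/M input
    return "gpt-4o"             # frontier default

# Application code calls the gateway with route(task) and never changes;
# when the cost or latency picture moves, only this rule does.
```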
When it is worth running your own GPU
For the small set of teams that genuinely should self-host on their own hardware, here is the checklist:
- You have ML or infra engineers who want to run GPUs and have done it before.
- Your load is steady and predictable. Bursty traffic crushes self-hosted economics.
- You can keep utilization above 60%. Below that, you are paying for idle silicon (see the sketch after this list).
- You have measured the actual frontier cost, not estimated it. Real bills, not back-of-envelope.
Three out of four is not enough. All four.
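The utilization point deserves numbers. Effective per-token cost scales inversely with how busy the GPUs are, so idle capacity quietly erases the savings. Another illustrative sketch, reusing the assumed GPU bill from the earlier break-even example:

```python
# Effective $/M tokens at different utilization levels. Throughput is an
# assumed aggregate figure, not a benchmark.
gpu_monthly = 3_650            # 2x H100 at $2.50/hr, 730 hrs/month
tokens_per_sec_full = 2_500    # assumed throughput at 100% busy

def cost_per_million(utilization: float) -> float:
    tokens = tokens_per_sec_full * utilization * 730 * 3600
    return gpu_monthly / (tokens / 1e6)

for u in (0.9, 0.6, 0.3):
    print(f"{u:.0%} utilization -> ${cost_per_million(u):.2f}/M tokens")
# Halving utilization doubles per-token cost; at 30% busy you are closing
# in on the assumed frontier blended price, with none of the convenience.
```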
Privacy and compliance, for real
The most common bad reason to self-host is "we handle sensitive data." Most healthcare and financial AI in production today uses frontier APIs under enterprise agreements: Azure OpenAI under a BAA, Anthropic via AWS Bedrock under a BAA, Google Vertex with HIPAA coverage. Those paths are real and well-documented.
Going on-prem purely for the feeling of privacy, with no specific legal or contractual driver, is over-engineering. It buys you a quarter of lost velocity in exchange for a checkbox you did not actually need.
If your legal team says on-prem, on-prem. If your gut says on-prem, ask legal.
Wrap up
Default to frontier. Reach for managed open-source when cost or fine-tuning matters. Reach for true self-hosting when compliance, extreme scale, or sub-200ms latency forces it. Use a gateway so you can change your mind without rewriting.
Next, Chapter 2.7: The default minimum AI stack puts these picks together into one starter architecture.
Related: LLM gateway basics shows how routing across providers works in practice. How to choose the right AI stack walks through the decision framework end to end.
