Most teams over-engineer at v0 because every blog post and tweet is selling them more complexity. Vector database vendors want you to pick a vector database. Agent framework maintainers want you to pick an agent framework. Fine-tuning platforms want you to fine-tune. Each one is reasonable in isolation. Stacked together at v0, they bury the actual product under three layers of plumbing that you cannot debug.
This page is the opposite. It is the minimum stack that handles roughly 95% of v0 use cases up to about 10,000 requests per day. If you ship this and instrument it, you will have a working product, traces to learn from, and a clear signal for what to add next.
The stack
LLM. A frontier API. Use GPT-4o-mini or Claude Haiku when cost matters. Use GPT-4o or Claude Sonnet 4.5 when quality matters. Do not start with a local model. You are buying intelligence per dollar, and the frontier vendors are still the cheapest source of it for v0 traffic.
Gateway. Any LLM gateway. Swapping the base URL is a one-line change to your code, and it buys you logging, fallbacks across providers, response caching, and per-key cost caps. Respan has a free tier if you want to start without a credit card. The point is not which gateway. The point is having one before you have a production incident.
Retrieval. pgvector if your knowledge base is more than about 30 documents. Otherwise, stuff the documents directly into the prompt. Modern context windows are large enough that "just paste it" beats a retrieval system you have not measured. If you already run Postgres, pgvector is one extension and one new column.
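If you do take the pgvector path, the whole thing is a few lines. This is a sketch only, assuming Postgres reached through psycopg and OpenAI embeddings; the table layout, connection string, and embedding model are placeholders, not prescriptions.

```python
# Sketch: pgvector retrieval. Assumes Postgres via psycopg and OpenAI embeddings;
# the table name, dimensions, and connection string are placeholders.
import os

import psycopg
from openai import OpenAI

llm = OpenAI(api_key=os.getenv("LLM_API_KEY"))
conn = psycopg.connect(os.getenv("DATABASE_URL", "postgresql://localhost/app"))
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body text, embedding vector(1536))"
)
conn.commit()


def embed(text):
    # text-embedding-3-small returns 1536-dimensional vectors
    return llm.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding


def to_pgvector(vec):
    # pgvector accepts the text form '[0.1,0.2,...]' cast to vector
    return "[" + ",".join(str(x) for x in vec) + "]"


def add_doc(body):
    conn.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        (body, to_pgvector(embed(body))),
    )
    conn.commit()


def search(question, k=5):
    # Cosine distance (<=>): smaller is closer
    rows = conn.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_pgvector(embed(question)), k),
    ).fetchall()
    return [r[0] for r in rows]
```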
Orchestration. Plain Python or TypeScript. No LangChain. No agent framework. A few helper functions. You will read this code in six months and understand it. You will not understand a chain you wrote against a framework that has since rewritten its abstractions twice.
Memory. A conversation history list passed on each call. No memory layer, no episodic store, no vector memory. Persist the list in Redis or a database row when you need multi-instance. That is the entire memory system at v0.
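When you do cross into multi-instance, persisting that list is small. A sketch assuming redis-py; the key scheme and expiry are placeholders.

```python
# Sketch: persist the history list per session. Assumes redis-py; the key
# scheme and one-day expiry are placeholders.
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def load_history(session_id):
    raw = r.get(f"history:{session_id}")
    if raw is None:
        return [{"role": "system", "content": "You are a helpful assistant."}]
    return json.loads(raw)


def save_history(session_id, history):
    r.set(f"history:{session_id}", json.dumps(history), ex=60 * 60 * 24)
```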
Prompts. Manage them in a prompt registry from day one if more than one person will edit prompts. Otherwise, keep them in code until you hit prompt number four. After that the diffs in git get hard to read and you want versioning and a non-engineer edit path.
Tracing. A tracing tool from day one. The sooner you have traces, the sooner you can debug. Without traces you are guessing at why a model produced a bad answer. With traces you can see the inputs, the prompt, the tool calls, and the output for any session in production.
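Whichever tool you pick, every trace record needs to capture the same handful of fields. Here is a file-based stand-in sketch, purely to show the shape; the field names are illustrative, and a real tracing tool replaces this.

```python
# Sketch: the shape of a trace record. A real tracing tool replaces this;
# the field names and JSONL sink are illustrative.
import json
import time
import uuid


def record_trace(session_id, prompt, messages, tool_calls, output):
    record = {
        "trace_id": str(uuid.uuid4()),
        "session_id": session_id,
        "ts": time.time(),
        "prompt": prompt,        # the prompt version in use
        "messages": messages,    # full inputs to the model
        "tool_calls": tool_calls,
        "output": output,
    }
    with open("traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```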
Evals. One golden dataset of 50 inputs, two LLM-as-judge graders, and one CI gate that blocks a PR when scores drop. From day one. Fifty examples is enough to catch regressions. You add more later when you find the gaps.
Not in v0. Search infrastructure, fine-tuning, agent frameworks, local models, GPU clusters, custom embeddings, multi-agent orchestration, knowledge graphs. Add only when measurements show you need them. Most teams never need most of them.
A working v0, end to end
Roughly 50 lines. This runs.
```python
import os

from openai import OpenAI

# Gateway: any provider. Direct OpenAI for dev, Respan for production.
client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("LLM_API_KEY"),
)


# Conversation history (in-memory; replace with Redis/DB for multi-instance)
def chat(history, user_message):
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply, history


# Optional: simple keyword retrieval over a small KB
KB = {}  # 30-100 documents keyed by topic words


def retrieve(question):
    return [v for k, v in KB.items() if any(w in question.lower() for w in k.split())]


# Optional: an LLM-as-judge eval
def evaluate(input_text, output_text):
    judge = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Did this answer the question? Reply 'pass' or 'fail'.\nQ: {input_text}\nA: {output_text}",
        }],
    )
    return judge.choices[0].message.content.strip().lower() == "pass"


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    reply, history = chat(history, "Hi! What can you do?")
    print(reply)
```

That is a full v0. The next step is wiring tracing, moving the prompt into a registry, and running the eval suite on a CI gate. Chapter 1 walks through that build process in order.
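The CI gate on top of that is small. A sketch that reuses chat() and evaluate() from the code above and assumes a golden.json file of inputs; the 90% threshold is an arbitrary starting point, not a recommendation.

```python
# Sketch: CI gate over the golden dataset. Reuses chat() and evaluate() from
# the v0 code above; golden.json and the 90% threshold are assumptions.
import json
import sys

THRESHOLD = 0.90


def run_golden_set(path="golden.json"):
    with open(path) as f:
        cases = json.load(f)  # e.g. [{"input": "..."}, ...]
    passes = 0
    for case in cases:
        history = [{"role": "system", "content": "You are a helpful assistant."}]
        reply, _ = chat(history, case["input"])
        passes += evaluate(case["input"], reply)
    score = passes / len(cases)
    print(f"golden set: {passes}/{len(cases)} passed ({score:.0%})")
    return score


if __name__ == "__main__":
    sys.exit(0 if run_golden_set() >= THRESHOLD else 1)
```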
What this costs
- LLM: roughly $0.50 to $5 per 1,000 user requests on GPT-4o-mini.
- Gateway: free tier on most platforms, including Respan.
- pgvector: $0 per month if you already run Postgres, around $25 per month on a managed instance.
- Tracing, prompt registry, evals: free tier.
- Total: under $50 per month until you cross 10,000 requests per day.
You can get a real product into users' hands for coffee money. The expensive part is building the product; the infra is not.
When to upgrade
Cost. When frontier model spend gets real, route a fraction of traffic to a smaller model. Then a cheaper provider. Then potentially a local model. Do not start there.
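A sketch of that first step, reusing the client and history from the v0 code above; the 80/20 split and model names are placeholders to tune against your own traces.

```python
# Sketch: route a fraction of traffic to a cheaper model. The 80/20 split and
# model names are placeholders; compare quality in your traces before shifting more.
import random


def pick_model():
    return "gpt-4o-mini" if random.random() < 0.8 else "gpt-4o"


response = client.chat.completions.create(model=pick_model(), messages=history)
```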
Latency. Switch to streaming first. Then to a smaller model. Then potentially to local inference. Most "AI is slow" complaints are fixed by streaming the first token.
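Streaming is one extra parameter on an OpenAI-compatible API. A sketch reusing the client and history from the v0 code above.

```python
# Sketch: stream the response so the first token reaches the user immediately.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=history,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```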
Quality. Add RAG (Chapter 2.3) when the model lacks domain knowledge. Then consider fine-tuning on the narrow tail (Chapter 2.4) once RAG is the bottleneck.
Memory. Upgrade when users complain that the assistant forgets things across sessions. Not before.
Agent framework. Adopt one when plain Python starts to feel repetitive across many similar agents. With one or two agents, stay with helper functions.
Things not in v0 that should be in v1
- Online evals scoring a sample of production traffic.
- PII redaction at the gateway, before logs and before the model.
- Per-customer cost caps so a single user cannot run up your bill.
- Multi-environment prompt deployments (dev, staging, prod) with promotion controls.
These are the first things you will reach for once the v0 has real users. They are easy to add when the foundation is in place.
Closing
The minimum stack lets you focus on the product, not the infra. Once it is running and traced, the rest of this chapter (RAG, agents, fine-tuning, search, local models) becomes a measurement-driven decision instead of a checkbox you ticked because someone on Twitter said you should.
Pick the stack. Ship the v0. Read the traces. Then come back and pick the next thing.
