The most common over-engineering pattern in AI builds today goes like this: "I'm building an agent, so I need LangChain." That instinct is usually wrong. A lot of teams reach for a framework before they've even written the loop they're trying to abstract, and they end up spending more time fighting the framework than they would have spent writing the thing from scratch.
This chapter is about when an agent framework actually earns its place in your stack, and when plain Python with the model SDK is faster, simpler, and easier to debug.
What an agent framework actually provides
Before we compare options, it helps to know what you're buying. An agent framework typically gives you:
- Planning loops. A built-in "think, act, observe" loop so you don't have to write the while-loop yourself.
- Tool routing. A way to register functions as tools and let the model pick one based on a description.
- Memory abstractions. Conversation history, scratchpads, and vector-store retrieval glued together.
- Callbacks and tracing hooks. Places to plug in logging, evals, and observability.
- Prebuilt agents. Templates like ReAct, Plan-and-Execute, or multi-agent crews you can instantiate with a few lines.
That's the upside. The downside is that most of these abstractions sit on top of the same two or three primitives: a model call, a function call, and a list of messages. If your use case is straightforward, the framework hides things you actually want to see.
The major frameworks at a glance
| Framework | Strength | Weakness | When to use |
|---|---|---|---|
| LangChain | Largest ecosystem, lots of integrations | Heavy abstractions, debugging is hard, prompts hidden | Multi-tool agents with many integrations |
| LlamaIndex | Excellent at RAG over unstructured data | Less general-purpose outside retrieval | RAG-heavy products |
| OpenAI Agents SDK | Native, lightweight, plays well with OpenAI tools | OpenAI-only, newer | OpenAI-first builds |
| CrewAI | Multi-agent orchestration with role definitions | Newer, smaller community | Multi-agent workflows |
| AutoGen | Microsoft research-grade, multi-agent | Heavier, less polished | Research and experimental multi-agent |
None of these are bad. They all solve real problems. The question is whether you have the problem they solve.
The decision rule
Here is the rule I'd give a beginner, and honestly, most teams shipping in production too:
- Try plain Python first. Write the loop. Call the model. Parse the response. Run the tool. Append to history. Loop again. Most agent workflows fit in 100 to 200 lines.
- Add a framework only when you hit the second abstraction pain point. The first one is fine; write a helper function. The second time you find yourself reimplementing something a framework already does well, that's the signal.
- Default to the lightest option that fits. If you're OpenAI-first, the OpenAI Agents SDK is closer to plain Python than LangChain. If you're doing RAG over documents, LlamaIndex is purpose-built and the abstraction earns its keep.
The most common over-engineering trap I see: a LangChain wrapper around a 200-line workflow that calls one model and one tool. The wrapper adds dependencies, hides the prompt, and makes the trace harder to read. It's slower to build than the bare version would have been.
What plain Python actually looks like
Here is a working agent loop using only the openai SDK. It calls a tool, observes the result, and answers. About 35 lines.
from openai import OpenAI
import json

client = OpenAI()

def get_weather(city: str) -> str:
    # In reality, call a weather API.
    return f"It is 18C and cloudy in {city}."

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

while True:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })

That's a real agent. It plans, it calls a tool, it observes, and it answers. You can read every line, you can step through it in a debugger, and you can print any variable. No abstraction is hiding the prompt.
If your project is a chatbot with two or three tools, this is your whole architecture. You don't need more.
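If you do add a second or third tool, the loop itself doesn't change: you add one schema per tool and dispatch by the name the model picked. Here is a minimal sketch of the only part that changes, with a hypothetical get_time tool alongside get_weather (it assumes you also append a matching schema to tools):

def get_time(city: str) -> str:
    # Hypothetical second tool; in reality, look up the local time.
    return f"It is 14:30 in {city}."

# Map tool names to functions so the loop can run whichever tool the model picked.
TOOL_FUNCTIONS = {"get_weather": get_weather, "get_time": get_time}

# Inside the while-loop above, the hard-coded get_weather call becomes a lookup:
for call in msg.tool_calls:
    args = json.loads(call.function.arguments)
    result = TOOL_FUNCTIONS[call.function.name](**args)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})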
When a framework genuinely helps
Frameworks earn their place when the problem outgrows the loop above. Real signals:
- Multi-step planning with retries. You need a planner that decomposes a goal into subtasks, executes each, and replans on failure. Writing this from scratch is doable but tedious, and prebuilt patterns like Plan-and-Execute save real time (there's a sketch of the hand-rolled version after this list).
- Complex memory. You need long-term memory, summarized history, and retrieval over past sessions wired together. LlamaIndex and LangChain both have solid building blocks here.
- Multi-agent coordination. You have multiple specialized agents that talk to each other, hand off work, and share state. CrewAI and AutoGen exist precisely for this.
- A large team that needs prebuilt patterns. If five engineers are each going to build agents and you want consistency, a framework gives them rails.
If none of those describe you, plain Python is probably faster.
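To make the first signal concrete, here is a rough sketch of what a hand-rolled plan-and-execute loop involves: one model call to decompose the goal, one call per subtask, and a replan on failure. The prompts and the plan and run_subtask helpers are illustrative assumptions, not any framework's API.

from openai import OpenAI
import json

client = OpenAI()
MODEL = "gpt-4o-mini"

def plan(goal: str) -> list[str]:
    # One model call that decomposes the goal into ordered subtasks.
    resp = client.chat.completions.create(
        model=MODEL,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f'Return JSON like {{"subtasks": [...]}} breaking this goal '
                       f"into 3 to 5 concrete subtasks: {goal}",
        }],
    )
    return json.loads(resp.choices[0].message.content)["subtasks"]

def run_subtask(task: str, context: str) -> str:
    # Execute one subtask; in a real agent this is where tool calls happen.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Context so far:\n{context}\n\nDo this subtask: {task}",
        }],
    )
    return resp.choices[0].message.content

def plan_and_execute(goal: str, max_replans: int = 2) -> str:
    context = ""
    for attempt in range(max_replans + 1):
        try:
            for task in plan(goal):
                context += f"\n[{task}]\n{run_subtask(task, context)}"
            return context
        except Exception as exc:  # a failed tool, an unparseable plan, etc.
            context += f"\n[attempt {attempt} failed: {exc}]"
    return context

Writing this is doable, as the list above says, but the retry and replanning plumbing is exactly the tedium a prebuilt Plan-and-Execute pattern saves you.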
Whatever you choose, traces have to make sense
The single biggest cost of picking a heavy framework is debugging. When something goes wrong in production, you need to see the actual prompt the model received, the tool calls it made, the responses it got back, and the time each step took. Frameworks that hide prompts behind abstractions become debugging traps unless you pair them with an observability layer that auto-instruments those steps.
This is true for plain Python too. Even a 30-line loop benefits from tracing once it runs in production with real traffic. Tools like Respan auto-instrument the OpenAI SDK and most agent frameworks, so you get a structured trace of every step without rewriting your code. Whatever stack you pick, make sure you can answer "what did the model see and what did it do" in under a minute.
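Even before you adopt a dedicated tool, you can get a long way with a thin wrapper around the model call that records what went in, what came out, and how long it took. A minimal hand-rolled sketch; the traced_completion helper and the trace.jsonl file are our own conventions, not part of any SDK:

import json
import time
from openai import OpenAI

client = OpenAI()

def traced_completion(**kwargs):
    # Wrap one model call: record the messages, the raw response, and the latency.
    start = time.time()
    resp = client.chat.completions.create(**kwargs)
    record = {
        "latency_s": round(time.time() - start, 3),
        "model": kwargs.get("model"),
        "messages": kwargs.get("messages"),
        "response": resp.choices[0].message.model_dump(),
        "usage": resp.usage.model_dump() if resp.usage else None,
    }
    # One JSON line per step; any log pipeline or a plain text editor can read it.
    with open("trace.jsonl", "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return resp

Swap client.chat.completions.create for traced_completion in the loop above and you can already answer that question from a single file.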
Closing
The honest summary: most teams should start with plain Python and the model SDK, add a framework when they hit the second pain point, and instrument tracing from day one. Frameworks are tools, not architectural decisions. Pick the lightest one that solves a problem you actually have.
Next, we'll look at how memory works in practice, because that's where most agents stop feeling smart.
