Most "memory" problems in LLM apps are not actually memory problems. They are conversation-history problems. A developer ships a chatbot, the chatbot forgets what the user said two turns ago, the developer Googles "LLM memory" and ends up installing a vector database, an extraction pipeline, and a memory framework. Then the bug turns out to be that they never passed the previous messages back to the model on the next call.
Before you reach for a memory layer, make sure you actually need one. This chapter walks through the three levels of memory in LLM apps, when each is worth the engineering cost, and what to do when your conversation history outgrows the context window.
The three levels of memory
There are really only three patterns you need to understand.
Stateless. Every request to the model is independent. The model has no idea what you asked it five seconds ago. This is the default behavior of every LLM API. It is also the right answer more often than people think.
Conversation history (session memory). You keep a messages list in memory or in a database, and you append every user turn and assistant reply to it. On each new call, you pass the entire list back to the model. The model now "remembers" everything in the current session because you handed it the transcript. This is what ChatGPT does. It is also what 90 percent of chat products need.
Cross-session memory. Facts about a user persist across sessions. "User is allergic to peanuts." "User prefers Spanish." "User is a vegetarian who is training for a marathon." These facts live in a separate database, and you fetch them at the start of every new session and inject them into the system prompt. This is the only pattern that is genuinely hard to build well.
Decision rules
Use the level that matches your actual product, not the level that sounds impressive.
Stateless is fine for classification, extraction, one-shot Q&A, summarization, translation, code review on a single file. If each request stands alone, do not add state.
Conversation history is fine for chatbots, customer support agents, coding assistants within a session, and almost every "ChatGPT-like" product. Just keep the messages list and pass it every call.
Cross-session memory is needed when users actively complain that the AI forgets them between sessions, AND that complaint is causing churn or lost revenue. Not before. Building cross-session memory is a real engineering project, and most products do not need it.
Conversation history in code
Here is the entire pattern. There is no framework. There is no vector database. There is a list.
```python
from openai import OpenAI

client = OpenAI()

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Alex."))
print(chat("What's my name?"))  # remembers from history
```

The model "remembers" the name because the second call sends the full history, which contains the first message. That is the entire trick. In production, you would store history in a database keyed by session id and load it at the start of every request. The mechanism does not change.
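The production version of that idea can be sketched with an in-memory dict standing in for the database. `SESSIONS`, `get_history`, and `record_turn` are illustrative names, not part of any SDK; in a real service the dict would be a Redis hash or a database table keyed by session id.

```python
# Minimal sketch of per-session history. SESSIONS stands in for a real
# store keyed by session id; the mechanism is identical either way.
SESSIONS = {}

def get_history(session_id):
    # Load (or create) the message list for this session.
    if session_id not in SESSIONS:
        SESSIONS[session_id] = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]
    return SESSIONS[session_id]

def record_turn(session_id, user_message, assistant_reply):
    # Append both sides of the exchange so the next call sees them.
    history = get_history(session_id)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": assistant_reply})
```

Each request handler calls `get_history(session_id)`, passes the list to the model, then calls `record_turn` with the reply. Two different session ids never share state, which is exactly the isolation you want.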
If you are debugging why your bot forgets things, the first thing to check is whether you are actually passing the prior messages on each call. Respan's tracing makes this visible: open any trace and you can replay the exact messages array that was sent to the model on every turn, so you see what the model saw, not what you assumed it saw.
Cross-session memory: the pattern
When you genuinely need facts to persist across sessions, the shape is straightforward. You need three things: a place to store facts, a way to extract new facts after each conversation, and a way to inject relevant facts into the system prompt at the start of the next session.
```python
def start_session(user_id):
    facts = db.fetch_user_facts(user_id)  # returns list of strings
    profile = "\n".join(f"- {f}" for f in facts)
    system_prompt = f"You are a helpful assistant.\n\nUser profile:\n{profile}"
    return [{"role": "system", "content": system_prompt}]

# After the conversation ends, run a separate extraction call:
def extract_facts(transcript):
    # Ask the model: "What new facts about the user did we learn?"
    # Append results to db.user_facts[user_id]
    ...
```

The hard parts are not the storage. The hard parts are deciding what counts as a fact worth remembering, deduplicating against existing facts, handling contradictions ("I moved to Berlin" should overwrite "I live in London"), and keeping the system prompt from growing without bound.
When memory frameworks are worth it
Tools like Zep, Mem0, and LangChain memory exist for a reason. They handle fact extraction, deduplication, and retrieval automatically. They are worth it when you have many users, you need cross-session memory, and you do not want to build the extraction pipeline yourself.
They are overkill when your user base is small, the facts you want to remember are simple and structured (preferred language, allergies, time zone), or you are still validating whether users even want this feature. A Postgres table with a user_id column and a facts JSONB column will get you surprisingly far.
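The "Postgres table" approach can be prototyped in a few lines. The sketch below uses sqlite3 with a JSON-encoded text column as a stand-in for Postgres JSONB; the table and column names are illustrative, and the upsert translates directly to Postgres's `ON CONFLICT` syntax.

```python
import json
import sqlite3

# sqlite standing in for Postgres; swap the connection and keep the shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_facts (user_id TEXT PRIMARY KEY, facts TEXT)")

def save_facts(user_id, facts):
    # Upsert: replace the stored fact list for this user.
    conn.execute(
        "INSERT INTO user_facts VALUES (?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET facts = excluded.facts",
        (user_id, json.dumps(facts)),
    )

def load_facts(user_id):
    row = conn.execute(
        "SELECT facts FROM user_facts WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []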
Context window limits
Every conversation eventually gets long enough that the full history exceeds the model's context window. You have three options.
Truncation. Drop the oldest messages. Easy to implement. Loses information. Fine for most chat products because users rarely care what they said 200 turns ago.
Summarization. When the history gets long, summarize the older portion into a single "summary" turn at the top, and keep the recent turns verbatim. Preserves more context. Costs an extra model call to generate the summary.
Selective retrieval. Treat your conversation history like a knowledge base. Embed every turn, and at each new request, retrieve only the past turns that are semantically relevant. This is RAG over your own conversation. Powerful, but probably more complexity than you need until your conversations span hundreds of turns.
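The truncation option is simple enough to show whole. This sketch keeps the system prompt plus as many recent turns as fit a character budget; a real implementation would count tokens with the model's tokenizer, and `max_chars` is a stand-in for that budget.

```python
def truncate_history(history, max_chars=8000):
    # Keep the system prompt plus the most recent turns that fit the budget.
    # Walk backwards from the newest turn so recency wins over age.
    system, turns = history[0], history[1:]
    kept, total = [], 0
    for msg in reversed(turns):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return [system] + list(reversed(kept))
```

Summarization slots into the same place: instead of silently dropping the oldest turns, you hand them to a separate model call and keep the summary as a single turn after the system prompt.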
The closing rule
Do not add a memory layer until users are actively complaining about the lack of one. Conversation history (a list you append to) covers almost everything. Cross-session memory is real engineering and should be earned by user demand, not added preemptively.
When you do add memory and things go wrong (the bot misremembers a fact, contradicts itself, forgets something it should know), the debugging path starts with looking at exactly what was in the system prompt and message history on the failing turn. That is where tracing earns its keep.
What to read next
- RAG and vector databases: the next chapter, on giving your AI access to knowledge it was not trained on.
- LLM workflows and tracing: Chapter 1.4, the foundation for debugging memory and conversation issues.
- Choose the right stack: picking your tools, including when a memory framework is worth adding.
