Modern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spend
Agentic AI

Memory Management for AI Agents: Short-Term, Long-Term, and Beyond

How AI agents remember across a turn and across sessions: short-term context, long-term episodic, semantic, and procedural memory, and how it differs from RAG.

Space & Story Team·June 15, 2026·11 min read
AI agent memorylong-term memoryshort-term memoryvector storesagentic design patternscontext window

Based on Agentic Design Patterns by Antonio Gulli (Springer). All book royalties go to Save the Children.

Space & Story Team·June 15, 2026·11 min read
Memory Management for AI Agents: Short-Term, Long-Term, and Beyond

Key Takeaway

AI agent memory has two layers: short-term memory (the context window holding the current conversation) and long-term memory (a vector store that persists episodic, semantic, and procedural memory across sessions). Writing memory is selective extraction; retrieving it ranks candidates on relevance and recency. Memory differs from RAG in that it stores the agent's own interaction history, not shared knowledge.

Why This Matters for Enterprise AI

An agent without memory introduces itself to your customer every single message. It cannot recall the account number the user gave two turns ago, or last week's ticket about the same broken integration. Every interaction starts from zero, and your users feel it.

Memory is what turns a stateless model into something that accumulates context, learns a user's preferences, and gets better at a task the more it runs. It is also where most agent projects break. Teams treat the context window as if it were memory, then watch the agent forget everything the moment a session ends.

The fix is a deliberate memory architecture, and it splits cleanly into two layers: short-term memory that lasts a conversation, and long-term memory that persists across them. If you have read what makes an AI system an agent, memory is the capability behind the "learn and get better" step of the agentic loop.

What Is AI Agent Memory?

AI agent memory is the set of mechanisms an agent uses to retain and recall information, both within a single conversation and across separate sessions over time. Reasoning is only ever as good as the context it runs on, which is why Antonio Gulli, in Agentic Design Patterns, frames memory management as a core pattern. A brilliant model with no memory is a brilliant amnesiac.

The cleanest way to think about it borrows from human cognition. You hold a phone number in working memory just long enough to dial it, then it is gone. You also carry durable knowledge for years: facts you know, events you lived through, skills you can perform without thinking. Agent memory mirrors that split. LangGraph's memory documentation draws the same line. It separates short-term memory, scoped to one conversation thread, from long-term memory shared across all of them.

A small glowing scratchpad node holding a single conversation thread on one side, connected to a larger persistent vault of stored memory records on the other, illustrating short-term versus long-term memory for an AI agent
Short-term memory is the agent's scratchpad for the current conversation; long-term memory is the persistent store it writes to and retrieves from across every session.

The distinction is not academic. Short-term and long-term memory use different storage and different retrieval methods, and they fail in different ways. Conflating the two is the single most common mistake in agent design.

Short-Term Memory: The Context Window

At its simplest, short-term memory is the information an agent holds for the duration of a single conversation. In practice this is the context window, the finite span of tokens the model can attend to on any given call. It is the scratchpad where the running dialogue lives: the user's messages, the agent's replies, this turn's tool results, and any scratch reasoning.

This memory is fast, automatic, and free in the sense that you do not build it. You get it by passing the conversation history back into the model on each call. It is also brutally limited, because every model has a hard token ceiling, and even when the window is large, stuffing it full degrades quality. The model attends less reliably to information buried in the middle of a long context, a failure mode well-documented enough to have its own name in the literature.

So the central job of short-term memory management is fitting what matters into a window that cannot hold everything. Three techniques carry most of the load:

  • Truncation. Keep the most recent N messages and drop the oldest. Simple, cheap, and lossy, since the dropped turns are gone for good.
  • Summarization. When the conversation grows long, compress the older portion into a running summary and keep the recent turns verbatim. Anthropic calls this compaction in its guide on effective context engineering for AI agents: summarize the conversation as it nears the limit, then reinitialize a fresh window with that summary plus the latest exchange.
  • Selective retrieval. Pull only the specific earlier messages relevant to the current turn, rather than replaying the whole transcript.

Summarization is where the craft lives. Compress too aggressively and you lose the one detail that mattered. Anthropic's guidance is to tune the summary prompt for recall first, capturing everything relevant, then tighten for precision once you trust it is not dropping critical context.

Long-Term Memory: Persisting Across Sessions

By contrast, long-term memory is what survives after the conversation ends. Close the chat, come back tomorrow, and the agent still knows your name, your preferences, and what you worked on last time. None of that lives in the context window, which resets every session. It lives in an external store the agent writes to and reads from deliberately.

That store is almost always a vector database. Each memory is saved as an embedding, a numeric vector capturing its meaning, alongside the original text and metadata like a timestamp and a user ID. When the agent needs to remember, it embeds the current situation and retrieves the stored memories whose vectors sit closest to it. This is semantic search, and it is why an agent can recall "the user dislikes morning meetings" when the conversation is about scheduling, even though those exact words were never repeated.

This persistent layer comes in three classic types, a taxonomy drawn straight from cognitive science and adopted across the agent frameworks.

Episodic Memory (Past Events)

Episodic memory stores specific things that happened: past conversations, completed tasks, decisions the agent made and how they turned out. "Last week the user asked me to draft a refund email, and they approved the second version" is one such memory. It lets an agent recall its own history and reuse what worked, which is the raw material for an agent that improves with experience.

Semantic Memory (Facts)

Semantic memory stores facts, stripped of when or how the agent learned them. The user's job title, their billing currency in euros, the fact that their production database runs on postgres. These are durable truths the agent treats as background knowledge. This is the layer that powers personalization, because it remembers a user as a person rather than as a single transcript.

Procedural Memory (Learned Skills)

Procedural memory is the agent's knowledge of how to do things: its rules, its workflows, the refined instructions that govern its behavior. Some of this is baked into the system prompt and the agent's code. The more interesting kind is learned, where the agent updates its own operating instructions from feedback, so a correction you make today changes how it behaves next month. This is the memory type that most resembles an agent getting better at its job over time.

How Memory Works: Writing and Retrieving

A working memory system runs two loops, and they are easy to get backward. The write loop decides what is worth remembering, and the read loop decides what is worth recalling. The answer to both is never "everything."

Writing memory. Not every message deserves to be a permanent memory, or your store fills with noise and retrieval quality collapses, so production systems are selective about what they keep. One approach writes on a schedule, summarizing each session into a few durable facts when it ends. Another lets the agent decide in the moment, calling a save_memory tool when it judges something worth keeping. Either way, the write step usually involves extraction: distilling a long exchange down to the handful of facts, events, or preferences that will matter later, then embedding and storing those.

Retrieving memory. On each turn, the agent embeds the current context and queries the vector store for the closest matches. But pure semantic similarity is rarely enough on its own. The breakthrough pattern, established by the Generative Agents research from Stanford and Google, scores each candidate memory on a weighted blend of three signals:

  • Relevance measures how semantically close the memory is to the current situation, computed as the cosine similarity between their embeddings.
  • Recency favors memories accessed recently, so fresh context outranks something from months ago when both are otherwise equal.
  • Importance lets you mark certain memories as core rather than mundane, so a major decision outweighs a passing remark.

Ranking on relevance and recency together is what keeps retrieval useful as the store grows. A query about scheduling should surface a recent preference about meeting times, not a semantically similar but stale note from a year ago.

Enterprise reality: A support agent for a B2B SaaS company should remember, across sessions, that this customer runs the on-premise deployment, escalated a billing dispute in March, and prefers Slack over email. That context turns a generic bot into one that feels like it knows the account. Without long-term memory, every ticket reopens from scratch and the customer re-explains their setup for the fifth time, which shows up directly in your customer satisfaction scores. The memory store is the difference between an agent that supports a relationship and one that answers a stranger.

Code Example (Abbreviated)

Here is the core of a long-term memory loop with a vector store: save a memory, then later retrieve the most semantically relevant ones for a new situation. The framework specifics vary, but every memory system reduces to these two moves.

# Abbreviated — illustrative store-and-retrieve, not production code
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small") memory = InMemoryVectorStore(embeddings)

# WRITE: persist a distilled fact as a long-term memory memory.add_texts( texts=["User prefers concise answers and dislikes morning meetings."], metadatas=[{"user_id": "u_123", "type": "semantic"}], )

# RETRIEVE: pull the memories most relevant to the current turn relevant = memory.similarity_search( "Schedule a sync with the user for next week", k=3, # top-k closest memories by embedding similarity )

context = "\n".join(doc.page_content for doc in relevant) # 'context' is injected into the prompt so the agent answers with memory

The same shape holds across the major agent frameworks, whether you reach for LangGraph's store, CrewAI's memory module, or another toolkit. You embed on write, you rank on read, and the vector store does the similarity math in between.

Memory Versus Retrieval: How They Differ

Memory and retrieval-augmented generation (RAG) look identical under the hood. Both embed text, both store vectors, both retrieve by semantic similarity. Plenty of teams conflate the two, and the confusion leads to muddled architectures. The difference is what the store contains and who owns it.

Retrieval for AI agents pulls from an external, mostly static knowledge base: your documentation, a product catalog, a body of policy PDFs. It answers "what does the company know about this topic?" That corpus is authored by people and shared across all users.

Memory pulls from a store the agent itself wrote, capturing what happened in its own interactions. It answers "what do I know about this user and our history?" That corpus is generated by the agent, grows turn by turn, and is usually scoped to one user. So retrieval gives the agent shared knowledge, while memory gives it personal continuity. A mature agent uses both, often through the same vector database, for different jobs.

When to Invest in Memory (and When Not To)

Persistent memory carries a cost in storage, latency, and compliance, so add it where it pays off and not by reflex.

  • Build long-term memory when the agent serves the same users repeatedly and personalization or continuity drives the value: assistants, tutors, long-running support, anything with a relationship.
  • Invest in episodic memory when you want the agent to learn from its own past runs and reuse what worked.
  • Lean on procedural memory when the agent's instructions need to evolve from feedback rather than staying frozen in a prompt.

Other agents are better off without it.

  • Skip long-term memory for stateless, single-shot tasks. A one-off classifier or a translation endpoint has no relationship to remember.
  • Be cautious in regulated or privacy-sensitive settings, where persisting user data carries compliance weight. Memory means storing personal information, with everything that implies for retention and the right to be forgotten.
  • Do not reach for a vector store when short-term context already covers the job. If everything the agent needs fits in one conversation, a persistent memory layer is complexity you do not need yet.

There is one more design question worth raising: what should the agent be allowed to forget, and when? An ever-growing store gets slower and noisier, so production systems prune, decay, or consolidate old memories. Forgetting, done deliberately, is a feature. It is also what lets memory scale into multi-agent systems, where a shared store becomes the common ground a team of agents coordinates through.

Key Takeaways

  • AI agent memory splits into short-term memory, the context window holding the current conversation, and long-term memory, an external store that persists across sessions.
  • Short-term memory management is about fitting what matters into a finite window, using truncation, summarization (compaction), and selective retrieval.
  • Long-term memory lives in a vector store and comes in three types: episodic (past events), semantic (facts), and procedural (learned skills and instructions).
  • Memory runs two loops: a selective write loop that extracts and stores what is worth keeping, and a read loop that ranks candidates on relevance, recency, and importance.
  • Memory is not retrieval. Classic RAG pulls shared, authored knowledge, while memory pulls the agent's own interaction history, scoped to the user. A mature agent uses both.

Is your site invisible to AI search?

Get a free AEO infrastructure audit and find out what your competitors are doing that you're not.

Get Your Free Audit
Quick answers

Frequently asked.