Your support agent answers the same onboarding question for the 400th time. Not because the answer is hard to find - it is sitting in three different knowledge base articles. The problem is that the agent has no memory of the last 399 times it handled this exact scenario, no record of which answer format actually resolved the ticket, and no awareness that this particular customer asked the same question two weeks ago and got a confusing response.

This is the gap between RAG and memory. RAG gives your agent access to documents. Memory gives it continuity. Most teams building AI agents today have invested heavily in the first and barely thought about the second.

The Memory Taxonomy That Actually Matters

Lilian Weng's foundational framework maps agent memory to human cognitive science: short-term (working memory), long-term (persistent storage), and the often-ignored episodic memory (specific past experiences). This taxonomy is useful, but the practical question is simpler: what does your agent need to remember, for how long, and how fast does it need to recall it?

Short-term memory is your conversation buffer. It holds the current interaction context - the last N messages, the current task state, any intermediate results. In most frameworks, this is just the messages array you pass to the LLM. The constraint is the context window. Even with 200K token windows, you hit degradation long before you hit the limit. Research from LMSys and others shows that retrieval accuracy drops significantly when relevant information sits in the middle of a long context rather than at the beginning or end.

Long-term memory persists across sessions. This is where you store user preferences, learned facts, extracted entities, and anything the agent should recall next week or next month. The implementation is usually a vector store (embeddings + similarity search) or a structured key-value store, sometimes both.

Episodic memory is the least understood and most powerful type. It records what happened - not just facts, but sequences of events with their outcomes. "Last time we tried approach X for this customer, it failed because of Y, so we switched to Z." This is the difference between an agent that has knowledge and one that has experience.

Here is how these map to infrastructure:

Memory Type | Persistence          | Typical Store            | Retrieval Speed | When You Need It
Short-term  | Current session      | In-memory buffer         | Instant         | Always
Long-term   | Weeks to permanent   | Vector DB, KV store      | 50-500ms        | Multi-session agents
Episodic    | Permanent with decay | Event store + embeddings | 100-800ms       | Agents that learn from experience

Short-Term Memory: The Buffer Window Problem

Most agent frameworks handle short-term memory with a sliding window - keep the last N messages, drop everything else. This works until it does not.

The failure mode is predictable. A user mentions their company name in message 3, discusses requirements in messages 4-8, then asks in message 15 "so can you do that for us?" By message 15, the company name has scrolled out of the buffer. The agent either hallucinates a company name or asks again, destroying user trust either way.

Three patterns that work better than a naive sliding window:

1. Summarization buffers. Instead of dropping old messages, compress them. After every N turns, generate a summary of the conversation so far and prepend it to the context. LangChain's ConversationSummaryBufferMemory implements this pattern. The tradeoff is latency - you are making an extra LLM call to generate the summary - but for longer conversations, the context quality improvement is worth it.

2. Entity extraction. Parse each message for named entities (people, companies, products, dates) and maintain a separate entity store that persists for the full session. When the user says "they" in message 20, you can resolve the reference because you have the entity graph from message 3. Harrison Chase wrote about this pattern in the context of LangGraph's memory management, and it is one of the highest-impact improvements you can make to a conversational agent.

3. Scoped retrieval. Not everything in the conversation is equally important. Tag messages by topic or intent, and when context gets long, retrieve the most relevant prior messages rather than the most recent ones. This is essentially RAG over your own conversation history, and it works surprisingly well for task-oriented agents where the user jumps between topics.
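The first pattern fits in a few lines. In this sketch, `summarize` is a placeholder for the LLM call a real implementation would make:

```python
def summarize(messages):
    # Placeholder: a real version would call an LLM to compress these turns.
    return f"[summary of {len(messages)} prior item(s)]"

class SummaryBuffer:
    def __init__(self, max_recent=6):
        self.max_recent = max_recent
        self.summary = ""   # running summary of evicted turns
        self.recent = []    # the last few raw messages

    def add(self, message):
        self.recent.append(message)
        if len(self.recent) > self.max_recent:
            # Fold evicted turns (plus any prior summary) into a fresh summary.
            evicted = self.recent[: -self.max_recent]
            self.recent = self.recent[-self.max_recent :]
            prior = [{"role": "system", "content": self.summary}] if self.summary else []
            self.summary = summarize(prior + evicted)

    def context(self):
        # Prepend the summary so old information survives window eviction.
        prefix = [{"role": "system", "content": self.summary}] if self.summary else []
        return prefix + self.recent
```

The extra LLM call happens at eviction time, not on every turn, which keeps the added latency bounded.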
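The second pattern, reduced to a sketch. The regex stands in for a real NER model or LLM extraction step and only catches capitalized names, but the shape of the store is the point: entities persist for the whole session regardless of what scrolls out of the buffer.

```python
import re

# Naive stand-in for NER: capitalized single- or multi-word names.
ENTITY_PATTERN = re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\b")

class EntityStore:
    def __init__(self):
        self.entities = {}  # name -> first message index where it appeared

    def ingest(self, index, message):
        for name in ENTITY_PATTERN.findall(message):
            self.entities.setdefault(name, index)

    def known(self):
        return sorted(self.entities)
```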
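And a sketch of the third pattern, with word overlap standing in for the embedding similarity a real implementation would use:

```python
def overlap_score(query, message):
    # Crude relevance: fraction of query words present in the message.
    q, m = set(query.lower().split()), set(message.lower().split())
    return len(q & m) / (len(q) or 1)

def retrieve_relevant(history, query, k=3):
    # history is a list of (turn_index, message) pairs for the session.
    ranked = sorted(history, key=lambda pair: overlap_score(query, pair[1]), reverse=True)
    # Return the top-k matches in their original conversational order.
    return sorted(ranked[:k], key=lambda pair: pair[0])
```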

The business case for getting short-term memory right is direct: every time your agent loses context mid-conversation, users either repeat themselves (wasting time) or abandon the interaction (losing value). Intercom's engineering blog has published data showing that context loss is the number one driver of negative ratings for AI support agents.

Long-Term Memory: When pgvector Is Enough (and When It Is Not)

Long-term memory is where teams overthink infrastructure and underthink what to actually store. The question is not "which vector database should we use?" The question is "what information, stored in what format, will make this agent measurably better at its job?"

Start with what to store. For a customer-facing agent, the high-value memory items are:

  • User preferences (communication style, timezone, product tier)
  • Resolved issues (what broke, what fixed it, for this specific user)
  • Stated constraints ("I can never deploy on Fridays," "our compliance team requires X")
  • Interaction patterns (this user prefers detailed explanations vs. quick answers)

For internal agents, add:

  • Process knowledge (the actual steps that worked, not the documented process)
  • Exception handling (what to do when the standard procedure does not apply)
  • Team context (who owns what, who to escalate to)

Now the infrastructure question. Redis has published a solid guide on using their stack for both short-term and long-term agent memory, and their architecture is clean. But for most teams, pgvector running alongside your existing PostgreSQL instance handles long-term memory fine up to about 10 million vectors. The pgvector benchmarks show sub-100ms retrieval for datasets under 5M vectors with proper indexing (HNSW), which is fast enough for any conversational agent.
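The setup behind those numbers is small. Table and column names below are illustrative, but the `vector` type, the HNSW index with cosine ops, and the `<=>` cosine-distance operator are standard pgvector syntax:

```python
# Illustrative pgvector schema for long-term agent memory.
CREATE_TABLE = """
CREATE TABLE agent_memories (
    id        bigserial PRIMARY KEY,
    user_id   text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536)
);
"""

CREATE_INDEX = """
CREATE INDEX ON agent_memories
USING hnsw (embedding vector_cosine_ops);
"""

# Nearest-neighbor lookup scoped to one user; %s placeholders are for a
# driver such as psycopg.
RETRIEVE = """
SELECT content
FROM agent_memories
WHERE user_id = %s
ORDER BY embedding <=> %s
LIMIT 5;
"""
```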

When to upgrade to dedicated infrastructure:

Signal                                | What It Means                         | Consider
p99 retrieval > 200ms                 | Index is too large for Postgres       | Qdrant, Pinecone, Weaviate
Multi-tenant isolation required       | Shared Postgres is a compliance risk  | Dedicated vector DB per tenant
Vector count > 10M                    | HNSW index memory pressure            | Purpose-built vector store
Need hybrid search (vector + keyword) | pgvector's keyword support is limited | Elasticsearch + vector, or Vespa

The mistake I see most often is teams jumping to Pinecone or Weaviate before they have validated that memory actually improves their agent's performance. Run the experiment with pgvector first. Measure recall accuracy. If the bottleneck turns out to be retrieval quality rather than retrieval speed, a better embedding model will help more than a faster database.

Episodic Memory: Teaching Agents to Learn from Experience

Episodic memory is where things get interesting - and where most production agents have a complete blind spot. The concept comes from cognitive psychology: humans do not just remember facts, they remember events. "The meeting where the client got angry about the delayed shipment" is an episodic memory that contains context, emotion, causality, and lessons learned, all bundled together.

For AI agents, episodic memory means storing structured records of past task executions:

```python
episode = {
    "task": "resolve_billing_dispute",
    "customer_id": "cust_8821",
    "timestamp": "2026-03-10T14:22:00Z",
    "context": "Customer charged twice for annual plan",
    "actions_taken": [
        "verified duplicate charge in Stripe",
        "issued refund for second charge",
        "applied 10% courtesy credit"
    ],
    "outcome": "resolved",
    "customer_satisfaction": "positive",
    "lessons": "Always check for pending refunds before issuing new ones"
}
```

Alok Mishra's enterprise memory stack writeup describes a three-tier architecture where episodic memory sits between the working memory layer and the permanent knowledge layer. The episodes are embedded and indexed so the agent can retrieve relevant past experiences when facing similar situations.

The practical implementation pattern that works:

1. Record episodes automatically. After every completed task, extract a structured episode from the interaction trace. This can be an LLM call that summarizes what happened, what worked, and what did not.

2. Index episodes by situation, not just content. Embed the task type, customer segment, and outcome together so that retrieval returns episodes with similar circumstances, not just similar words.

3. Inject relevant episodes into the prompt. When the agent starts a new task, retrieve the 2-3 most relevant past episodes and include them as "prior experience" in the system prompt. This is different from RAG - you are not retrieving documents, you are retrieving the agent's own past behavior and its consequences.

4. Decay old episodes. Not every past experience stays relevant. Implement a decay function that reduces the retrieval weight of older episodes, or set TTLs based on task type. A billing resolution from 6 months ago is probably still relevant. A temporary workaround for a bug that has since been fixed is not.
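Steps 2-4 can be sketched together. Here, tag matching on the situation fields stands in for embedding similarity, and the 90-day half-life is an assumed default you would tune per task type:

```python
HALF_LIFE_DAYS = 90.0  # assumed half-life for decay; tune per task type

def decay_weight(age_days):
    # Exponential decay: an episode's weight halves every HALF_LIFE_DAYS.
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def situation_similarity(episode, situation):
    # Stand-in for embedding similarity over the situation fields.
    keys = ("task", "customer_segment")
    matches = sum(1 for k in keys if episode.get(k) == situation.get(k))
    return matches / len(keys)

def retrieve_episodes(episodes, situation, now_days, k=3):
    # Score = situational similarity discounted by age, then take top-k.
    scored = [
        (situation_similarity(ep, situation) * decay_weight(now_days - ep["day"]), ep)
        for ep in episodes
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ep for score, ep in scored[:k] if score > 0]
```

The retrieved episodes then get formatted into the system prompt as "prior experience" for step 3.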

The Tredence cognitive architecture paper frames this as a "memory lakebase" - a unified storage layer that handles both semantic and episodic retrieval. The concept is sound, though the implementation complexity is real. If you are early in your agent journey, start with structured logging of task outcomes and manual retrieval before building automated episodic memory pipelines.

The Memory Architecture Decision Tree

Here is the framework we use when designing memory systems for production agents:

Start here: Does your agent need to remember anything between sessions?

  • No - Use a simple conversation buffer. You are done.
  • Yes - Continue.

What does it need to remember?

  • User facts and preferences - Key-value store with user ID indexing. Vector search is overkill for structured data. A PostgreSQL table with JSONB columns works fine.
  • Unstructured knowledge from interactions - Vector store (pgvector to start). Embed conversation summaries, not raw messages.
  • What it did and what happened - Episodic memory. Structured event store + embedding index for situation-based retrieval.
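For the first branch, the JSONB approach is a single table. Names here are illustrative; the upsert uses standard PostgreSQL JSONB operators to merge one preference without overwriting the rest:

```python
# Illustrative JSONB preference store: structured user facts need key-value
# lookup, not vector search.
CREATE_PREFS = """
CREATE TABLE user_memory (
    user_id     text PRIMARY KEY,
    preferences jsonb NOT NULL DEFAULT '{}'
);
"""

# Upsert one preference; || merges the new key into the existing document.
SET_PREF = """
INSERT INTO user_memory (user_id, preferences)
VALUES (%s, jsonb_build_object(%s::text, %s::jsonb))
ON CONFLICT (user_id)
DO UPDATE SET preferences = user_memory.preferences || EXCLUDED.preferences;
"""
```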

How fast does recall need to be?

  • Under 100ms (real-time conversation) - Keep a hot cache (Redis) of frequently accessed memories. Lazy-load from the vector store.
  • Under 500ms (async tasks) - pgvector with HNSW indexing handles this easily.
  • Seconds are fine (batch processing) - Store everything in Postgres, retrieve with standard queries.
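The hot-cache tier for the real-time case reduces to a lazy-loading read-through cache. In this sketch a dict stands in for Redis and `slow_fetch` stands in for the vector-store query:

```python
class MemoryCache:
    def __init__(self, slow_fetch):
        self.slow_fetch = slow_fetch  # fallback lookup, e.g. a pgvector query
        self.hot = {}                 # in production: Redis with a TTL
        self.misses = 0

    def get(self, key):
        if key not in self.hot:
            self.misses += 1
            self.hot[key] = self.slow_fetch(key)  # lazy-load on first access
        return self.hot[key]
```

Frequently accessed memories pay the slow-path cost once; everything after that is an in-memory hit.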

OpenAI's memory patterns discussion from their Build Hour reinforces this incremental approach. The teams that ship working memory systems start small and add complexity only when they can measure the improvement.

The Compounding Cost of Memoryless Agents

The business case for agent memory is not about technology - it is about compounding waste. Every time a memoryless agent:

  • Asks a customer to repeat information they already provided
  • Fails to apply a lesson from a previous interaction
  • Ignores a preference the user stated last week
  • Retries an approach that failed for this exact customer before

...it costs you twice. Once in the direct cost of the extra tokens, compute, and time. And again in the trust erosion that makes users doubt whether the agent is actually helpful.

Stripe's engineering team has written about how their internal tools track "context reconstruction cost" - the time humans spend re-explaining things to systems that should already know. This metric applies directly to AI agents. If your agent forces users to re-establish context in more than 10% of interactions, memory is not a nice-to-have. It is the highest-ROI investment you can make in agent quality.

The path forward is straightforward. Start with buffer windows for short-term memory. Add a pgvector-backed long-term store when you have multi-session users. Build episodic memory when your agent runs complex, multi-step tasks where past outcomes inform future decisions. Measure recall accuracy at every stage. And resist the urge to build the full memory stack before you have validated that each layer actually makes your agent better at its job.