
Why AI Agents Need Better Memory

May 21, 2026 · agents, memory, infrastructure

The most common complaint about AI agents isn't that they're too slow or too expensive. It's that they're forgetful.

You spend 20 minutes establishing context in a conversation, close the tab, come back the next day, and start from zero. For single-turn question answering, this doesn't matter. For agents that operate over days, weeks, or complex multi-session workflows, it's a fundamental problem.

The Context Window Isn't Enough

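LLMs have a context window: a fixed number of tokens they can "see" at once. GPT-4 Turbo has 128k tokens. Recent Claude models have 200k. That sounds like a lot, but it's a hard limit. Once a conversation outgrows it, the oldest tokens have to be dropped or summarized, and nothing outside the window exists for the model.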
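Here is a rough sketch of what "nothing outside it exists" means in practice. It assumes the common pattern of keeping only the newest messages that fit a token budget. The function name fit_to_window is made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages whose combined size fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = len(message.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my name is Ada", "I like Rust", "what is my name"]
print(fit_to_window(history, max_tokens=7))
# prints ['I like Rust', 'what is my name']
```

The message that would answer the final question is exactly the one that got dropped.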

The real issue isn't size. It's structure. A context window is a flat list of tokens. Useful memory has hierarchy, relevance, and recency decay. You don't remember your first grade classroom with the same fidelity as your last conversation. Memory systems should work the same way.
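One simple way to model recency decay is an exponential half-life applied to a relevance score. This is only a sketch: the function name, the one-week default half-life, and the choice to multiply relevance by the decay factor are my assumptions for illustration, not a description of how Engram scores memories.

```python
ONE_WEEK_SECONDS = 7 * 24 * 3600


def decayed_score(relevance: float, age_seconds: float,
                  half_life_seconds: float = ONE_WEEK_SECONDS) -> float:
    """Scale a relevance score so it halves every `half_life_seconds`."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    age = max(age_seconds, 0.0)  # treat clock skew as "just happened"
    return relevance * 0.5 ** (age / half_life_seconds)
```

With a one-week half-life, a 30-day-old memory keeps roughly 5% of its original weight. It can still surface, but only if its relevance is much higher than anything recent.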

What Agents Actually Need

After building Engram, I think there are three distinct layers any serious agent memory system needs:

1. Episodic memory — what happened, in order. The log of events. This is the raw material.

2. Semantic memory — what things mean. Compressed summaries, facts, and relationships extracted from episodes. This is what lets you avoid re-reading 10k tokens of chat history to answer "what's the user's preferred programming language?"

3. Working memory — what's relevant right now. A runtime-loaded context window populated from the other two layers based on the current task.
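The three layers can be sketched as a small data structure. Every name here (Episode, AgentMemory, record, learn, working_set) is hypothetical and chosen for illustration. A real system would extract facts automatically and pick working memory by relevance rather than by explicit keys.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One raw event, kept in arrival order (episodic layer)."""
    text: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class AgentMemory:
    episodes: list[Episode] = field(default_factory=list)  # episodic: append-only log
    facts: dict[str, str] = field(default_factory=dict)    # semantic: extracted facts

    def record(self, text: str) -> None:
        self.episodes.append(Episode(text))

    def learn(self, key: str, value: str) -> None:
        self.facts[key] = value

    def working_set(self, fact_keys: list[str], recent: int = 3) -> list[str]:
        """Working memory: the small slice loaded for the current task."""
        lines = [f"{k}: {self.facts[k]}" for k in fact_keys if k in self.facts]
        latest = self.episodes[-recent:] if recent > 0 else []
        lines += [e.text for e in latest]
        return lines
```

Answering "what's the user's preferred programming language?" then becomes a lookup in facts instead of a scan over every episode.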

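Most "memory" solutions I've seen only implement the episodic layer. They just append to a text file and run RAG over it. That works for demos. It doesn't scale, because every question has to be answered from an ever-growing pile of raw history instead of compressed facts.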

The Hard Part Is Retrieval, Not Storage

Storing memories is easy. Deciding which memories to load — and when — is hard.

Bad retrieval manifests in two ways:

  • Recency bias: always pulling the most recent memories, missing older but more relevant ones
  • Topic mismatch: embedding-based similarity works great for semantic search but fails for procedural memories ("how does this user like their code reviewed?")

Engram uses a hybrid retrieval approach: semantic similarity for facts, recency-weighted scoring for episodic recall, and explicit key-value storage for high-priority preferences and settings. The retrieval layer decides which combination to use based on the query type.
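To make the routing idea concrete, here is a toy router that picks retrieval strategies from the query text. This is not Engram's code. The Strategy names, the cue-word lists, and choose_strategies are all assumptions for illustration, and a real router would more likely use a classifier than substring matching.

```python
from enum import Enum


class Strategy(Enum):
    KEY_VALUE = "key_value"  # explicit preferences and settings
    RECENCY = "recency"      # episodic recall, recency-weighted
    SEMANTIC = "semantic"    # fact lookup by embedding similarity


# Hypothetical cue phrases, matched as lowercase substrings.
PREFERENCE_CUES = ("prefer", "setting", "always", "never", "like their")
EPISODIC_CUES = ("last time", "yesterday", "earlier", "previously", "when did")


def choose_strategies(query: str) -> list[Strategy]:
    """Return the retrieval strategies to combine for this query."""
    q = query.lower()
    chosen: list[Strategy] = []
    if any(cue in q for cue in PREFERENCE_CUES):
        chosen.append(Strategy.KEY_VALUE)
    if any(cue in q for cue in EPISODIC_CUES):
        chosen.append(Strategy.RECENCY)
    # Semantic similarity is always included as the fallback.
    chosen.append(Strategy.SEMANTIC)
    return chosen
```

With these cues, "How does this user like their code reviewed?" routes to the key-value store first and only then to semantic search, which is the case pure embedding similarity tends to miss.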

What I Got Wrong Initially

My first version of Engram tried to be too clever. I had a graph-based memory structure where entities were nodes and relationships were edges. It was elegant on paper.

In practice, it was slow to write to, expensive to query, and brittle when the same entity was referenced in multiple ways ("Saalik", "the user", "he").

I ripped it out and went back to vectors + metadata filters. Sometimes the boring solution is right.
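For anyone who hasn't seen the pattern, "vectors + metadata filters" boils down to something like the sketch below: filter on exact metadata matches, then rank what's left by cosine similarity. The record layout, the search function, and the toy 3-dimensional vectors are assumptions for illustration. A real setup would use an embedding model and a vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vec, filters=None, top_k=3):
    """Filter on exact metadata matches, then rank survivors by cosine similarity."""
    filters = filters or {}
    candidates = [
        r for r in records
        if all(r["meta"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return candidates[:top_k]


# Toy vectors stand in for real embeddings.
records = [
    {"text": "Prefers Rust for CLI tools", "vector": [0.9, 0.1, 0.0],
     "meta": {"user": "saalik", "kind": "fact"}},
    {"text": "Asked about Rust lifetimes", "vector": [0.8, 0.2, 0.1],
     "meta": {"user": "saalik", "kind": "episode"}},
    {"text": "Prefers Go for services", "vector": [0.7, 0.3, 0.0],
     "meta": {"user": "other", "kind": "fact"}},
]

hits = search(records, [1.0, 0.0, 0.0], filters={"user": "saalik", "kind": "fact"})
print([h["text"] for h in hits])
# prints ['Prefers Rust for CLI tools']
```

The entity-resolution problem doesn't disappear, but it moves into a plain metadata field that is cheap to write and cheap to filter on.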

Where This Goes

Memory is infrastructure. Like databases or message queues, it should be boring, reliable, and invisible. The goal is for developers building agents to not think about memory at all — just call memory.save() and memory.query() and trust it works.
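Here is roughly what that two-call surface could look like from the caller's side. The signatures are my guess at a minimal shape, not Engram's actual API, and InMemoryMemory is a toy backend that ranks by word overlap so the example runs without any dependencies.

```python
from typing import Protocol


class Memory(Protocol):
    """The whole surface an agent developer would need to know about."""
    def save(self, text: str) -> None: ...
    def query(self, question: str, limit: int = 5) -> list[str]: ...


class InMemoryMemory:
    """Toy backend: ranks saved texts by word overlap with the question."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def save(self, text: str) -> None:
        self._items.append(text)

    def query(self, question: str, limit: int = 5) -> list[str]:
        words = set(question.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self._items]
        # Drop items with no overlap; sorted() is stable, so ties keep save order.
        ranked = sorted((s for s in scored if s[0] > 0),
                        key=lambda s: s[0], reverse=True)
        return [t for _, t in ranked[:limit]]


def build_prompt(memory: Memory, question: str) -> str:
    """Agent code depends only on the Memory protocol, not on a backend."""
    return "\n".join(memory.query(question) + [question])
```

Because build_prompt only sees the protocol, the backend can change from a list to vectors to something else entirely without touching agent code.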

That's what I'm building. Still a long way to go.


Engram is open source. Try the live demo or read the project page.