# How AI Memory Actually Works
In my first post, I talked about why AI memory matters. This one gets into the details — how Edward's memory system actually works under the hood.
## The Embedding Model
Every memory starts as text and gets converted to a 384-dimensional vector using all-MiniLM-L6-v2 from sentence-transformers. This is a small, fast model that runs locally — no API calls, no latency. The vectors get stored in PostgreSQL using the pgvector extension.
I chose a local embedding model deliberately. Memory operations happen on every conversation turn, sometimes multiple times. Sending each one to an external API would add latency and cost that compounds fast.
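To make the storage side concrete, here's a minimal sketch. The `memories` table name and column names are my assumptions, not necessarily Edward's actual schema; the embedding call shown in comments uses the standard sentence-transformers API.

```python
# Sketch of the embedding + storage step. The `memories` table and its
# columns are hypothetical; pgvector accepts vectors as bracketed literals.

def to_pgvector_literal(vec):
    """Format a list of floats as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

# The embedding itself (requires the sentence-transformers package):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim output
#   vec = model.encode("user's dog is named Luna").tolist()
#
# Stored with an INSERT along these lines (psycopg shown as one option):
#
#   INSERT INTO memories (content, embedding) VALUES (%s, %s::vector)
#
# where the second parameter is to_pgvector_literal(vec).
```

Because the model runs in-process, the encode call is the only expensive step, and it's milliseconds on CPU for short texts.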
## Hybrid Retrieval
Vector similarity alone isn't enough. If you ask Edward “what's my dog's name?”, the vector for that question might be close to memories about pets in general, but miss the specific memory that says “user's dog is named Luna” because the word “Luna” doesn't have a strong vector signal.
That's why Edward uses hybrid retrieval: 70% vector similarity + 30% BM25 keyword matching. Vector search handles semantic similarity (“pet” matches “dog”), while BM25 handles exact terms (“Luna” matches “Luna”). The weighted combination consistently outperforms either approach alone.
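The 70/30 blend is easy to sketch. Everything below is illustrative: the BM25 implementation is a textbook minimal version over an in-memory corpus, and normalizing BM25 to [0, 1] before blending is my assumption about how the two scales are reconciled.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal BM25 over a tiny tokenized corpus (illustration only)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

def hybrid_score(vec_sim, bm25, max_bm25):
    """Edward's 70/30 blend; BM25 normalized to [0, 1] first (an assumption)."""
    keyword = bm25 / max_bm25 if max_bm25 else 0.0
    return 0.7 * vec_sim + 0.3 * keyword
```

A memory containing the literal token "Luna" gets a strong BM25 contribution even when its embedding sits far from the query, which is exactly the failure mode hybrid retrieval papers over.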
## Extraction Pipeline
Memories don't appear out of nowhere. After each conversation turn, Edward runs an extraction step using Claude Haiku. The model receives the conversation and identifies information worth remembering, classified into four types:
- **fact** — objective information (“works at Acme Corp”)
- **preference** — likes and dislikes (“prefers Python over JS”)
- **context** — situational info (“traveling next week”)
- **instruction** — behavioral directives (“always use metric units”)
The extraction model is cheap and fast. Using Haiku instead of a larger model keeps the per-turn cost negligible while still catching the important stuff.
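The extraction step boils down to prompting the model and validating what comes back. This is a sketch under my own assumptions: the prompt wording and the JSON output shape are invented for illustration, not Edward's actual prompt.

```python
import json

# Hypothetical output schema for the Haiku extraction call: a JSON array of
# {"type": ..., "content": ...} objects. The real prompt may differ.
VALID_TYPES = {"fact", "preference", "context", "instruction"}

def parse_extraction(raw: str) -> list[dict]:
    """Parse the model's JSON output, dropping malformed or unknown-type items."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # a bad completion yields no memories rather than an error
    return [m for m in items
            if isinstance(m, dict)
            and m.get("type") in VALID_TYPES
            and m.get("content")]
```

Validating against the four known types matters in practice: small models occasionally invent a fifth category, and silently dropping those is safer than storing them.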
## Deep Retrieval
Basic retrieval runs a single query against the memory store. But some conversations need more context — especially longer ones where the topic has drifted from the original question.
Edward's deep retrieval system kicks in when the message is short (likely a follow-up) or the turn count exceeds 3. It runs 4 parallel queries: the original message plus 3 Haiku-rewritten variants that reframe the question. The results are deduplicated and merged within a context budget of 8,000 characters.
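The trigger condition and the budget-constrained merge are both simple enough to sketch. The 40-character cutoff for "short message" is an assumed value (the post only says "short"), and memories are treated as plain strings here.

```python
def should_deep_retrieve(message: str, turn_count: int,
                         short_threshold: int = 40) -> bool:
    """Trigger deep retrieval for short messages (likely follow-ups) or once
    the conversation passes 3 turns. The 40-char cutoff is an assumption."""
    return len(message) < short_threshold or turn_count > 3

def merge_within_budget(result_lists, budget=8000):
    """Deduplicate results from the parallel query variants (in rank order)
    and keep adding memories until the character budget is exhausted."""
    seen, merged, used = set(), [], 0
    for results in result_lists:          # one list per query variant
        for memory in results:
            if memory in seen:
                continue
            if used + len(memory) > budget:
                return merged
            seen.add(memory)
            merged.append(memory)
            used += len(memory)
    return merged
```

Running the three Haiku rewrites in parallel with the original query means deep retrieval costs roughly one extra round-trip of latency, not four.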
## Consolidation
Over time, memories accumulate. Some become stale. Others overlap. The consolidation service runs hourly in the background, using Haiku to:
- Cluster related memories into connection groups
- Flag memories that might be outdated
- Identify contradictions between memories
It's conceptually similar to how your brain consolidates memories during sleep — reorganizing, pruning, and strengthening connections.
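The clustering half of consolidation can be sketched without the LLM in the loop. Below, a cheap token-overlap score stands in for Haiku's relatedness judgment purely for illustration; the greedy single-link grouping is my assumption about how "connection groups" form, not a description of Edward's actual algorithm.

```python
def token_jaccard(a: str, b: str) -> float:
    """Cheap stand-in for relatedness; the real system asks Haiku instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cluster_memories(memories, threshold=0.3):
    """Greedy single-link clustering of memories into connection groups."""
    clusters = []
    for mem in memories:
        for cluster in clusters:
            if any(token_jaccard(mem, other) >= threshold for other in cluster):
                cluster.append(mem)
                break
        else:
            clusters.append([mem])  # no related cluster found: start a new one
    return clusters
```

Once memories are grouped, the staleness and contradiction checks only need to run within each group, which keeps the hourly job cheap even as the store grows.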
## The Full Picture
Put it together and you get a four-stage memory lifecycle: extract (after each turn) → retrieve (before each turn) → reflect (after each turn, async) → consolidate (hourly background). Each stage uses the cheapest model that gets the job done, keeping the system fast and affordable to run.
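The per-turn portion of that lifecycle can be sketched as an async loop. The stage functions here are stubs standing in for the components described above, and running extraction and reflection as fire-and-forget tasks is my reading of "after each turn, async".

```python
import asyncio

# Stage stubs; each would call the real component described in this post.
async def retrieve(message): return ["relevant memories"]
async def respond(message, memories): return "reply"
async def extract(message, reply): pass   # Haiku extraction, off critical path
async def reflect(message, reply): pass   # async post-turn reflection

async def handle_turn(message: str) -> str:
    memories = await retrieve(message)        # before the turn
    reply = await respond(message, memories)
    # extraction and reflection happen after the reply is already sent,
    # so they add no user-visible latency
    asyncio.create_task(extract(message, reply))
    asyncio.create_task(reflect(message, reply))
    return reply
```

Consolidation isn't in this loop at all; it runs as an independent hourly job, which is what keeps the per-turn path down to one retrieval and one response call.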