How to Build Claude Mem: The Production-Grade Blueprint
A deep dive into building Claude Mem. Full tech stack breakdown, implementation code, and scaling strategies for enterprise use cases.
Building Claude Mem, a persistent, episodic memory layer for Anthropic’s Claude, is the difference between a chatbot and a Senior AI Engineer. While open-source plugins like claude-mem are excellent for local development, enterprise environments require a stateless, scalable architecture that doesn’t rely on a local SQLite file.
This guide details how to build a Model Context Protocol (MCP) server that acts as a centralized memory brain for your entire engineering team, capable of handling thousands of concurrent sessions with sub-second retrieval latency.
🏗️ The Architecture
We are not just scripting a vector store wrapper; we are engineering a “Memory Sidecar” pattern using the Model Context Protocol (MCP).
The Problem
Standard RAG is static. It retrieves documentation but forgets decisions. If you tell Claude, “We use UUID v7 for primary keys,” standard RAG won’t capture that preference for the next session. Claude Mem solves this by implementing Episodic Memory: recording interactions, decisions, and outcomes in real-time, then compressing them into semantic “memories” that are injected into future contexts.
The Data Flow
- Interception: The MCP server exposes tools (remember_decision, search_history) that Claude uses to explicitly store context.
- Ingestion (Hot Path): Memories are written immediately to a write-ahead log (WAL) in Redis for instant availability (see the sketch after this list).
- Distillation (Cold Path): An async worker (Celery/Arq) picks up raw logs, uses a small LLM (Claude Haiku) to summarize/deduplicate them, and pushes vectors to PostgreSQL (pgvector).
- Retrieval: When a new session starts, the MCP server performs a hybrid search (Semantic + Time-decayed) to load the “Working Set” of memories.
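The hot-path write is the only step that must stay synchronous. Here is a minimal sketch of that ingestion step, assuming a Redis Stream named memory_wal acts as the WAL; the stream name and payload shape are illustrative and are not part of the server code later in this guide.

import json
from datetime import datetime, timezone

import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379/0")

async def append_to_wal(session_id: str, content: str):
    """Hot path: append the raw memory event to the WAL and return immediately.
    The cold-path worker consumes the stream and handles summarization/embedding."""
    event = {
        "session_id": session_id,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    # XADD is O(1); the distillation worker reads the stream with XREADGROUP
    return await r.xadd("memory_wal", {"payload": json.dumps(event)})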
🛠️ The Stack
- Interface: Python MCP SDK (Model Context Protocol)
- Storage: PostgreSQL + pgvector (The only DB you need for relational + vector)
- Queue: Redis (For async summarization buffers)
- LLM: Anthropic Claude 3.5 Haiku (For fast, cheap summarization)
💻 Implementation
Below is the core implementation of a Production Memory MCP Server. We use fastmcp (or the standard mcp library) with distinct read/write paths to prevent latency spikes during conversation.
Note: This code assumes you have a Postgres instance with the vector extension enabled.
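If you still need to bootstrap that schema, a minimal sketch follows; the column names mirror the queries in the server below, while the HNSW index settings are defaults you should tune for your workload.

import asyncio
import asyncpg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS memories (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    category    TEXT NOT NULL DEFAULT 'general',
    importance  INT  NOT NULL DEFAULT 1,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for fast approximate cosine search (requires pgvector >= 0.5.0)
CREATE INDEX IF NOT EXISTS memories_embedding_idx
    ON memories USING hnsw (embedding vector_cosine_ops);
"""

async def bootstrap():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    await conn.execute(DDL)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(bootstrap())

With the schema in place, here is the server itself: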
import os
import json
import asyncio
import logging
from datetime import datetime, timezone
from typing import List, Optional, Dict, Any
# Using the standard MCP Python SDK (2025 Standard)
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel, Field
import asyncpg
from anthropic import AsyncAnthropic
# Configuration
DB_DSN = os.getenv("DATABASE_URL", "postgresql://user:pass@localhost:5432/memdb")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
# Initialize MCP Server
mcp = FastMCP("EnterpriseClaudeMem")
anthropic = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)
# Logger Setup
logger = logging.getLogger("claude_mem")
logger.setLevel(logging.INFO)
class MemoryItem(BaseModel):
content: str
tags: List[str] = []
confidence_score: float = Field(default=1.0, ge=0.0, le=1.0)
async def get_db_pool():
"""Singleton DB Pool with robust error handling"""
if not hasattr(get_db_pool, "_pool"):
try:
get_db_pool._pool = await asyncpg.create_pool(
DB_DSN,
min_size=5,
max_size=20,
command_timeout=10
)
except Exception as e:
logger.critical(f"Failed to connect to DB: {e}")
raise RuntimeError("Database connection failure")
return get_db_pool._pool
async def generate_embedding(text: str) -> List[float]:
"""
Mock embedding generation for brevity.
In production, use Bedrock or OpenAI embeddings.
"""
# await client.embeddings.create(...)
return [0.1] * 1536
@mcp.tool()
async def save_memory(content: str, category: str = "general", importance: int = 1) -> str:
"""
Saves a critical piece of information (decision, preference, or architectural choice)
to long-term memory.
Args:
content: The fact or decision to remember.
category: 'architecture', 'preference', 'bugfix', or 'business_rule'.
importance: 1-5 scale of how critical this memory is.
"""
pool = await get_db_pool()
embedding = await generate_embedding(content)
# We use a 'memories' table with HNSW index
query = """
INSERT INTO memories (content, category, importance, embedding, created_at)
VALUES ($1, $2, $3, $4, $5)
RETURNING id;
"""
try:
async with pool.acquire() as conn:
row_id = await conn.fetchval(
query,
content,
category,
importance,
str(embedding), # pgvector often takes string representation or list
datetime.now(timezone.utc)
)
# Trigger async summarization/consolidation here (e.g., via Redis queue)
# await redis.enqueue("consolidate_memory", row_id)
return f"Memory stored successfully [ID: {row_id}]. I will recall this in future sessions."
except Exception as e:
logger.error(f"Write failed: {e}")
return "System Error: Failed to write memory. Please try again."
@mcp.tool()
async def search_memory(query: str, limit: int = 5) -> str:
"""
Retrieves relevant memories based on semantic similarity.
Use this when you need context about past decisions or project rules.
"""
pool = await get_db_pool()
query_embedding = await generate_embedding(query)
    # Vector similarity with a 0.7 relevance floor; importance breaks ties
    # (extend the ORDER BY with a time-decay term for full hybrid ranking)
sql = """
SELECT content, category, created_at,
(1 - (embedding <=> $1)) as similarity
FROM memories
WHERE (1 - (embedding <=> $1)) > 0.7
ORDER BY similarity DESC, importance DESC
LIMIT $2;
"""
try:
async with pool.acquire() as conn:
rows = await conn.fetch(sql, str(query_embedding), limit)
if not rows:
return "No relevant memories found."
results = []
for r in rows:
age = (datetime.now(timezone.utc) - r['created_at']).days
results.append(f"[{r['category'].upper()}] {r['content']} (Age: {age} days)")
return "\n".join(results)
except Exception as e:
logger.error(f"Read failed: {e}")
return "System Error: Memory retrieval unavailable."
@mcp.resource("memory://recent_summary")
async def get_recent_summary() -> str:
"""
Exposes the 'Working Memory' - a summary of the last 24h of interactions.
This is auto-injected into Claude's context window by the MCP client.
"""
pool = await get_db_pool()
# Fetch high-importance recent items
rows = await pool.fetch("""
SELECT content FROM memories
WHERE created_at > NOW() - INTERVAL '24 hours'
AND importance >= 3
""")
return "\n".join([r['content'] for r in rows]) if rows else "No recent high-priority events."
if __name__ == "__main__":
# Run the MCP server
# In production, use: mcp run app:mcp --transport stdio
mcp.run()
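To sanity-check the server locally before wiring it into Claude Desktop or Claude Code, you can drive it with the MCP SDK’s stdio client. This is a rough smoke test, assuming the file above is saved as server.py.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test():
    # Spawn the memory server as a subprocess over stdio
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "save_memory",
                arguments={
                    "content": "We use UUID v7 for primary keys",
                    "category": "architecture",
                    "importance": 4,
                },
            )
            print(result.content)

if __name__ == "__main__":
    asyncio.run(smoke_test())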
⚠️ Production Pitfalls (The “Senior” Perspective)
When scaling this to an entire engineering org, here is what breaks:
1. The “Context Poisoning” Trap
Issue: If you simply SELECT * and dump memories into context, you will confuse the model with outdated information.
Fix: Implement Progressive Disclosure.
- Layer 1 (System Prompt): Inject only the “Project Manifesto” (High-level rules).
- Layer 2 (Tool Call): Claude must explicitly call search_memory when it lacks context.
- Layer 3 (Pruning): Run a nightly cron job using Claude Haiku to merge conflicting memories (e.g., “We use React” vs. “We migrated to Svelte”); a sketch of that job follows this list.
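A hedged sketch of that Layer 3 job, reusing the memories schema from earlier; the prompt wording and the “return IDs to delete” contract are illustrative choices, not a fixed API.

import asyncio
import json

import asyncpg
from anthropic import AsyncAnthropic

anthropic = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def consolidate_category(conn: asyncpg.Connection, category: str) -> None:
    rows = await conn.fetch(
        "SELECT id, content FROM memories WHERE category = $1 ORDER BY created_at LIMIT 200",
        category,
    )
    if len(rows) < 2:
        return
    numbered = "\n".join(f"{r['id']}: {r['content']}" for r in rows)
    resp = await anthropic.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "These are stored memories for one project. Identify duplicates and, "
                "where two memories conflict, keep only the most recent one. Return the "
                "IDs to delete as a JSON array of integers and nothing else.\n\n" + numbered
            ),
        }],
    )
    # In production, validate the model output before deleting anything
    stale_ids = json.loads(resp.content[0].text)
    if stale_ids:
        await conn.execute("DELETE FROM memories WHERE id = ANY($1::bigint[])", stale_ids)

async def nightly_prune():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    for category in ("architecture", "preference", "bugfix", "business_rule"):
        await consolidate_category(conn, category)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(nightly_prune())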
2. Latency Spikes
Issue: Generating embeddings on the write path (inside save_memory) adds 500ms+ latency.
Fix: Use the Outbox Pattern.
- Write the raw text to a raw_logs table in Postgres immediately.
- Return “Saved” to the user instantly.
- Have a background worker (CDC or polling) pick up raw_logs rows, generate embeddings, and move them to memories (see the worker sketch below).
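A polling variant of that worker might look like the following; the raw_logs columns (content, category, importance, processed) and the 2-second poll interval are assumptions made for the sketch.

import asyncio

import asyncpg

POLL_INTERVAL = 2.0  # seconds

async def embed(text: str) -> list[float]:
    # Placeholder: call your real embedding provider here (Bedrock, OpenAI, etc.)
    return [0.1] * 1536

async def drain_outbox(pool: asyncpg.Pool) -> None:
    async with pool.acquire() as conn:
        async with conn.transaction():
            # Lock a small batch so multiple workers can run concurrently
            rows = await conn.fetch("""
                SELECT id, content, category, importance
                FROM raw_logs
                WHERE processed = FALSE
                ORDER BY id
                LIMIT 50
                FOR UPDATE SKIP LOCKED
            """)
            for r in rows:
                vector = await embed(r["content"])
                await conn.execute(
                    "INSERT INTO memories (content, category, importance, embedding, created_at) "
                    "VALUES ($1, $2, $3, $4, now())",
                    r["content"], r["category"], r["importance"], str(vector),
                )
                await conn.execute(
                    "UPDATE raw_logs SET processed = TRUE WHERE id = $1", r["id"]
                )

async def run_worker():
    pool = await asyncpg.create_pool("postgresql://user:pass@localhost:5432/memdb")
    while True:
        await drain_outbox(pool)
        await asyncio.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    asyncio.run(run_worker())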
3. Multi-Tenancy Security
Issue: Standard vector stores are flat. One team’s “API Key” memory might leak to another team.
Fix: Row-Level Security (RLS) in Postgres.
- Add a project_id or tenant_id column to your vectors.
- Enforce it in the SQL WHERE clause (or, better, with an RLS policy as sketched below). Do not rely on application logic alone.
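A minimal sketch of that RLS setup, assuming a tenant_id column and a per-session app.tenant_id setting (both names are illustrative):

import asyncio

import asyncpg

RLS_SQL = """
ALTER TABLE memories ADD COLUMN IF NOT EXISTS tenant_id TEXT NOT NULL DEFAULT 'default';

ALTER TABLE memories ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policy to the table owner as well (superusers still bypass RLS)
ALTER TABLE memories FORCE ROW LEVEL SECURITY;

DROP POLICY IF EXISTS tenant_isolation ON memories;
CREATE POLICY tenant_isolation ON memories
    USING (tenant_id = current_setting('app.tenant_id', true));
"""

async def enable_rls():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    await conn.execute(RLS_SQL)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(enable_rls())

The MCP server then sets the tenant on every checkout, for example by running SELECT set_config('app.tenant_id', $1, false) right after acquiring a connection, so the policy filters every query automatically.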
🚀 Final Verdict
For a solo developer, the claude-mem local plugin is sufficient.
But for an enterprise team, you must own the data. Building this MCP server on Postgres gives you:
- Observability: You can run SQL queries to see what your AI “knows”.
- Portability: You aren’t locked into a proprietary vector DB.
- Control: You can manually DELETE FROM memories when the AI hallucinates bad patterns.
If you are deploying this for a team of 50+ engineers, you need a managed ingestion pipeline. That is where I come in.