How to Build Claude Mem: The Production-Grade Blueprint
A deep dive into building Claude Mem. Full tech stack breakdown, implementation code, and scaling strategies for enterprise use cases.
Building Claude Mem, a persistent, episodic memory layer for Anthropic’s Claude, is the difference between a chatbot and a Senior AI Engineer. While open-source plugins like claude-mem are excellent for local development, enterprise environments require a stateless, scalable architecture that doesn’t rely on a local SQLite file.
This guide details how to build a Model Context Protocol (MCP) server that acts as a centralized memory brain for your entire engineering team, capable of handling thousands of concurrent sessions with sub-second retrieval latency.
🏗️ The Architecture
We are not just scripting a vector store wrapper; we are engineering a “Memory Sidecar” pattern using the Model Context Protocol (MCP).
The Problem
Standard RAG is static. It retrieves documentation but forgets decisions. If you tell Claude, “We use UUID v7 for primary keys,” standard RAG won’t capture that preference for the next session. Claude Mem solves this by implementing Episodic Memory: recording interactions, decisions, and outcomes in real-time, then compressing them into semantic “memories” that are injected into future contexts.
The Data Flow
- Interception: The MCP server exposes tools (remember_decision, search_history) that Claude uses to explicitly store context.
- Ingestion (Hot Path): Memories are written immediately to a write-ahead log (WAL) in Redis for instant availability (see the sketch after this list).
- Distillation (Cold Path): An async worker (Celery/Arq) picks up raw logs, uses a small LLM (Claude Haiku) to summarize/deduplicate them, and pushes vectors to PostgreSQL (pgvector).
- Retrieval: When a new session starts, the MCP server performs a hybrid search (Semantic + Time-decayed) to load the “Working Set” of memories.
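The hot-path write is the only step that must stay synchronous. Here is a minimal sketch of that ingestion step, assuming a Redis Stream named memory_wal acts as the WAL; the stream name and payload shape are illustrative and are not part of the server code later in this guide.

import json
from datetime import datetime, timezone

import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379/0")

async def append_to_wal(session_id: str, content: str):
    """Hot path: append the raw memory event to the WAL and return immediately.
    The cold-path worker consumes the stream and handles summarization/embedding."""
    event = {
        "session_id": session_id,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    # XADD is O(1); the distillation worker reads the stream with XREADGROUP
    return await r.xadd("memory_wal", {"payload": json.dumps(event)})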
🛠️ The Stack
- Interface: Python MCP SDK (Model Context Protocol)
- Storage: PostgreSQL + pgvector (The only DB you need for relational + vector)
- Queue: Redis (For async summarization buffers)
- LLM: Anthropic Claude 3.5 Haiku (For fast, cheap summarization)
💻 Implementation
Below is the core implementation of a Production Memory MCP Server. We use fastmcp (or the standard mcp library) with distinct read/write paths to prevent latency spikes during conversation.
Note: This code assumes you have a Postgres instance with the vector extension enabled.
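If you still need to bootstrap that schema, a minimal sketch follows; the column names mirror the queries in the server below, while the HNSW index settings are defaults you should tune for your workload.

import asyncio
import asyncpg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS memories (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    category    TEXT NOT NULL DEFAULT 'general',
    importance  INT  NOT NULL DEFAULT 1,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for fast approximate cosine search (requires pgvector >= 0.5.0)
CREATE INDEX IF NOT EXISTS memories_embedding_idx
    ON memories USING hnsw (embedding vector_cosine_ops);
"""

async def bootstrap():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    await conn.execute(DDL)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(bootstrap())

With the schema in place, here is the server itself: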
import os
import json
import asyncio
import logging
from datetime import datetime, timezone
from typing import List, Optional, Dict, Any
# Using the standard MCP Python SDK (2025 Standard)
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel, Field
import asyncpg
from anthropic import AsyncAnthropic
# Configuration
DB_DSN = os.getenv("DATABASE_URL", "postgresql://user:pass@localhost:5432/memdb")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
# Initialize MCP Server
mcp = FastMCP("EnterpriseClaudeMem")
anthropic = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)
# Logger Setup
logger = logging.getLogger("claude_mem")
logger.setLevel(logging.INFO)
class MemoryItem(BaseModel):
content: str
tags: List[str] = []
confidence_score: float = Field(default=1.0, ge=0.0, le=1.0)
async def get_db_pool():
"""Singleton DB Pool with robust error handling"""
if not hasattr(get_db_pool, "_pool"):
try:
get_db_pool._pool = await asyncpg.create_pool(
DB_DSN,
min_size=5,
max_size=20,
command_timeout=10
)
except Exception as e:
logger.critical(f"Failed to connect to DB: {e}")
raise RuntimeError("Database connection failure")
return get_db_pool._pool
async def generate_embedding(text: str) -> List[float]:
"""
Mock embedding generation for brevity.
In production, use Bedrock or OpenAI embeddings.
"""
# await client.embeddings.create(...)
return [0.1] * 1536
@mcp.tool()
async def save_memory(content: str, category: str = "general", importance: int = 1) -> str:
"""
Saves a critical piece of information (decision, preference, or architectural choice)
to long-term memory.
Args:
content: The fact or decision to remember.
category: 'architecture', 'preference', 'bugfix', or 'business_rule'.
importance: 1-5 scale of how critical this memory is.
"""
pool = await get_db_pool()
embedding = await generate_embedding(content)
# We use a 'memories' table with HNSW index
query = """
INSERT INTO memories (content, category, importance, embedding, created_at)
VALUES ($1, $2, $3, $4, $5)
RETURNING id;
"""
try:
async with pool.acquire() as conn:
row_id = await conn.fetchval(
query,
content,
category,
importance,
str(embedding), # pgvector often takes string representation or list
datetime.now(timezone.utc)
)
# Trigger async summarization/consolidation here (e.g., via Redis queue)
# await redis.enqueue("consolidate_memory", row_id)
return f"Memory stored successfully [ID: {row_id}]. I will recall this in future sessions."
except Exception as e:
logger.error(f"Write failed: {e}")
return "System Error: Failed to write memory. Please try again."
@mcp.tool()
async def search_memory(query: str, limit: int = 5) -> str:
"""
Retrieves relevant memories based on semantic similarity.
Use this when you need context about past decisions or project rules.
"""
pool = await get_db_pool()
query_embedding = await generate_embedding(query)
    # Vector similarity with a 0.7 relevance floor; importance breaks ties
    # (extend the ORDER BY with a time-decay term for full hybrid ranking)
sql = """
SELECT content, category, created_at,
(1 - (embedding <=> $1)) as similarity
FROM memories
WHERE (1 - (embedding <=> $1)) > 0.7
ORDER BY similarity DESC, importance DESC
LIMIT $2;
"""
try:
async with pool.acquire() as conn:
rows = await conn.fetch(sql, str(query_embedding), limit)
if not rows:
return "No relevant memories found."
results = []
for r in rows:
age = (datetime.now(timezone.utc) - r['created_at']).days
results.append(f"[{r['category'].upper()}] {r['content']} (Age: {age} days)")
return "\n".join(results)
except Exception as e:
logger.error(f"Read failed: {e}")
return "System Error: Memory retrieval unavailable."
@mcp.resource("memory://recent_summary")
async def get_recent_summary() -> str:
"""
Exposes the 'Working Memory' - a summary of the last 24h of interactions.
This is auto-injected into Claude's context window by the MCP client.
"""
pool = await get_db_pool()
# Fetch high-importance recent items
rows = await pool.fetch("""
SELECT content FROM memories
WHERE created_at > NOW() - INTERVAL '24 hours'
AND importance >= 3
""")
return "\n".join([r['content'] for r in rows]) if rows else "No recent high-priority events."
if __name__ == "__main__":
# Run the MCP server
# In production, use: mcp run app:mcp --transport stdio
mcp.run()
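To sanity-check the server locally before wiring it into Claude Desktop or Claude Code, you can drive it with the MCP SDK’s stdio client. This is a rough smoke test, assuming the file above is saved as server.py.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test():
    # Spawn the memory server as a subprocess over stdio
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "save_memory",
                arguments={
                    "content": "We use UUID v7 for primary keys",
                    "category": "architecture",
                    "importance": 4,
                },
            )
            print(result.content)

if __name__ == "__main__":
    asyncio.run(smoke_test())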
⚠️ Production Pitfalls (The “Senior” Perspective)
When scaling this to an entire engineering org, here is what breaks:
1. The “Context Poisoning” Trap
Issue: If you simply SELECT * and dump memories into context, you will confuse the model with outdated information.
Fix: Implement Progressive Disclosure.
- Layer 1 (System Prompt): Inject only the “Project Manifesto” (High-level rules).
- Layer 2 (Tool Call): Claude must explicitly call search_memory when it lacks context.
- Layer 3 (Pruning): Run a nightly cron job using Claude Haiku to merge conflicting memories (e.g., “We use React” vs. “We migrated to Svelte”); a sketch of that job follows this list.
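A hedged sketch of that Layer 3 job, reusing the memories schema from earlier; the prompt wording and the “return IDs to delete” contract are illustrative choices, not a fixed API.

import asyncio
import json

import asyncpg
from anthropic import AsyncAnthropic

anthropic = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def consolidate_category(conn: asyncpg.Connection, category: str) -> None:
    rows = await conn.fetch(
        "SELECT id, content FROM memories WHERE category = $1 ORDER BY created_at LIMIT 200",
        category,
    )
    if len(rows) < 2:
        return
    numbered = "\n".join(f"{r['id']}: {r['content']}" for r in rows)
    resp = await anthropic.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "These are stored memories for one project. Identify duplicates and, "
                "where two memories conflict, keep only the most recent one. Return the "
                "IDs to delete as a JSON array of integers and nothing else.\n\n" + numbered
            ),
        }],
    )
    # In production, validate the model output before deleting anything
    stale_ids = json.loads(resp.content[0].text)
    if stale_ids:
        await conn.execute("DELETE FROM memories WHERE id = ANY($1::bigint[])", stale_ids)

async def nightly_prune():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    for category in ("architecture", "preference", "bugfix", "business_rule"):
        await consolidate_category(conn, category)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(nightly_prune())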
2. Latency Spikes
Issue: Generating embeddings on the write path (inside save_memory) adds 500ms+ latency.
Fix: Use the Outbox Pattern.
- Write the raw text to a raw_logs table in Postgres immediately.
- Return “Saved” to the user instantly.
- Have a background worker (CDC or polling) pick up raw_logs rows, generate embeddings, and move them to memories (see the worker sketch below).
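A polling variant of that worker might look like the following; the raw_logs columns (content, category, importance, processed) and the 2-second poll interval are assumptions made for the sketch.

import asyncio

import asyncpg

POLL_INTERVAL = 2.0  # seconds

async def embed(text: str) -> list[float]:
    # Placeholder: call your real embedding provider here (Bedrock, OpenAI, etc.)
    return [0.1] * 1536

async def drain_outbox(pool: asyncpg.Pool) -> None:
    async with pool.acquire() as conn:
        async with conn.transaction():
            # Lock a small batch so multiple workers can run concurrently
            rows = await conn.fetch("""
                SELECT id, content, category, importance
                FROM raw_logs
                WHERE processed = FALSE
                ORDER BY id
                LIMIT 50
                FOR UPDATE SKIP LOCKED
            """)
            for r in rows:
                vector = await embed(r["content"])
                await conn.execute(
                    "INSERT INTO memories (content, category, importance, embedding, created_at) "
                    "VALUES ($1, $2, $3, $4, now())",
                    r["content"], r["category"], r["importance"], str(vector),
                )
                await conn.execute(
                    "UPDATE raw_logs SET processed = TRUE WHERE id = $1", r["id"]
                )

async def run_worker():
    pool = await asyncpg.create_pool("postgresql://user:pass@localhost:5432/memdb")
    while True:
        await drain_outbox(pool)
        await asyncio.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    asyncio.run(run_worker())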
3. Multi-Tenancy Security
Issue: Standard vector stores are flat. One team’s “API Key” memory might leak to another team.
Fix: Row-Level Security (RLS) in Postgres.
- Add a project_id or tenant_id column to your vectors.
- Enforce it in the SQL WHERE clause (or, better, with an RLS policy as sketched below). Do not rely on application logic alone.
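A minimal sketch of that RLS setup, assuming a tenant_id column and a per-session app.tenant_id setting (both names are illustrative):

import asyncio

import asyncpg

RLS_SQL = """
ALTER TABLE memories ADD COLUMN IF NOT EXISTS tenant_id TEXT NOT NULL DEFAULT 'default';

ALTER TABLE memories ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policy to the table owner as well (superusers still bypass RLS)
ALTER TABLE memories FORCE ROW LEVEL SECURITY;

DROP POLICY IF EXISTS tenant_isolation ON memories;
CREATE POLICY tenant_isolation ON memories
    USING (tenant_id = current_setting('app.tenant_id', true));
"""

async def enable_rls():
    conn = await asyncpg.connect("postgresql://user:pass@localhost:5432/memdb")
    await conn.execute(RLS_SQL)
    await conn.close()

if __name__ == "__main__":
    asyncio.run(enable_rls())

The MCP server then sets the tenant on every checkout, for example by running SELECT set_config('app.tenant_id', $1, false) right after acquiring a connection, so the policy filters every query automatically.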
🚀 Final Verdict
For a solo developer, the claude-mem local plugin is sufficient.
But for an enterprise team, you must own the data. Building this MCP server on Postgres gives you:
- Observability: You can run SQL queries to see what your AI “knows”.
- Portability: You aren’t locked into a proprietary vector DB.
- Control: You can manually DELETE FROM memories when the AI hallucinates bad patterns.
If you are deploying this for a team of 50+ engineers, you need a managed ingestion pipeline. That is where I come in.