MOS: Self-Hosted LLM Memory with Node.js, pgvector, and Local Embeddings

Long-term context is where many large language model apps quietly fall apart. Memory balloons, prompts bloat, and every tweak risks breaking latency or cost budgets. At AI Tech Inspire, we spotted a small project that takes a refreshingly focused swing at the problem: a decoupled, self-hosted microservice called MOS (MemoryOS) that handles memory for LLM apps with minimal fuss.

What it is in 30 seconds

Architecture: Node.js + TypeScript (Express) backend for I/O; a separate Flask microservice for local embeddings; PostgreSQL with pgvector for vector search.
Embeddings: Local sentence-transformers model (all-MiniLM-L6-v2, 384-dim) to avoid API costs and keep data private.
Ranking: similarity_score = 1 / (1 + similarity_distance), then add a user-defined importance_score to compute combined_score.
Expiration: Optional expires_at per memory; SQL queries filter out expired entries at retrieval time.
Prompt compression: Simple /compress endpoint to merge memory text blocks and reduce prompt size.
Deployment: Fully containerized; a single docker compose up --build brings up DB, embedder, and API with auto-schema init.
Roadmap: Plans to improve text compression and add external authentication; currently no default auth.
Repo: github.com/dhiraj2105/mos

Why this might be the memory layer you actually deploy

Plenty of teams bolt on a RAG stack and immediately run into trade-offs: third-party embedding costs, chat transcripts that never die, and monoliths where CPU-bound embedding jobs choke I/O-bound APIs. MOS addresses those pain points with a few pragmatic choices:

Self-hosted: Keep sensitive text off external APIs and avoid surprise bills.
Decoupled services: I/O and ML heavy lifting are split, making it easier to scale or swap components independently.
Postgres-first: Using pgvector means memory lives where your transactional data might already be—no extra infra like a separate vector DB unless you want one.
Lightweight ranking: Simple scoring + importance lets teams encode domain knowledge without building a full ranking pipeline on day one.

For developers who want a focused memory substrate rather than a full framework, the design goal here is clear: do one thing, do it simply, and run it locally.

Under the hood: simple by design

The backend is a TypeScript Express app that exposes endpoints for storing and retrieving memories. Vector search runs inside PostgreSQL with the pgvector extension, configured for 384-dim embeddings produced by the Flask microservice. The embedder runs all-MiniLM-L6-v2 from Hugging Face locally, which is fast on CPU, small, and cheap to serve. If you need stronger semantic fidelity, you could later swap in a larger sentence-transformers model.
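
To make the retrieval path concrete, here is a rough sketch of how a parameterized pgvector query could be built in TypeScript. The memories table, its column names, and the choice of the L2 distance operator (<->) are assumptions for illustration; the actual MOS schema and distance operator may differ.

```typescript
// Hypothetical schema (names assumed, not taken from the MOS repo):
//   CREATE EXTENSION IF NOT EXISTS vector;
//   CREATE TABLE memories (
//     id BIGSERIAL PRIMARY KEY,
//     content TEXT NOT NULL,
//     embedding vector(384) NOT NULL,
//     importance_score REAL NOT NULL DEFAULT 0,
//     expires_at TIMESTAMPTZ
//   );

const EMBEDDING_DIM = 384; // output size of all-MiniLM-L6-v2

interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length !== EMBEDDING_DIM) {
    throw new Error(`expected ${EMBEDDING_DIM} dimensions, got ${embedding.length}`);
  }
  return `[${embedding.join(",")}]`;
}

// Build a nearest-neighbor query that skips expired rows at retrieval time.
function buildSearchQuery(embedding: number[], limit: number): SqlQuery {
  const text = `
    SELECT id, content, importance_score,
           embedding <-> $1::vector AS similarity_distance
    FROM memories
    WHERE expires_at IS NULL OR expires_at > now()
    ORDER BY embedding <-> $1::vector
    LIMIT $2`;
  return { text, values: [toVectorLiteral(embedding), limit] };
}
```

If the service talks to Postgres through node-postgres, an object of this { text, values } shape can be passed to client.query directly. Checking the vector length before querying gives a clear error when an embedder with a different output size is swapped in.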

Retrieval ranks candidates with a straightforward formula: compute vector distance, convert to a bounded similarity via 1 / (1 + distance), then add a per-item importance_score. This gives you a knob for domain-specific salience—e.g., system rules or safety policies can outrank casual chat history.
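
The formula above can be sketched in a few lines of TypeScript. The Candidate type and function names are illustrative, not taken from the MOS codebase; only the arithmetic follows the description.

```typescript
interface Candidate {
  id: string;
  similarityDistance: number; // non-negative distance returned by the vector search
  importanceScore: number;    // user-defined weight stored with the memory
}

// Maps a distance in [0, infinity) to a similarity in (0, 1].
function similarityScore(distance: number): number {
  return 1 / (1 + distance);
}

function combinedScore(c: Candidate): number {
  return similarityScore(c.similarityDistance) + c.importanceScore;
}

// Highest combined score first; the input array is left untouched.
function rankCandidates(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => combinedScore(b) - combinedScore(a));
}

const ranked = rankCandidates([
  { id: "casual-chat", similarityDistance: 0.25, importanceScore: 0 },
  { id: "safety-policy", similarityDistance: 1.0, importanceScore: 0.5 },
]);
// casual-chat:   1 / 1.25 + 0   = 0.8
// safety-policy: 1 / 2.0  + 0.5 = 1.0
// so safety-policy ranks first despite being farther from the query.
```

Note that the similarity term never exceeds 1, so any importance_score above 1 outranks every item with zero importance regardless of distance.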

Two additional touches stand out:

Expiration via expires_at, enforced at query time. Useful for time-sensitive tasks, rotating memory, or keeping costs predictable.
Compression via a /compress endpoint. Even a naive merge can pay off when you’re feeding recurring summaries into a GPT-class model with tight context limits.
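
For a sense of what a naive merge might involve, here is one possible sketch in TypeScript: normalize whitespace, drop empty and duplicate blocks, and join the rest. This is an assumption about what "merge" could mean, not the logic behind the actual /compress endpoint, and compressMemories is a hypothetical name.

```typescript
// A naive merge: normalize whitespace, drop empty and duplicate blocks
// (compared case-insensitively), and join what remains into one string.
function compressMemories(blocks: string[], separator = " "): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const block of blocks) {
    const normalized = block.replace(/\s+/g, " ").trim();
    if (normalized.length === 0) continue;
    const key = normalized.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(normalized);
  }
  return kept.join(separator);
}

const merged = compressMemories([
  "User prefers   dark mode.",
  "user prefers dark mode.",
  "",
  "Order #123 was refunded.",
]);
// merged === "User prefers dark mode. Order #123 was refunded."
```

Exact-match deduplication only catches verbatim repeats; paraphrased duplicates would need a semantic comparison on the embeddings.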

Key takeaway: MOS favors plain building blocks—local embeddings, pgvector, and simple ranking—over heavy abstractions.

How it compares to the usual suspects

If you’re already deep into LangChain or LlamaIndex, you’ve seen built-in memory components and integrations with vector stores like Pinecone, Weaviate, or Qdrant. Those are great when you need managed scale, hybrid search, or advanced filtering.

MOS is different in a few ways:

Single-purpose microservice rather than a full orchestration framework.
Postgres-native store via pgvector, which can simplify ops in stack-constrained environments.
Local-only embeddings by default. If you normally call cloud embedding APIs (e.g., OpenAI embeddings), this flips the cost/privacy equation.
DIY ranking: easy to reason about, easy to replace. Add MMR or reciprocal rank fusion later if you need diversity.
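
As an example of the kind of replacement the last point has in mind, here is a minimal sketch of maximal marginal relevance (MMR) in TypeScript. It is not part of MOS; the Scored type, the default lambda of 0.7, and the use of cosine similarity for redundancy are all assumptions for illustration.

```typescript
interface Scored {
  id: string;
  relevance: number;   // e.g. the combined_score from retrieval
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy MMR: repeatedly pick the item that maximizes
//   lambda * relevance - (1 - lambda) * (max similarity to anything already picked).
// Negative similarities are floored at 0 in this sketch.
function mmr(items: Scored[], k: number, lambda = 0.7): Scored[] {
  const remaining = [...items];
  const selected: Scored[] = [];
  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      let redundancy = 0;
      for (const s of selected) {
        redundancy = Math.max(redundancy, cosine(remaining[i].embedding, s.embedding));
      }
      const score = lambda * remaining[i].relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    }
    selected.push(remaining.splice(bestIndex, 1)[0]);
  }
  return selected;
}

const picked = mmr(
  [
    { id: "a", relevance: 0.9, embedding: [1, 0] },
    { id: "b", relevance: 0.85, embedding: [1, 0] }, // near-duplicate of "a"
    { id: "c", relevance: 0.6, embedding: [0, 1] },
  ],
  2,
);
// Round 1 picks "a" (0.63). Round 2: "b" scores 0.595 - 0.3 = 0.295,
// "c" scores 0.42 - 0 = 0.42, so the result is ["a", "c"].
```

Lowering lambda pushes harder toward diversity; at lambda = 1 the function degenerates to a plain sort by relevance.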

Think of MOS as a tight, hackable memory substrate that you can drop behind any agent or chat service—especially when you don’t want another SaaS dependency.

Developer scenarios worth trying

Support chatbot with decaying context: Store interactions with expires_at so ephemeral chatter fades while high-importance troubleshooting notes stay sticky.
On-prem LLM assistant: Keep everything local for compliance; sync memory to your existing Postgres backups. Good fit for healthcare, finance, or private R&D.
Agent workspace memory: Track tasks, tools used, and outcomes; mark critical incidents with high importance_score to bias retrieval.
Edge or offline prototyping: Run embeddings on CPU; once the images and model weights have been pulled, no internet or API keys are needed. Perfect for demos and internal sandboxes.
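
For the decaying-context scenario, a small helper like the one below could assign a time-to-live when memories are created. The importance_score and expires_at fields come from the project description; the content field, the NewMemory shape, and the makeMemory helper are hypothetical.

```typescript
interface NewMemory {
  content: string;
  importance_score: number;
  expires_at: string | null; // ISO 8601 timestamp, or null for no expiry
}

const HOUR_MS = 60 * 60 * 1000;

// ttlHours = null means the memory never expires.
function makeMemory(
  content: string,
  importance: number,
  ttlHours: number | null,
  createdAt: Date = new Date(),
): NewMemory {
  const expiresAt =
    ttlHours === null
      ? null
      : new Date(createdAt.getTime() + ttlHours * HOUR_MS).toISOString();
  return { content, importance_score: importance, expires_at: expiresAt };
}

const createdAt = new Date("2025-01-01T00:00:00.000Z");

// Ephemeral chatter fades after a day.
const chatter = makeMemory("User said thanks.", 0, 24, createdAt);
// chatter.expires_at === "2025-01-02T00:00:00.000Z"

// A troubleshooting note stays sticky: high importance, no expiry.
const stickyNote = makeMemory("Router reset fixed the outage.", 0.9, null, createdAt);
// stickyNote.expires_at === null
```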

Trade-offs and gotchas to keep in mind

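Ranking calibration: Adding importance_score directly to the similarity term mixes scales: similarity is bounded in (0, 1], while importance is whatever the caller supplies, so a large importance value can swamp relevance entirely. Consider normalizing importance (e.g., 0–1) and trying multiplicative blends or RRF/MMR for better balance.
Model choice: all-MiniLM-L6-v2 is fast and small but not state-of-the-art, and by default it truncates input longer than 256 word pieces. If you need long documents or multilingual quality, plan for a model upgrade or add a re-ranker.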
Auth: There’s no default authentication yet. Put it behind a gateway, or add API keys/JWT before exposing it to the world.
Scaling: pgvector supports approximate nearest-neighbor indexes (IVFFlat and HNSW) and can handle plenty, but for very large collections or hybrid search you may eventually move to Weaviate or Pinecone, or build a custom index with FAISS.
Latency isolation: The split between API and embedder helps; ensure network hops and backpressure are handled, and cache embeddings where possible.
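
To illustrate the ranking-calibration point, here are two alternative blends sketched in TypeScript. Neither is what MOS does today; the function names and the default weight and boost values are arbitrary assumptions to be tuned against real data.

```typescript
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum: both inputs are clamped to [0, 1], so the result is too.
function weightedBlend(similarity: number, importance: number, weight = 0.3): number {
  const w = clamp01(weight);
  return (1 - w) * clamp01(similarity) + w * clamp01(importance);
}

// Multiplicative boost: importance scales similarity by a factor in
// [1, 1 + boost], so an irrelevant memory cannot win on importance alone.
function multiplicativeBlend(similarity: number, importance: number, boost = 0.5): number {
  return clamp01(similarity) * (1 + boost * clamp01(importance));
}

// weightedBlend(0.8, 0)       = 0.7 * 0.8           = 0.56
// weightedBlend(0.5, 1)       = 0.7 * 0.5 + 0.3 * 1 = 0.65
// multiplicativeBlend(0.5, 1) = 0.5 * 1.5           = 0.75
// multiplicativeBlend(0, 1)   = 0
```

The multiplicative form keeps relevance as a gate, which tends to be the safer default when importance values come from users.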

Practical ideas for the roadmap

Authentication middle layer (reverse proxy with JWT or signed headers, or service-to-service mTLS).
Diversified ranking: MMR, recency decay, or task-aware scoring profiles.
Better compression: semantic merge, graph clustering, or centroid summaries that preserve recall while shrinking tokens.
Observability: per-request latency, hit/miss metrics, and memory-eviction dashboards.
Evaluation harness: retrieval precision/recall benchmarks to compare models and scoring tweaks.
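
An evaluation harness can start very small. The sketch below computes precision and recall at k for a single query in TypeScript; the function name and the memory IDs are made up for illustration, and it assumes the retrieved IDs are unique.

```typescript
interface RetrievalMetrics {
  precision: number;
  recall: number;
}

// Precision here divides by the number of results actually returned
// (at most k); some definitions divide by k itself.
function precisionRecallAtK(
  retrievedIds: string[],
  relevantIds: string[],
  k: number,
): RetrievalMetrics {
  const firstK = retrievedIds.slice(0, k);
  const relevant = new Set(relevantIds);
  const hits = firstK.filter((id) => relevant.has(id)).length;
  return {
    precision: firstK.length === 0 ? 0 : hits / firstK.length,
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
  };
}

const metrics = precisionRecallAtK(["m1", "m7", "m3", "m9"], ["m1", "m3", "m4"], 3);
// Top 3 are m1, m7, m3; two of them (m1, m3) are relevant.
// precision = 2 / 3, recall = 2 / 3
```

Averaging these numbers over a fixed set of labeled queries gives a baseline to compare embedding models and scoring tweaks against.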

Quickstart: kick the tires locally

Prereqs: Docker and Docker Compose.
Clone: git clone https://github.com/dhiraj2105/mos && cd mos
Run: docker compose up --build
Test a simple endpoint (examples in the repo). When you’re done, stop with Ctrl+C.

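Because the embedder, API, and database each run in their own container, you can scale or replace them independently. Swapping the embedding model means re-embedding stored memories, and a model with a different output size also requires changing the 384-dim vector column. If you outgrow CPU embeddings, you can explore GPU-backed containers with CUDA acceleration.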

Why this matters now

As LLM apps mature, persistence strategies are becoming as important as prompting tricks. Memory that’s cheap, local, and predictable lets teams experiment with higher-level behaviors—agents that learn, assistants that truly remember, and workflows that evolve. MOS won’t replace full-text search or industrial vector databases, but as a minimal, hackable microservice, it fills a sweet spot that many teams hit during the “make it actually useful” phase.

If your RAG stack feels heavier than your app, consider flipping the script: start with a small, self-hosted memory layer and add complexity only when your metrics demand it.

AI Tech Inspire will be watching how projects like MOS evolve—especially around ranking, compression, and security. For now, it’s a tidy way to give your LLMs a memory that behaves more like a service than a science project.
