Long-term context is where many large language model apps quietly fall apart. Memory balloons, prompts bloat, and every tweak risks breaking latency or cost budgets. At AI Tech Inspire, we spotted a small project that takes a refreshingly focused swing at the problem: a decoupled, self-hosted microservice called MOS (MemoryOS) that handles memory for LLM apps with minimal fuss.
What it is in 30 seconds
- Architecture: Node.js + TypeScript (Express) backend for I/O; a separate Flask microservice for local embeddings; PostgreSQL with pgvector for vector search.
- Embeddings: Local sentence-transformers model (all-MiniLM-L6-v2, 384-dim) to avoid API costs and keep data private.
- Ranking: similarity_score = 1 / (1 + similarity_distance), then add a user-defined importance_score to compute combined_score.
- Expiration: Optional expires_at per memory; SQL queries filter out expired entries at retrieval time.
- Prompt compression: Simple /compress endpoint to merge memory text blocks and reduce prompt size.
- Deployment: Fully containerized; a single docker compose up --build brings up DB, embedder, and API with auto-schema init.
- Roadmap: Plans to improve text compression and add external authentication; currently no default auth.
- Repo: github.com/dhiraj2105/mos
Why this might be the memory layer you actually deploy
Plenty of teams bolt on a RAG stack and immediately run into trade-offs: third-party embedding costs, chat transcripts that never die, and monoliths where CPU-bound embedding jobs choke I/O-bound APIs. MOS addresses those pain points with a few pragmatic choices:
- Self-hosted: Keep sensitive text off external APIs and avoid surprise bills.
- Decoupled services: I/O and ML heavy lifting are split, making it easier to scale or swap components independently.
- Postgres-first: Using pgvector means memory lives where your transactional data might already be, so there is no extra infra like a separate vector DB unless you want one.
- Lightweight ranking: Simple scoring + importance lets teams encode domain knowledge without building a full ranking pipeline on day one.
For developers who want a focused memory substrate rather than a full framework, the design goal here is clear: do one thing, do it simply, and run it locally.
Under the hood: simple by design
The backend is a TypeScript Express app that exposes endpoints for storing and retrieving memories. Vector search runs inside PostgreSQL with the pgvector extension, configured for 384-dim embeddings produced by the Flask microservice. The embedder runs all-MiniLM-L6-v2 from Hugging Face locally, which is fast on CPU, small, and cheap to serve. If you need stronger semantic fidelity, you could later swap in a larger sentence-transformers model.
Retrieval ranks candidates with a straightforward formula: compute vector distance, convert to a bounded similarity via 1 / (1 + distance), then add a per-item importance_score. This gives you a knob for domain-specific salience—e.g., system rules or safety policies can outrank casual chat history.
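To make the scoring concrete, here is a minimal Python sketch of the additive formula described above. The Candidate class, its field names, and the sample values are illustrative assumptions, not MOS's actual schema or code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A retrieved memory. Field names are illustrative, not MOS's schema."""
    text: str
    distance: float    # vector distance from the query (>= 0)
    importance: float  # user-defined importance_score


def similarity_score(distance: float) -> float:
    """Map a non-negative distance into (0, 1]; distance 0 gives 1.0."""
    return 1.0 / (1.0 + distance)


def combined_score(c: Candidate) -> float:
    """Additive blend from the article: similarity plus importance."""
    return similarity_score(c.distance) + c.importance


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates ordered by combined score, highest first."""
    return sorted(candidates, key=combined_score, reverse=True)


candidates = [
    Candidate("casual chat about lunch", distance=0.25, importance=0.0),
    Candidate("safety policy: never share credentials", distance=1.0, importance=0.5),
]
ranked = rank(candidates)
```

In this example the chat line scores 0.8 on similarity alone, while the policy scores 0.5 + 0.5 = 1.0, so the more distant but more important memory ranks first.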
Two additional touches stand out:
- Expiration via expires_at, enforced at query time. Useful for time-sensitive tasks, rotating memory, or keeping costs predictable.
- Compression via a /compress endpoint. Even a naive merge can pay off when you’re feeding recurring summaries into a GPT-class model with tight context limits.
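Both touches are easy to picture in a few lines of Python. The sketch below is an assumption-laden illustration: MOS applies the expiry filter in SQL, and the article does not say how /compress merges text, so is_live and compress here are hypothetical helpers showing one naive approach (filter by timestamp, then drop duplicate blocks and collapse whitespace).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_live(expires_at: Optional[datetime], now: datetime) -> bool:
    """A memory without expires_at never expires; otherwise it must lie in the future."""
    return expires_at is None or expires_at > now


def compress(blocks: list[str]) -> str:
    """Naive merge: collapse whitespace, drop empty and exact-duplicate blocks."""
    seen: set[str] = set()
    kept: list[str] = []
    for block in blocks:
        text = " ".join(block.split())  # collapse runs of whitespace
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return "\n".join(kept)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
memories = [
    ("User prefers dark mode.", None),
    ("Ticket 123 is   open.", now + timedelta(hours=1)),
    ("Promo code valid today.", now - timedelta(days=1)),  # already expired
    ("User prefers dark mode.", None),                     # duplicate
]
live = [text for text, expires_at in memories if is_live(expires_at, now)]
prompt_context = compress(live)
```

Here the expired promo note is filtered out, the duplicate preference is merged, and the prompt context shrinks to two lines.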
Key takeaway: MOS favors plain building blocks—local embeddings, pgvector, and simple ranking—over heavy abstractions.
How it compares to the usual suspects
If you’re already deep into LangChain or LlamaIndex, you’ve seen built-in memory components and integrations with vector stores like Pinecone, Weaviate, or Qdrant. Those are great when you need managed scale, hybrid search, or advanced filtering.
MOS is different in a few ways:
- Single-purpose microservice rather than a full orchestration framework.
- Postgres-native store via
pgvector, which can simplify ops in stack-constrained environments. - Local-only embeddings by default. If you normally call cloud embedding APIs (e.g., OpenAI embeddings), this flips the cost/privacy equation.
- DIY ranking: easy to reason about, easy to replace. Add
MMRor reciprocal rank fusion later if you need diversity.
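As one example of how replaceable the ranking is, reciprocal rank fusion (RRF) can stand in for the additive score: rank memories once by similarity and once by importance, then fuse the two lists. This is a generic sketch of RRF, not something MOS ships; the memory IDs are made up, and k = 60 is the commonly used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


by_similarity = ["m1", "m2", "m3"]  # hypothetical IDs, best match first
by_importance = ["m3", "m1", "m4"]  # same store, ordered by importance
fused = reciprocal_rank_fusion([by_similarity, by_importance])
```

Because RRF only looks at positions, it sidesteps the question of how to scale importance against similarity. In this example m1 comes out on top because it sits near the head of both lists.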
Think of MOS as a tight, hackable memory substrate that you can drop behind any agent or chat service—especially when you don’t want another SaaS dependency.
Developer scenarios worth trying
- Support chatbot with decaying context: Store interactions with expires_at so ephemeral chatter fades while high-importance troubleshooting notes stay sticky.
- On-prem LLM assistant: Keep everything local for compliance; include memory in your existing Postgres backups. Good fit for healthcare, finance, or private R&D.
- Agent workspace memory: Track tasks, tools used, and outcomes; mark critical incidents with high importance_score to bias retrieval.
- Edge or offline prototyping: Run embeddings on CPU; no internet or API keys needed. Perfect for demos and internal sandboxes.
Trade-offs and gotchas to keep in mind
- Ranking calibration: Adding importance_score directly to a bounded similarity term mixes units. Consider normalizing importance (e.g., 0–1) and trying multiplicative blends or RRF/MMR for better balance.
- Model choice: all-MiniLM-L6-v2 is fast and small but not state-of-the-art. If you need long documents or multilingual quality, plan for a model upgrade or add a re-ranker.
- Auth: There’s no default authentication yet. Put it behind a gateway, or add API keys/JWT before exposing it to the world.
- Scaling: pgvector can handle plenty, but for high-scale ANN and hybrid search you may eventually jump to Weaviate/Pinecone or build a path with FAISS.
- Latency isolation: The split between API and embedder helps; ensure network hops and backpressure are handled, and cache embeddings where possible.
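The ranking-calibration point is the easiest one to try out. Below is a small Python sketch of one possible fix: min-max normalize importance to 0–1, then multiply instead of add. The function names and the weight parameter are hypothetical, and this is only one of several reasonable blends.

```python
def normalize(values: list[float]) -> list[float]:
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def multiplicative_score(distance: float, importance_01: float, weight: float = 1.0) -> float:
    """similarity * (1 + weight * importance): importance scales similarity by at most (1 + weight)."""
    similarity = 1.0 / (1.0 + distance)
    return similarity * (1.0 + weight * importance_01)


distances = [0.25, 1.0, 3.0]
raw_importance = [0.0, 5.0, 10.0]  # unbounded, user-defined values
scaled = normalize(raw_importance)
scores = [multiplicative_score(d, i) for d, i in zip(distances, scaled)]
```

With the additive formula and raw importance, the third item would score 0.25 + 10 = 10.25 and swamp everything else. With the normalized multiplicative blend the scores are 0.8, 0.75, and 0.5, so importance nudges the order without overriding relevance.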
Practical ideas for the roadmap
- Authentication middle layer (reverse proxy with JWT or signed headers, or service-to-service mTLS).
- Diversified ranking: MMR, recency decay, or task-aware scoring profiles.
- Better compression: semantic merge, graph clustering, or centroid summaries that preserve recall while shrinking tokens.
- Observability: per-request latency, hit/miss metrics, and memory-eviction dashboards.
- Evaluation harness: retrieval precision/recall benchmarks to compare models and scoring tweaks.
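For the diversified-ranking idea, maximal marginal relevance (MMR) is compact enough to sketch in plain Python. This is a generic illustration rather than anything in the MOS codebase: the two-dimensional vectors are toy data standing in for 384-dim embeddings, and lam is the usual relevance/diversity trade-off parameter.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: list[float], docs: dict[str, list[float]], k: int, lam: float = 0.5) -> list[str]:
    """Greedily pick k IDs, trading relevance to the query against similarity to earlier picks."""
    selected: list[str] = []
    remaining = list(docs)

    def mmr_score(doc_id: str) -> float:
        relevance = cosine(query, docs[doc_id])
        # Redundancy is the closest match among already-selected items (0 if none yet).
        redundancy = max((cosine(docs[doc_id], docs[s]) for s in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
docs = {
    "a": [1.0, 0.9],      # closest to the query
    "a_dup": [1.0, 0.8],  # near-duplicate of "a"
    "b": [0.1, 1.0],      # less relevant, but different
}
diverse = mmr(query, docs, k=2, lam=0.5)
relevance_only = mmr(query, docs, k=2, lam=1.0)
```

With lam = 1.0 the selection is pure relevance and returns the two near-duplicates. At lam = 0.5 the redundancy penalty pushes "a_dup" below "b", so the second slot goes to the distinct memory.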
Quickstart: kick the tires locally
- Prereqs: Docker and Docker Compose.
- Clone: git clone https://github.com/dhiraj2105/mos && cd mos
- Run: docker compose up --build
- Test a simple endpoint (examples in the repo). When you’re done, stop with Ctrl+C.
Because embedding, API, and database are all in containers, it’s trivial to swap models or scale components independently. If you outgrow CPU embeddings, you can explore GPU-backed containers with CUDA acceleration.
Why this matters now
As LLM apps mature, persistence strategies are becoming as important as prompting tricks. Memory that’s cheap, local, and predictable lets teams experiment with higher-level behaviors—agents that learn, assistants that truly remember, and workflows that evolve. MOS won’t replace full-text search or industrial vector databases, but as a minimal, hackable microservice, it fills a sweet spot that many teams hit during the “make it actually useful” phase.
If your RAG stack feels heavier than your app, consider flipping the script: start with a small, self-hosted memory layer and add complexity only when your metrics demand it.
AI Tech Inspire will be watching how projects like MOS evolve—especially around ranking, compression, and security. For now, it’s a tidy way to give your LLMs a memory that behaves more like a service than a science project.