LLM agents are great at reasoning in the moment, but the moment ends fast. Close the session, and your agent forgets who you are, what you decided, and which rabbit holes you already crawled through. At AI Tech Inspire, we spotted a proposal that tackles this head-on: a fragment-based memory system called Memento that aims to preserve persistent context without the usual noise or compression losses.

Key facts at a glance

  • The author is seeking an arXiv endorsement in cs.AI for a paper on persistent memory for LLM agents.
  • Problem stated: agents lose accumulated context when sessions end.
  • Limitations of common approaches: chunk-level RAG introduces irrelevant noise; summarization loses detail via compression.
  • Memento treats memory as atomic, typed fragments (1–3 sentences each) rather than large document chunks.
  • Six fragment types: Facts, Decisions, Errors, Preferences, Procedures, Relations.
  • Memories decay using rates inspired by Ebbinghaus’s forgetting curve to balance recency and stability.
  • Hybrid retrieval stack: Redis → PostgreSQL GIN → pgvector HNSW, merged via Reciprocal Rank Fusion (RRF).
  • Asynchronous pipeline handles embedding and contradiction detection off the agent’s critical path.
  • Deployed in a personal production setup for software engineering workflows.
  • Reported qualitative gains in retrieval density vs. chunk-level RAG; formal benchmarks are planned.
  • Paper title: “Memento: Fragment-Based Asynchronous Memory Externalization for Persistent Context in Large Language Model Agents.”
  • GitHub: https://github.com/JinHo-von-Choi/memento-mcp.
  • Endorsement link provided by the author: arXiv endorsement (code: ZO7A38).

Why fragments, not chunks?

Most Retrieval-Augmented Generation (RAG) stacks index biggish chunks of text and hope embedding similarity can surface what matters. That works—until it doesn’t. Dense vectors still pull in neighbors that are “close enough,” and then your agent reads three paragraphs to find one relevant sentence. Summarization avoids this by compressing, but at the cost of losing granular details that become critical later.

Memento flips the unit of memory: instead of monolithic chunks, it captures 1–3 sentence fragments crafted to be atomic facts or insights. Those fragments are typed, so retrieval can combine semantic similarity with intent-aware filtering. Whether your agent runs on GPT, a local PyTorch stack, or models from Hugging Face, the idea is model-agnostic: keep small, meaningful memory units that can recombine to reconstruct long-running context precisely when needed.

Key takeaway: better memory isn’t only about stronger embeddings—it’s about making the unit of storage match the unit of reasoning.

Typed memory: six ways to remember

Not all memories are equal, and Memento formalizes that with a six-type taxonomy:

  • Facts: objective statements ("Repository uses Python 3.11.")
  • Decisions: explicit choices ("Adopt FastAPI over Flask for the service.")
  • Errors: known pitfalls ("Previous fix caused a regression in endpoint /v2/search.")
  • Preferences: user or team leanings ("Prefers snake_case in DB schemas.")
  • Procedures: how-tos ("To deploy, run make build && make push.")
  • Relations: links between entities ("Service A depends on Queue B.")

This typing is powerful. Imagine asking your agent, “Remind me of any Errors linked to the search module before I refactor,” or “What Preferences should guide the API design?” Typed recall narrows the focus. It’s easier to filter noise when you can route queries by intent—much like filtering logs by level before sifting through content.

Let memories age (but not vanish)

Memento borrows from the Ebbinghaus forgetting curve to implement tunable decay. In practice, fragments decay at different rates by type. For example, Preferences might decay slowly (they rarely change), whereas transient Facts tied to a work-in-progress might decay faster. Decay can encode recency and certainty without hard deletes. When a fragment reappears (recalled or reinforced), its score is refreshed—similar to spaced repetition.

For developers, this is a helpful alternative to Ctrl+F across an ever-growing vector store. You get a memory that stays relevant because it ages with your workflow.

The three-tier retrieval stack

Memento’s retrieval is layered and deliberately hybrid:

  • Redis: a hot cache for ultra-recent or high-priority fragments. Think of it as your L1 memory—fast lookups for what just happened.
  • PostgreSQL (GIN): exact/lexical search on typed and structured fields. If you want type=Decision or tag=Auth, this is crisp and cheap.
  • pgvector (HNSW): semantic nearest-neighbor retrieval over embeddings when meaning matters more than keywords.

Results are merged with Reciprocal Rank Fusion (RRF), which has a nice property: it respects strong signals from multiple rankers without overfitting to a single modality. A fragment that scores well lexically and semantically bubbles up; a noisy neighbor that wins only on cosine similarity gets tempered by type-filtered signals.

In plain terms: recency + structure + semantics, fused to deliver fewer but more relevant memories into the prompt window.
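Reciprocal Rank Fusion itself is a short algorithm: each ranker contributes 1/(k + rank) for every item it returns, and the sums are sorted. A minimal sketch (k = 60 is the conventional default from the RRF literature, not a value confirmed for Memento):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each ranker adds 1/(k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, frag_id in enumerate(ranked, start=1):
            scores[frag_id] = scores.get(frag_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical rankers: Redis recency, Postgres lexical, pgvector semantic.
fused = rrf_fuse([["a", "b"], ["b", "c"], ["b", "a"]])
```

Here `"b"` wins because two rankers put it first, even though the recency ranker preferred `"a"`: exactly the "multiple weak agreements beat one strong cosine hit" behavior described above.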

Async by default: memory off the critical path

Memory work can be expensive. Memento keeps the agent’s main loop responsive by pushing embedding, contradiction detection, and housekeeping to an asynchronous pipeline. The agent writes a candidate memory “receipt,” then moves on. Background workers embed the fragment, check for conflicts (e.g., a new Fact that contradicts an existing one), and apply decay rules. This reduces tail latency and keeps interactive flows snappy—even when the memory graph is large.

Developers shipping assistants for CI, code review, or incident response know this pain: if memory management stalls the agent, the user notices. Asynchrony is a pragmatic choice.
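The receipt-then-background-worker pattern can be sketched with an asyncio queue. Everything here is illustrative: the toy embedding and contradiction check stand in for a real model and NLI/LLM-based detection, and the function names are not Memento's:

```python
import asyncio

def fake_embed(text: str) -> list[float]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [float(ord(c)) for c in text[:4]]

def contradicts(old: dict, new: dict) -> bool:
    # Toy rule: same subject, different text. Real detection would be model-based.
    return old["subject"] == new["subject"] and old["text"] != new["text"]

async def agent_write(queue: asyncio.Queue, fragment: dict) -> str:
    """The agent enqueues a candidate memory and moves on immediately."""
    await queue.put(fragment)
    return f"receipt:{fragment['id']}"

async def memory_worker(queue: asyncio.Queue, store: dict) -> None:
    """Off the critical path: embed, check conflicts, then commit."""
    while True:
        frag = await queue.get()
        frag["embedding"] = fake_embed(frag["text"])
        frag["conflicts"] = [fid for fid, f in store.items() if contradicts(f, frag)]
        store[frag["id"]] = frag
        queue.task_done()

async def demo():
    queue, store = asyncio.Queue(), {}
    worker = asyncio.create_task(memory_worker(queue, store))
    receipt = await agent_write(queue, {"id": "f1", "subject": "db",
                                        "text": "Uses Postgres 15."})
    await queue.join()   # only the demo waits; the agent itself would not
    worker.cancel()
    return receipt, store

receipt, store = asyncio.run(demo())
```

The agent's turnaround time is the cost of one `queue.put`; embedding latency and conflict scans land on the worker, which is the tail-latency win the proposal describes.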

What it looks like in practice

Here’s a compact example of a stored fragment as described in the proposal:

{
  "id": "frag_7821",
  "type": "Decision",
  "text": "Adopt FastAPI for the API layer.",
  "entities": ["FastAPI", "API"],
  "source": "design_review#2026-03-02",
  "score": 0.91,
  "decay": { "half_life_days": 120 }
}

Compare that to a 2,000-token chunk from a design doc: routing intent through type=Decision and a scoped text field is far more targetable. Retrieval density improves because each returned unit is already semantically minimal.

Where this helps developers right now

  • Long-running coding agents: Remember project-specific Procedures, migration steps, and Errors from prior attempts without re-ingesting the entire repo each time.
  • Productivity copilots: Store Preferences (tone, formatting, stack choices) and retrieve them on demand instead of packing the prompt with repeated instructions.
  • On-call assistants: Fetch recent Decisions and Relations between services to guide remediation under pressure.
  • Design reviewers: Surface past Facts and contradictory fragments while exploring alternatives: “Did we already reject this architecture? Why?”

Many teams bolt memory onto RAG stacks built with frameworks like LangChain or LlamaIndex. Memento’s fragment model could slot in as a specialized memory subsystem—especially if your current pipeline is over-retrieving or losing subtle but important details in summaries.

Open questions and what to watch

  • Evaluation: The author reports qualitative gains in “density” (fewer, more relevant fragments retrieved). Formal, reproducible benchmarks will matter—particularly across tasks (coding, planning, customer support) and models.
  • Contradiction handling: As fragments evolve, how are conflicts resolved—automatic supersedence, voting, or provenance-based arbitration?
  • Fragment granularity: 1–3 sentences is a good default, but task-specific tuning could help. Are Procedures better as multi-step fragments?
  • Decay calibration: Ebbinghaus-inspired decay is intuitive, but per-type half-lives will need careful data-driven tuning.
  • Cost and ops: Async pipelines and multi-store retrieval add complexity. Teams will want guidance on sizing Redis, Postgres, and vector indexes, plus compaction/TTL strategies.

How to try it (and what’s available)

The implementation is available here: github.com/JinHo-von-Choi/memento-mcp. According to the author, it’s already powering personal software engineering workflows. If you’re experimenting with agent memory—especially if your current RAG setup feels noisy—this repository looks like a practical sandbox.

For researchers and practitioners, there’s an arXiv submission in the works under the cs.AI category. The author is seeking endorsement and shared an access link: arXiv endorsement (code: ZO7A38). That’s relevant for those familiar with arXiv’s endorsement model who want to evaluate the work at a manuscript level.


Why it matters

LLM applications are colliding with the constraints of context windows and retrieval noise. Fragment-based memory is a simple but potent idea: make the stored unit match the cognitive unit. When paired with typed fragments, decay, and hybrid retrieval, memory can become a first-class capability rather than a bolted-on vector index.

If your agent keeps “forgetting” what it just learned—or worse, keeps re-reading irrelevant pages—Memento’s approach is worth a close look. As we see more complex agent workflows spanning days or weeks, designs like this could be the difference between an assistant that feels persistent and one that simply pretends to be.
