Long-Running Context in AI: A Technical Survey for System Architects
1. Introduction and Problem Statement
Large language models (LLMs) are fundamentally stateless. Each inference call processes a fixed-length token sequence—the context window—and produces output with no intrinsic mechanism for retaining information across calls [1]. This constraint creates a hard ceiling on any application requiring continuity: multi-session dialogue, user modeling, evolving problem-solving, or long-horizon agent behavior. The practical consequence is that an LLM-based assistant forgets everything the moment a conversation exceeds its context window or a new session begins.
The field of long-running context in AI concerns itself with overcoming this limitation. It encompasses the design of external memory architectures, retrieval mechanisms, summarization pipelines, and knowledge representations that allow LLM-based systems to maintain coherent state across arbitrarily long interaction histories. The goal is to approximate something analogous to human episodic and semantic memory: the ability to accumulate knowledge, recall relevant prior experience, and update beliefs in response to new information.
This document surveys the current state of the art as of early 2026, covering historical roots, technical architectures, data flows, leading research systems, evaluation benchmarks, and open problems. It is intended for a reader with a computer science background who needs to understand what exists, how it works mechanically, and where the field is heading.
2. Historical Context
2.1 Pre-LLM Precedents
The problem of persistent context in AI systems predates transformer-based models. Key antecedents include:
Cognitive architectures. Systems like Soar (Laird et al., 1987) and ACT-R (Anderson, 1993) implemented explicit episodic and procedural memory stores as part of general-purpose reasoning architectures. Soar’s episodic memory module (Nuxoll and Laird, 2007) stored time-stamped snapshots of working memory contents, retrievable by cue-based matching [7].
Dialogue state tracking. Task-oriented dialogue systems from the 2010s maintained structured belief states across turns—slot-value pairs representing user goals (e.g., destination=London, date=Friday). These were domain-specific and hand-engineered, not general-purpose memory systems.
Memory-augmented neural networks. The Neural Turing Machine (Graves et al., 2014) and Differentiable Neural Computer (Graves et al., 2016) introduced external read-write memory banks accessible via attention-based addressing. These demonstrated the principle that neural networks could learn to use external storage, but they operated at sequence level rather than session level.
2.2 The LLM Era Inflection Point
The arrival of GPT-3 (2020) and subsequent instruction-tuned models created a new class of problem. These models could carry on fluent conversations but had no mechanism for maintaining state beyond a single context window. The tension between their apparent conversational sophistication and their actual amnesia became the driving motivation for the current wave of memory research.
Three developments between 2023 and 2025 defined the modern landscape:
- Generative Agents (Park et al., 2023) demonstrated that LLMs coupled with a memory stream, reflection mechanism, and planning module could exhibit believable long-term behavior in simulation [2].
- MemGPT (Packer et al., 2023) reframed the problem in operating-systems terms, proposing virtual context management analogous to virtual memory paging [3].
- Retrieval-augmented generation (RAG) became the dominant practical approach for injecting external knowledge into LLM prompts, but its limitations for dynamic, evolving memory became increasingly apparent [4].
3. Taxonomy of Memory in LLM Systems
Several classification frameworks have been proposed. The most comprehensive is the eight-quadrant framework of Wu et al. (2025), which organizes memory along three dimensions: object (what is remembered), form (how it is represented), and time (temporal scope) [5]. A simpler and more operationally useful taxonomy, drawn from the survey by Zhang et al. (2024), distinguishes memory by its functional role [1]:
3.1 By Temporal Scope
- Working memory (in-context): Information currently within the LLM’s context window. This is the only memory the model can directly attend to during inference. Bounded by context window size (typically 4K–2M tokens in current models).
- Short-term memory: Recent interaction history maintained across a small number of turns, typically via conversation buffer or sliding window. Often implemented as a FIFO queue.
- Long-term memory: Persistent storage of information across sessions, days, or indefinitely. Requires external storage and a retrieval mechanism to surface relevant content into working memory at inference time.
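The short-term tier can be made concrete with a minimal sliding-window buffer; the class and method names here are illustrative, not drawn from any cited system:

```python
from collections import deque

class ShortTermMemory:
    """Minimal sliding-window buffer over recent conversation turns."""

    def __init__(self, max_turns: int = 4):
        # deque with maxlen gives FIFO eviction: the oldest turn is
        # dropped automatically once the window is full.
        self.buffer = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.buffer.append((role, text))

    def to_prompt(self) -> str:
        # Serialize the window for injection into the working context.
        return "\n".join(f"{role}: {text}" for role, text in self.buffer)

stm = ShortTermMemory(max_turns=2)
stm.add("user", "Book a flight to London")
stm.add("assistant", "Which date?")
stm.add("user", "Friday")
# Only the two most recent turns survive; the first is gone.
```

Long-term memory differs precisely in that evicted content is not lost but moved to external storage, to be retrieved back into working memory later.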
3.2 By Representation Form
- Raw text / episodic: Verbatim or lightly processed records of past interactions. Analogous to human episodic memory. Examples: full conversation logs, timestamped observation records.
- Structured / semantic: Extracted facts, entities, and relationships organized in knowledge graphs, relational databases, or structured schemas. Analogous to human semantic memory.
- Compressed / summary: Lossy representations produced by summarization—recursive summaries, distilled fact lists, or embeddings. Trades fidelity for token efficiency.
- Parametric: Knowledge encoded in model weights via fine-tuning or continual learning. Not externally inspectable or editable without retraining.
3.3 By Operational Role
- Episodic memory: Records of specific past events and interactions, with temporal metadata.
- Semantic memory: General knowledge and facts extracted from experience, divorced from specific episodes.
- Procedural memory: Learned skills, tool-use patterns, or behavioral routines that improve with experience.
4. Core Technical Architectures
This section describes the mechanical operation of the leading systems in detail.
4.1 Generative Agents (Park et al., 2023)
Origin: Stanford / Google. Published at UIST 2023 [2].
Problem addressed: Enabling 25 LLM-powered agents to exhibit believable, coherent behavior over extended simulated time in a sandbox environment (“Smallville”), including forming relationships, planning daily activities, and responding to emergent social situations.
Architecture components:
Memory stream. A time-ordered list of all observations, actions, and reflections as natural language records. Each entry carries metadata: creation timestamp, last-access timestamp, and an importance score (integer, assigned by the LLM at creation time). The memory stream grows monotonically; nothing is deleted.
Retrieval function. When the agent needs context for a decision, it retrieves a subset of the memory stream using a composite scoring function over three factors:
- Recency: Exponential decay on time since last access. Recent memories score higher.
- Importance: The LLM-assigned importance integer. Core life events score higher than mundane observations.
- Relevance: Cosine similarity between the embedding of the current query/situation and each memory’s embedding.
The final retrieval score is a weighted combination:
score = α·recency + β·importance + γ·relevance. Top-k memories by this score are injected into the prompt.

Reflection. A second-order memory process. When the sum of importance scores of recent unreflected memories exceeds a threshold (set to 150 in the paper), the agent enters a reflection phase. The 100 most recent memories are sent to the LLM with a prompt requesting the three most salient high-level questions, then relevant memories are retrieved to generate higher-level insight statements (e.g., “Klaus Mueller is dedicated to his research on gentrification”). These reflections are themselves stored in the memory stream and are retrievable in future queries [2][6].
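A minimal sketch of the retrieval function described above; the weights, decay rate, and 1–10 importance normalization are illustrative choices, not the paper's exact hyperparameters:

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(memory, query_embedding, now,
                    alpha=1.0, beta=1.0, gamma=1.0, decay=0.995):
    """Composite Generative Agents-style score over recency,
    importance, and relevance (weights and decay are illustrative)."""
    hours_since_access = (now - memory["last_access"]) / 3600
    recency = decay ** hours_since_access          # exponential recency decay
    importance = memory["importance"] / 10         # normalize LLM-assigned 1-10 score
    relevance = cosine(memory["embedding"], query_embedding)
    return alpha * recency + beta * importance + gamma * relevance

def retrieve(memories, query_embedding, now, k=3):
    """Return the top-k memories for prompt injection."""
    return sorted(memories,
                  key=lambda m: retrieval_score(m, query_embedding, now),
                  reverse=True)[:k]
```

A recently accessed, important, and topically relevant memory outscores a stale, mundane, off-topic one, which is exactly the behavior the composite score is designed to produce.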
Planning. Top-down recursive plan generation. A broad daily plan (5–8 items) is generated from the agent’s description and a summary of the previous day. This is decomposed to hourly resolution, then to 5–15 minute intervals. Plans are stored in the memory stream and can be revised when the agent encounters unexpected situations [6].
Data flow (simplified):
Perception → Memory Stream (append observation)
↓
Retrieval (recency × importance × relevance)
↓
Context Assembly (retrieved memories + current situation)
↓
LLM Inference → Action/Dialogue/Plan Update
↓
[If importance threshold exceeded] → Reflection → Memory Stream (append reflection)
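The reflection trigger and synthesis step in this flow can be sketched as follows. The 150-point threshold and 100-memory window follow the paper; `llm` is a hypothetical stand-in callable, and the importance assigned to insights is illustrative:

```python
REFLECTION_THRESHOLD = 150  # value used in the paper

def should_reflect(memories):
    """Trigger reflection once accumulated importance of
    unreflected memories crosses the threshold."""
    return sum(m["importance"] for m in memories
               if not m["reflected"]) >= REFLECTION_THRESHOLD

def reflect(memories, llm):
    """Sketch of the reflection phase; `llm` is a stand-in, not a real API."""
    recent = memories[-100:]                      # 100 most recent memories
    questions = llm("salient_questions", recent)  # ask for salient high-level questions
    insights = []
    for q in questions:
        # (Paper: retrieve memories relevant to each question; simplified here.)
        insights.append({"text": llm("insight", (q, recent)),
                         "importance": 8,         # illustrative value
                         "reflected": False, "kind": "reflection"})
    for m in memories:
        m["reflected"] = True
    memories.extend(insights)  # reflections are first-class, retrievable memories
    return insights
```

Because reflections re-enter the memory stream, later retrievals can surface the synthesized insight instead of dozens of raw observations.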
Key finding from ablation: Removing any of the three components (observation, reflection, planning) significantly degraded the believability of agent behavior as rated by human evaluators. Reflection was particularly important for synthesis and appropriate social responses [2].
4.2 MemGPT (Packer et al., 2023)
Origin: UC Berkeley. Published October 2023 [3].
Problem addressed: Enabling LLMs with fixed context windows to operate over documents and conversations that far exceed their context length, without requiring architectural changes to the model.
Core metaphor: Virtual memory management in operating systems. Just as an OS provides the illusion of abundant memory via paging between RAM and disk, MemGPT provides the illusion of extended context by paging information between the LLM’s context window (“main context”) and external storage (“external context”) [3].
Architecture components:
Main context (analogous to RAM). The LLM’s actual context window, subdivided into three regions:
- System instructions: Fixed prompt describing MemGPT’s functions and control flow.
- Working context: A mutable scratchpad the LLM can edit. Contains core persona information, key facts about the user, and any other information the LLM considers essential to keep immediately accessible.
- FIFO queue: Recent conversation messages. When the queue approaches the context window limit, a memory pressure warning is issued to the LLM, giving it the opportunity to save important information before messages are evicted [3].
External context (analogous to disk). Two databases:
- Archival storage: A read/write database for arbitrary-length text. Supports embedding-based search. Used for long-term storage of facts, reflections, and documents.
- Recall storage: A read-only log of the complete, uncompressed conversation history. Supports timestamp-based, text-based, and embedding-based search [3].
Self-directed memory management. The LLM manages its own memory by generating function calls. Available functions include:
- core_memory_append / core_memory_replace — edit working context
- archival_memory_insert / archival_memory_search — write to / query archival storage
- conversation_search / conversation_search_date — query recall storage
Control flow via interrupts. Borrowing from OS design, MemGPT uses an event-driven control flow. The LLM can chain function calls by emitting a request_heartbeat=true flag, triggering immediate re-inference without waiting for user input. This allows multi-step memory operations within a single user turn [3].

Queue management. When the FIFO queue approaches capacity (e.g., 70% of the context window), the system inserts a memory pressure warning. If the queue exceeds a flush threshold (e.g., 100%), the system evicts messages (e.g., 50% of the queue), generates a recursive summary of the evicted content, and stores evicted messages in recall storage [3].
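The queue-management policy can be sketched as below, using the example thresholds from the text (70% warning, evict half on flush); `summarize` stands in for the recursive-summary LLM call:

```python
def manage_queue(queue, token_count, context_limit, summarize, recall_storage,
                 warn_at=0.7, evict_fraction=0.5):
    """MemGPT-style queue management sketch. Thresholds are the
    illustrative values from the text, not fixed constants of the system."""
    used = sum(token_count(m) for m in queue)
    events = []
    if used >= warn_at * context_limit:
        events.append("memory_pressure_warning")  # LLM may now save facts to archival
    if used >= context_limit:
        n_evict = int(len(queue) * evict_fraction)
        evicted, queue[:] = queue[:n_evict], queue[n_evict:]  # evict oldest messages
        recall_storage.extend(evicted)            # full messages preserved in recall
        queue.insert(0, summarize(evicted))       # recursive summary takes their place
        events.append("flush")
    return events
```

The key property is that eviction is non-destructive: the summary keeps a lossy trace in context, while recall storage keeps the verbatim history searchable.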
Data flow:
User Message → FIFO Queue (append)
↓
[Queue exceeds threshold?] → Memory Pressure Warning → LLM may save to archival
↓
[Queue exceeds flush limit?] → Summarize + Evict → Recall Storage
↓
LLM Inference (system instructions + working context + FIFO queue)
↓
LLM Output → Function Executor
↓
[Function call?] → Execute function (read/write memory)
↓
[request_heartbeat?] → Re-invoke LLM (chain functions)
↓
[No more function calls] → Return response to user
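The event loop above can be sketched in a few lines; `llm` and `execute` are hypothetical stand-ins, not the real MemGPT interface, and the step bound is a defensive addition:

```python
def memgpt_turn(llm, execute, user_message, max_steps=5):
    """Event-driven turn: the model chains function calls via
    request_heartbeat until it produces a user-facing reply."""
    event = {"type": "user_message", "content": user_message}
    for _ in range(max_steps):             # bound the chain defensively
        output = llm(event)
        if output.get("function_call"):
            result = execute(output["function_call"])
            if output.get("request_heartbeat"):
                # Re-invoke the LLM immediately with the function result,
                # without waiting for the next user input.
                event = {"type": "function_result", "content": result}
                continue
        return output.get("reply", "")
    return ""
```

A two-step chain (search external context, then answer) would look like: first inference emits a search call with a heartbeat; second inference, seeing the results, emits the reply.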
Key technical insight: By making the LLM responsible for its own memory management (rather than relying on a fixed heuristic), MemGPT can adapt its memory strategy to the task. The LLM learns to proactively save important information before it is evicted and to search external context when it recognizes a gap in its current knowledge [3].
4.3 HippoRAG (Gutiérrez et al., 2024) and HippoRAG 2 (Gutiérrez et al., 2025)
Origin: Ohio State University / Stanford. HippoRAG published at NeurIPS 2024 [4]. HippoRAG 2 published 2025 [8].
Problem addressed: Standard RAG encodes passages in isolation as independent vectors, making it unable to integrate knowledge across documents—a capability essential for multi-hop reasoning, literature review, and medical diagnosis. HippoRAG addresses this by drawing on the hippocampal indexing theory of human long-term memory.
Neuroscience grounding: The hippocampal indexing theory posits that the hippocampus does not store memories directly. Instead, it maintains an index of pointers to memory traces distributed across the neocortex. When a cue activates hippocampal index entries, it triggers pattern completion across neocortical regions, reconstructing the full memory. HippoRAG maps this onto a computational architecture [4]:
| Biological Component | HippoRAG Analog |
|---|---|
| Neocortex (stores knowledge) | LLM’s parametric knowledge |
| Hippocampal index | Knowledge graph (KG) |
| Pattern separation (encoding) | LLM extracts KG triples from passages |
| Pattern completion (retrieval) | Personalized PageRank (PPR) over KG |
| Parahippocampal region (gating) | Passage-to-entity linking via embeddings |
Offline indexing process:
- For each passage, an LLM extracts open-domain knowledge graph triples (subject, predicate, object).
- Extracted entities are linked to existing KG nodes via a parahippocampal region encoder (embedding similarity), enabling synonymy detection and entity resolution.
- Passages are connected to their referenced entities via edges in the KG.
Online retrieval process:
- Given a query, the LLM extracts named entities from the query.
- The parahippocampal encoder links these query entities to KG nodes.
- Personalized PageRank is run from the query entity nodes, propagating activation through the KG to find contextually relevant entities and, by extension, the passages that reference them.
- Top-ranked passages are returned as retrieval results.
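The pattern-completion step can be illustrated with a hand-rolled Personalized PageRank over a toy graph; real deployments use optimized graph libraries, and the entities and passage nodes below are invented for illustration:

```python
def personalized_pagerank(edges, seeds, damping=0.85, iters=50):
    """Power-iteration PPR over an undirected graph: activation spreads
    outward from the seed (query-entity) nodes. A teaching sketch."""
    nodes = sorted({n for edge in edges for n in edge})
    nbrs = {n: [] for n in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    # Teleport distribution concentrated entirely on the seed nodes.
    reset = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(reset)
    for _ in range(iters):
        nxt = {n: (1 - damping) * reset[n] for n in nodes}
        for n in nodes:
            if nbrs[n]:
                share = damping * rank[n] / len(nbrs[n])
                for m in nbrs[n]:
                    nxt[m] += share
        rank = nxt
    return rank

# Toy KG: entity nodes plus the passages that reference them (all invented).
edges = [("insulin", "diabetes"), ("diabetes", "metformin"),
         ("metformin", "passage_7"), ("insulin", "passage_3"),
         ("unrelated", "passage_9")]
scores = personalized_pagerank(edges, seeds={"insulin"})
```

Passages connected to the query entity, directly or through multi-hop paths, accumulate activation; passages in disconnected components receive none, which is how the graph integrates knowledge across documents that isolated passage embeddings cannot.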
HippoRAG 2 improvements (2025): The original HippoRAG suffered from an entity-centric approach that lost passage-level context during both indexing and retrieval. HippoRAG 2 addresses this with deeper passage integration—passages, queries, and triples are all incorporated into the graph, providing a more comprehensive retrieval signal. It achieves a 7% improvement over prior state-of-the-art embedding models on associative memory tasks while also maintaining strong factual and sense-making performance [8].
Data flow:
[Offline Indexing]
Passage → LLM Triple Extraction → (subject, predicate, object)
↓
Entity Resolution (embedding matching)
↓
Knowledge Graph (add nodes + edges)
[Online Retrieval]
Query → LLM Entity Extraction → Query Entities
↓
Link to KG Nodes (embedding matching)
↓
Personalized PageRank → Ranked Entities
↓
Map Entities → Passages → Return top-k
4.4 Zep / Graphiti (Rasmussen et al., 2025)
Origin: Zep AI. Published January 2025 [9].
Problem addressed: Enterprise applications require dynamic knowledge integration from ongoing conversations and structured business data—not just static document retrieval. Existing systems either lose information through recursive summarization (MemGPT) or fail to capture temporal dynamics (standard RAG).
Core component: Graphiti, a temporally-aware knowledge graph engine. Zep is the production memory layer service built on top of Graphiti.
Architecture: three-tier hierarchical knowledge graph G = (N, E, φ):
Episode subgraph (Gₑ). Episodic nodes store raw input data—messages, text, or JSON. These are non-lossy: the complete input record is preserved. Episodic edges connect episodes to the semantic entities they reference [9].
Semantic entity subgraph (Gₛ). Entity nodes represent entities extracted from episodes and resolved against existing graph entities. Semantic edges (also called “facts”) represent relationships between entities. Each edge carries temporal validity metadata [9].
Community subgraph (Gᶜ). Communities are clusters of densely connected entities, identified via graph clustering algorithms. Community nodes store high-level summaries of their constituent entities and relationships, enabling efficient retrieval for broad queries.
Bi-temporal model. Every edge (fact) in the graph carries two temporal dimensions:
- Event time (t_valid, t_invalid): When the fact was true in the real world.
- Transaction time: When the fact was recorded in the graph.
This enables point-in-time queries (“What was true on date X?”) and audit trails (“When did we learn fact Y?”). When new information contradicts an existing fact, the old edge’s t_invalid is set to the current time. The old fact is preserved, not deleted [9].
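A minimal sketch of the bi-temporal invalidation scheme: the t_valid/t_invalid field names follow the text, while the data structures and matching rule are illustrative simplifications:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Fact:
    """A semantic edge (fact) with bi-temporal metadata."""
    subject: str
    predicate: str
    object: str
    t_valid: datetime                     # event time: when the fact became true
    t_invalid: Optional[datetime] = None  # event time: when it stopped being true
    recorded_at: datetime = field(        # transaction time: when we learned it
        default_factory=lambda: datetime.now(timezone.utc))

def assert_fact(graph: list, new: Fact) -> None:
    """Invalidate, never delete: a contradicted fact gets a closed interval."""
    for old in graph:
        if ((old.subject, old.predicate) == (new.subject, new.predicate)
                and old.object != new.object and old.t_invalid is None):
            old.t_invalid = new.t_valid   # stale fact preserved with end time
    graph.append(new)

def as_of(graph: list, when: datetime) -> list:
    """Point-in-time query over event time: what was true at `when`?"""
    return [f for f in graph
            if f.t_valid <= when and (f.t_invalid is None or when < f.t_invalid)]
```

Because recorded_at is stored independently of t_valid, the same structure also answers the audit question of when a fact entered the graph, regardless of when it was true.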
Ingestion pipeline:
- New episode arrives (message, document, JSON).
- LLM extracts entities and relationships.
- Entity resolution: new entities are matched against existing KG nodes using semantic, keyword, and graph-based signals.
- Conflict detection: new facts are compared against existing facts. If contradictions are detected, temporal metadata is used to invalidate stale facts.
- Community update: affected communities are re-summarized.
Retrieval: Combines three search modalities: semantic (embedding similarity), keyword (BM25-style), and graph traversal (multi-hop). Results are re-ranked for relevance and temporal validity.
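One standard way to fuse ranked lists from multiple retrievers is reciprocal rank fusion; the paper does not specify Zep's re-ranker at this level, so this is an illustrative stand-in, with invented fact IDs:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: each retriever contributes 1/(k + rank + 1)
    per result; items ranked well by several retrievers rise to the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["f3", "f1", "f7"]  # embedding similarity
keyword  = ["f1", "f9", "f3"]  # BM25-style
graph    = ["f1", "f3", "f2"]  # multi-hop traversal
fused = rrf_fuse([semantic, keyword, graph])
```

RRF needs no score calibration across modalities, which is why it is a common default when fusing heterogeneous retrievers; a temporal-validity filter would then be applied over the fused list.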
Benchmark results: On the Deep Memory Retrieval (DMR) benchmark, Zep achieved 94.8% accuracy versus MemGPT’s 93.4%. On LongMemEval, Zep showed accuracy improvements up to 18.5% with 90% latency reduction compared to baseline implementations [9].
4.5 Mem0 (Chhikara et al., 2025)
Origin: Mem0 AI (Y Combinator S24). Published April 2025 [10].
Problem addressed: Production deployment of persistent memory for LLM-based agents, with emphasis on cost efficiency, latency, and scalability.
Architecture: two-phase memory pipeline:
Phase 1 — Extraction: Three context sources are assembled for each new exchange:
- The latest user-assistant message pair.
- A rolling summary of the conversation (generated asynchronously by a background LLM call).
- The m most recent messages (m=10 in their experiments).
An LLM processes these inputs and extracts a concise set of candidate memory facts [10].
Phase 2 — Update: Each candidate memory is compared against the top-s most similar entries in the existing vector store. The LLM then selects one of four operations for each candidate:
- Add: Store as a new memory.
- Update: Modify an existing memory to incorporate new information.
- Delete: Remove an existing memory that is now obsolete.
- No-op: The candidate does not warrant any memory change [10].
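The update phase can be sketched as a single dispatch function; `similar` stands in for top-s vector search and `llm_decide` for the operation-selection LLM call, both hypothetical:

```python
def update_memory(candidate, store, llm_decide, similar, top_s=5):
    """Apply Mem0's four-way decision for one candidate memory fact."""
    neighbors = similar(candidate, store, top_s)   # top-s most similar entries
    op, target = llm_decide(candidate, neighbors)  # ADD / UPDATE / DELETE / NOOP
    if op == "ADD":
        store.append(candidate)                    # genuinely new information
    elif op == "UPDATE":
        store[store.index(target)] = candidate     # fold new info into old entry
    elif op == "DELETE":
        store.remove(target)                       # candidate makes target obsolete
    # NOOP: candidate adds nothing; store unchanged
    return store
```

Routing every write through this decision is what keeps the store compact: contradictions resolve to a single current fact rather than accumulating as near-duplicates.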
Mem0ᵍ (graph-enhanced variant): Extends the base architecture with a graph-based memory store. An entity extractor identifies entities as nodes, and a relations generator infers labeled directed edges. A conflict detector flags overlapping or contradictory nodes/edges, and an LLM-powered update resolver decides whether to add, merge, invalidate, or skip graph elements [10].
Benchmark results: On the LOCOMO benchmark, Mem0 achieves a 26% relative improvement in LLM-as-a-Judge accuracy over OpenAI’s built-in memory feature, with 91% lower p95 latency and over 90% token cost savings compared to full-context approaches [10].
5. Evaluation Benchmarks
Evaluating long-term memory is a distinct problem from evaluating single-turn LLM performance. Several benchmarks have been developed:
| Benchmark | Source | Characteristics |
|---|---|---|
| LoCoMo | Maharana et al. (2024) [11] | Very long-term dialogues (up to 35 sessions, 300 turns, ~9K tokens avg). Tests single-hop, multi-hop, temporal, and adversarial QA over conversation history. |
| LongMemEval | Wu et al. (2024) [12] | Multi-session evaluations with 500 questions. Tests information extraction, multi-session reasoning, and temporal reasoning across conversations up to 1.5M tokens. |
| DMR (Deep Memory Retrieval) | Packer et al. (2023) [3] | MemGPT team’s benchmark for factual recall from multi-session conversations. |
| MemBench | Tan et al. (2025) [13] | Tests information extraction, multi-hop reasoning, knowledge updating, preference following, and temporal reasoning. Finds that existing systems fail to use feedback effectively without forgetting. |
A key finding across benchmarks: current systems struggle with temporal reasoning (ordering events, tracking changes in state over time), knowledge updating (incorporating corrections to previously stored facts), and maintaining consistent behavioral profiles across sessions [12][13].
6. Current Research Themes (2024–2026)
6.1 Graph-Based Memory
The strongest current trend is the integration of knowledge graphs into memory architectures. Systems like Zep/Graphiti [9], HippoRAG [4], Mem0ᵍ [10], and GraphMem (2025) use graphs to capture relational structure that vector stores alone cannot represent. GraphMem reported 80% accurate retrieval on the MemoryAgentBench benchmark, representing a 14.9 percentage point improvement over HippoRAG v2 on that benchmark [14].
6.2 Neuroscience-Inspired Architectures
Multiple research groups are explicitly mapping human memory theories onto computational architectures. HippoRAG maps hippocampal indexing theory [4]. The EM-Mem framework by Yin et al. (2024) uses expectation maximization for explicit memory learning [15]. The survey by Wu et al. (2025) proposes a classification framework grounded in cognitive science dimensions [5]. This trend reflects a broader hypothesis: that human memory architecture, shaped by evolutionary pressure, may encode useful inductive biases for AI memory systems.
6.3 Agentic Memory Management
Rather than designing fixed memory management policies, recent work gives agents the ability to manage their own memory. MemGPT [3] pioneered this with self-directed function calls. A-MEM (Xu et al., 2025) takes this further with a Zettelkasten-inspired approach where the agent dynamically organizes memories into interconnected notes [16]. The key advantage is adaptability: the agent can adjust its memory strategy based on task demands without requiring architectural changes.
6.4 Memory in Multi-Agent Systems
The survey by Wu et al. (2025) on LLM-MAS memory [17] identifies a critical distinction between individual agent memory and collective memory. Multi-agent systems require coordination across three contextual layers: individual local context, team joint context, and evolving environment state. Architectures range from fully shared memory pools (enabling “team mind” coordination but risking noise and privacy issues) to fully private memory with explicit communication channels [17].
6.5 Error Propagation and Memory Quality
A 2025 empirical study on memory management [7] identified two key failure modes:
- Error propagation: When a faulty memory from an unsuccessful task execution is stored, it can cause the agent to repeat errors in future tasks.
- Misaligned experience replay: Memories from dissimilar contexts are retrieved and applied inappropriately.
The study found that evaluator reliability is critical for memory quality—selective memory addition based on strict evaluation outperforms indiscriminate storage [7].
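The selective-addition finding amounts to gating writes behind an evaluator. A minimal sketch, where the function names and threshold are illustrative rather than taken from the cited study:

```python
def selective_write(task_record, memory, evaluator, threshold=0.8):
    """Store an experience only if the evaluator judges it successful,
    limiting error propagation from faulty task trajectories."""
    score = evaluator(task_record)  # e.g., an LLM judge or a programmatic check
    if score >= threshold:
        memory.append(task_record)
        return True
    return False                    # failed experience is not stored
```

The study's point is that this gate is only as good as the evaluator: an unreliable judge readmits the faulty memories the gate was meant to exclude.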
6.6 Context Engineering
An emerging framing (2025–2026) repositions the problem as “context engineering”: the systematic design of what information enters the LLM’s context window, when, and in what form. This encompasses memory retrieval, prompt construction, tool result formatting, and dynamic context prioritization. It subsumes memory as one component of a larger system design challenge.
7. Comparative Analysis of Approaches
| System | Memory Representation | Temporal Awareness | Write Mechanism | Read Mechanism | Scalability |
|---|---|---|---|---|---|
| Generative Agents [2] | Natural language stream | Timestamps, recency decay | Append-only | Weighted retrieval (recency × importance × relevance) | Limited (full stream in memory) |
| MemGPT [3] | Working context + archival DB + recall log | Timestamps | LLM self-directed function calls | LLM-initiated search (text, embedding, timestamp) | Moderate (DB-backed) |
| HippoRAG [4] | Knowledge graph + passages | Implicit (via passage ordering) | LLM triple extraction + entity resolution | Personalized PageRank | Good (graph scales with data) |
| Zep/Graphiti [9] | Temporal knowledge graph (3-tier) | Bi-temporal model (event time + transaction time) | LLM extraction + entity resolution + conflict detection | Hybrid (semantic + keyword + graph traversal) | Production-grade |
| Mem0 [10] | Vector store + optional graph | Implicit | LLM extraction + 4-operation update | Vector similarity + graph traversal | Production-grade |
8. Open Problems
Forgetting and decay. Human memory naturally decays. Most current AI memory systems either keep everything forever (growing storage and retrieval noise) or use crude heuristics for deletion. Principled forgetting policies remain an open problem.
Contradiction resolution. When new information contradicts stored memories, systems must decide what to keep. Zep’s temporal invalidation is one approach, but robust contradiction detection across complex, multi-hop relationships remains unsolved.
Privacy. Persistent memory introduces privacy risks. The paper “Unveiling Privacy Risks in LLM Agent Memory” (2025) examined how stored memories could leak sensitive user information through subsequent interactions [18].
Evaluation. Existing benchmarks primarily test factual recall. Evaluating whether an agent has formed appropriate abstractions, maintained consistent personality, or updated beliefs appropriately is much harder and less well-served by current metrics.
Cross-modal memory. Most current systems handle text only. Extending memory architectures to images, audio, and other modalities is an active area (e.g., MemVerse by Liu et al., 2025) [19].
9. Glossary
Archival storage. In MemGPT, a persistent read/write database for long-term information, searchable by text or embedding. Analogous to disk storage in an OS.
Bi-temporal model. A data modeling approach that tracks two independent time dimensions: when a fact was true (event time) and when it was recorded (transaction time). Used by Zep/Graphiti.
Context window. The maximum number of tokens an LLM can process in a single inference call. Information outside this window is invisible to the model.
Episodic memory. Memory of specific events and experiences, typically with temporal and contextual metadata. Contrasted with semantic memory (general facts).
FIFO queue. First-in-first-out buffer. In MemGPT, the most recent conversation messages are maintained in a FIFO queue within the context window.
Hippocampal indexing theory. A neuroscience theory positing that the hippocampus maintains an index of pointers to distributed neocortical memory traces, rather than storing memories directly.
Knowledge graph (KG). A graph data structure where nodes represent entities and edges represent relationships between them. Used by HippoRAG, Zep, and Mem0ᵍ for structured memory.
Main context. In MemGPT, the information currently within the LLM’s context window. Analogous to RAM/physical memory in an OS.
Memory pressure. In MemGPT, a warning issued to the LLM when the FIFO queue approaches the context window limit, giving the LLM an opportunity to save important information before eviction.
Memory stream. In Generative Agents, a time-ordered append-only list of all observations, actions, and reflections in natural language.
Parametric memory. Knowledge encoded in model weights, as opposed to external memory stores. Modified only through training/fine-tuning, not through runtime read/write operations.
Personalized PageRank (PPR). A variant of the PageRank algorithm that computes node importance relative to a specific set of seed nodes (in HippoRAG, the query entities).
Recall storage. In MemGPT, a read-only log of the complete uncompressed conversation history, including system messages and tool calls.
Reflection. In Generative Agents, a process by which the agent synthesizes higher-level insights from recent memories. Triggered when accumulated importance scores exceed a threshold.
Retrieval-augmented generation (RAG). A technique where relevant documents or passages are retrieved from an external store and injected into the LLM’s prompt to provide context not present in the model’s parametric knowledge.
Semantic memory. General knowledge and facts, abstracted from specific episodes. In AI memory systems, typically represented as structured data (KG triples, fact lists).
Virtual context management. MemGPT’s core technique: using the LLM’s own function-calling capability to page information between the context window and external storage, analogous to OS virtual memory.
Working context. In MemGPT, a mutable section of the context window that the LLM can edit via function calls. Contains core persona and user information.
10. Bibliography
[1] Zhang, Z., Bo, X., Ma, C., Li, R., Chen, X., Dai, Q., Zhu, J., Dong, Z., & Wen, J. “A Survey on the Memory Mechanism of Large Language Model-based Agents.” ACM Transactions on Information Systems, 2024. [Verified] https://dl.acm.org/doi/10.1145/3748302
[2] Park, J.S., O’Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. “Generative Agents: Interactive Simulacra of Human Behavior.” Proceedings of UIST ‘23, ACM, 2023. [Verified] https://arxiv.org/abs/2304.03442
[3] Packer, C., Fang, V., Patil, S., Lin, K., Wooders, S., & Gonzalez, J. “MemGPT: Towards LLMs as Operating Systems.” arXiv:2310.08560, 2023. [Verified] https://arxiv.org/abs/2310.08560
[4] Gutiérrez, B.J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. “HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models.” NeurIPS 2024. [Verified] https://arxiv.org/abs/2405.14831
[5] Wu, Y., Liang, S., Zhang, C., Wang, Y., Zhang, Y., Guo, H., Tang, R., & Liu, Y. “From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs.” arXiv:2504.15965, 2025. [Snippet] https://arxiv.org/html/2504.15965v2
[6] Park, J.S. et al., as reviewed in GonzoML. “Generative Agents: Interactive Simulacra of Human Behavior.” 2023. [Snippet] https://gonzoml.substack.com/p/generative-agents-interactive-simulacra
[7] Xu, W., Mei, K., Gao, H., Tan, J., Liang, Z., & Zhang, Y. “How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior.” arXiv:2505.16067, 2025. [Snippet] https://arxiv.org/html/2505.16067v2
[8] Gutiérrez, B.J. et al. “From RAG to Memory: Non-Parametric Continual Learning for Large Language Models.” ICML 2025. [Snippet] https://arxiv.org/html/2502.14802v1
[9] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. “Zep: A Temporal Knowledge Graph Architecture for Agent Memory.” arXiv:2501.13956, 2025. [Verified] https://arxiv.org/abs/2501.13956
[10] Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.” arXiv:2504.19413, 2025. [Verified] https://arxiv.org/abs/2504.19413
[11] Maharana, A., Lee, D., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. “Evaluating Very Long-Term Conversational Memory of LLM Agents.” arXiv:2402.17753, 2024. [Verified] https://snap-research.github.io/locomo/
[12] Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K., & Yu, D. “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.” arXiv:2410.10813, 2024. [Snippet] https://arxiv.org/html/2512.12818v1
[13] Tan et al. “MemBench: Evaluating the Memory of LLM-Based Agents.” 2025. [Snippet] https://www.arxiv.org/pdf/2510.27246
[14] GraphMem. Referenced in HippoRAG ResearchGate page. 2025. [Snippet] https://www.researchgate.net/publication/380847724_HippoRAG_Neurobiologically_Inspired_Long-Term_Memory_for_Large_Language_Models
[15] Yin, Z., Sun, Q., Guo, Q., Zeng, Z., Cheng, Q., Qiu, X., & Huang, X. “Explicit Memory Learning with Expectation Maximization.” Proceedings of EMNLP 2024, pp. 16618–16635. [Snippet] https://arxiv.org/html/2505.16067v2
[16] Xu, W., Mei, K., Gao, H., Tan, J., Liang, Z., & Zhang, Y. “A-MEM: Agentic Memory for LLM Agents.” arXiv:2502.12110, 2025. [Snippet] https://arxiv.org/html/2505.16067v2
[17] Wu, S. et al. “Memory in LLM-based Multi-agent Systems: Mechanisms, Challenges, and Collective.” TechRxiv preprint, 2025. [Snippet] https://www.techrxiv.org/users/1007269/articles/1367390/master/file/data/LLM_MAS_Memory_Survey_preprint_/LLM_MAS_Memory_Survey_preprint_.pdf
[18] “Unveiling Privacy Risks in LLM Agent Memory.” 2025. [Snippet] https://github.com/Shichun-Liu/Agent-Memory-Paper-List
[19] Liu, J. et al. “MemVerse: Multimodal Memory for Lifelong Learning Agents.” arXiv:2512.03627, 2025. [Snippet] https://arxiv.org/html/2512.12818v1