Modeling a User as a Fine-Tuned LLM - The Weights-as-State Architecture

Scope: Two-LLM architectures where a fine-tuned User-Model LLM serves as the persistent representation of an individual user, queried by an Agent LLM at inference time

1. The Core Idea

The architecture under discussion is:

  • An Agent LLM serves the user — answering questions, executing tasks, generating content.
  • A User-Model LLM is a separate model, fine-tuned on the specific user’s data, whose weights encode that user — their preferences, reasoning patterns, knowledge, writing style, decision tendencies, and personal history.
  • The Agent LLM queries the User-Model LLM as a personalization oracle: “What does this user know about X?”, “How would this user phrase this?”, “What context should I add to this query on behalf of this user?”
  • The User-Model LLM’s parameters are the persistent state. There is no external memory database, no retrieval-augmented generation pipeline, no profile text file. The user is the model. The weights are the memory.

This is distinct from other personalization strategies (RAG, prompt injection, memory layers) in a fundamental way: those approaches store user information outside the model and inject it at inference time. This approach stores user information inside the model, compressed into its parameters via training. The term of art for this is memory parameterization [1].
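
The division of labor described above can be sketched in a few lines. Both models are stubbed as plain functions here; all names and canned answers are illustrative, not part of any published interface:

```python
# Sketch of the two-LLM loop: the Agent LLM consults the User-Model LLM
# for persona context instead of reading a database or a profile file.

def user_model(question: str) -> str:
    # Stands in for the fine-tuned User-Model LLM: it answers questions
    # about the user from its weights alone -- no retrieval step.
    canned = {
        "languages": "The user prefers Python and Rust.",
        "style": "The user writes terse, example-heavy prose.",
    }
    return canned.get(question, "No strong signal for this query.")

def agent_llm(task: str, persona_context: str) -> str:
    # Stands in for the task-executing Agent LLM. The only personalization
    # it ever sees is what the User-Model returned -- zero tokens of raw
    # user history enter its context window.
    return f"[task: {task}] [context: {persona_context}]"

def answer_with_personalization(task: str) -> str:
    # 1. Agent asks the User-Model a persona question relevant to the task.
    persona = user_model("languages")
    # 2. Agent executes the task conditioned on that answer.
    return agent_llm(task, persona)
```

The point of the sketch is the interface shape: the Agent never touches user data directly, only the User-Model's synthesized answers.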


2. Why Parameterize Instead of Retrieve?

The existing approaches to user-specific personalization all share a common structure: user data lives in an external store (vector DB, text profile, knowledge graph) and is retrieved and injected into the context window at query time. This creates three problems that memory parameterization addresses:

Context window competition. Every token of user context injected into the prompt is a token unavailable for the actual task. As user histories grow, the ratio of personalization-context to task-context degrades. Shang et al. (2024) demonstrated empirically that even frontier models like GPT-4o cannot reliably extract and reason over relevant information from long contexts — their “Reasoning-in-a-Haystack” experiments showed near-total failure in the long-context regime [1][2].

Retrieval fragility. RAG-based personalization depends on the retriever surfacing the right fragments. If the retriever misses relevant history or returns irrelevant fragments, the personalization fails silently. Retrieval also breaks the continuity of a user’s history — it returns isolated chunks, not an integrated understanding of the user [3].

No generalization. A retrieval system can only return what was explicitly stored. It cannot generalize from a user’s past behavior to predict preferences in novel situations. A parameterized model can, because the training process forces the model to learn patterns that generalize beyond the training examples — the same property that makes LLMs useful in the first place [1].

Memory parameterization addresses all three: the user’s knowledge is compressed into the model’s weights, consuming zero tokens at inference time; there is no retrieval step to fail; and the fine-tuned model can generalize from observed patterns to novel queries.


3. The Large Personal Model (LPM) Framework

The most developed articulation of this architecture comes from Shang et al. (2024) and Wei et al. (2025), who define a three-layer memory hierarchy and build a complete system called Second Me around it [1][2][4].

3.1 Three-Layer Architecture

  L0 (Raw Data Layer; stored in a vector DB or file system): Unstructured user data (documents, chat logs, images). Equivalent to applying standard RAG to raw data.
  L1 (Natural Language Memory; structured text store): Facts, preferences, and biographical summaries expressed in natural language. Searchable and human-readable.
  L2 (AI-Native Memory; model parameters): Memory that does not require natural language description, learned and organized through model weights. Each user’s L2 is a neural network — the Large Personal Model (LPM).

The key claim of the LPM framework: L2 memory — knowledge encoded in model parameters — can do things that L0 and L1 cannot. It can synthesize across disparate facts, generalize to unseen queries, and reason over compressed representations without consuming context tokens [1][2].

3.2 How L2 Interacts with the Agent LLM

In the Second Me system, the L2 model does not replace the Agent LLM. It serves as a context provider — a bridge between the user and external AI systems [4]. Three deployment scenarios define this interaction:

Memory QA. The Agent LLM (or the user directly) queries the User-Model LLM about the user. “What are this user’s dietary restrictions?” “What programming languages does this user prefer?” The L2 model responds from its parameterized knowledge. This can operate in first-person mode (serving the user directly) or third-party mode (representing the user to an external system) [4].

Context Enhancement. When the user sends a query to an expert Agent LLM, the User-Model LLM enriches the query with relevant personal context before it reaches the agent. Example: the user asks “recommend resources for two-stage model training.” The User-Model LLM, knowing the user has been studying deepseek-coder-v2’s training pipeline and is familiar with GRPO and DPO, adds that context to the query automatically [4].

Context Critique. During an interaction with an external agent, the User-Model LLM evaluates the agent’s response against what it knows about the user’s needs and preferences, providing corrective feedback [4].
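
The three scenarios differ mainly in how the User-Model LLM is prompted. A minimal sketch of the three prompt shapes follows; the wording is illustrative, not the actual Second Me templates:

```python
# Three deployment scenarios for the User-Model LLM, expressed as the
# prompts an orchestration layer might construct for it.

def memory_qa_prompt(question: str, first_person: bool = True) -> str:
    # First-person mode serves the user directly; third-party mode
    # represents the user to an external system.
    role = ("You are the user; answer in first person."
            if first_person else
            "You represent the user to an external system.")
    return f"{role}\nQuestion about the user: {question}"

def context_enhancement_prompt(query: str) -> str:
    # Enrich an outgoing query with personal context before it reaches
    # the expert Agent LLM.
    return ("Rewrite this query, adding personal context the user would "
            f"want an expert agent to know:\n{query}")

def context_critique_prompt(query: str, agent_response: str) -> str:
    # Evaluate an external agent's response against the user's needs.
    return ("Given what you know about the user, critique this response "
            f"to their query.\nQuery: {query}\nResponse: {agent_response}")
```

In all three cases the persona knowledge itself lives in the User-Model's weights; the prompts only select which capability to exercise.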

3.3 Data Flow

┌──────────────────────────────────────────────────────────────┐
│                    USER'S RAW DATA                           │
│  Chat logs, documents, emails, notes, behavioral signals     │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│              DATA MINING & SYNTHESIS                          │
│  1. Entity extraction, topic modeling, relationship mining    │
│  2. Synthetic training data generation:                      │
│     - Memory QA pairs (self + third-party perspectives)      │
│     - Context enhancement scenarios                          │
│     - Context critique scenarios                             │
│  3. Chain-of-Thought (COT) formatting                        │
│  4. Five-level quality filtering                             │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│              TRAINING (per-user, isolated)                    │
│                                                              │
│  Base model: e.g. Qwen2.5-7B-Instruct (frozen)              │
│  Method: PEFT (LoRA) → SFT on synthetic data                │
│  Then: DPO on ~20% preference pairs derived from SFT model  │
│  Output: per-user LoRA adapter (~10-100 MB)                  │
│                                                              │
│  ┌────────────────────────────────────────────────────┐      │
│  │  THE USER-MODEL LLM                                │      │
│  │  Base weights (shared, frozen)                     │      │
│  │  + User-specific LoRA adapter (the "user")         │      │
│  │  = A model that IS this user's memory              │      │
│  └────────────────────────────────────────────────────┘      │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│              INFERENCE (hybrid architecture)                  │
│                                                              │
│  ┌────────────┐     queries      ┌──────────────────┐       │
│  │ Agent LLM  │ ───────────────→ │ User-Model LLM   │       │
│  │ (task      │ ←─────────────── │ (L2 / LPM)       │       │
│  │  executor) │   user context   │                   │       │
│  └─────┬──────┘                  └──────────────────┘       │
│        │                                                     │
│        │ also queries L0 (RAG) and L1 (text facts)          │
│        │ when needed — L2 orchestrates                       │
│        ▼                                                     │
│  Response to user                                            │
└──────────────────────────────────────────────────────────────┘

4. The Training Pipeline in Detail

The Second Me training pipeline (Wei et al., 2025) is the most complete published implementation of this architecture [4]. The pipeline is fully automated and requires no manual annotation.

4.1 Data Sources

Raw input is whatever the user has generated: chat histories, documents, notes, emails, social media posts. This data is sparse and unstructured — a typical user does not generate millions of tokens of neatly formatted training data. The pipeline must therefore synthesize training data from sparse raw inputs.

4.2 Data Synthesis

A teacher LLM (e.g., GPT-4o or DeepSeek-R1) is used to generate synthetic training pairs from the user’s raw data [4]. The synthesis strategies include:

  • Memory QA pairs: Given the user’s data, generate questions someone might ask about the user (or the user might ask themselves), and generate answers grounded in the raw data.
  • Context enhancement pairs: Given a generic query, generate an enriched version that incorporates what the User-Model should know about the user.
  • Context critique pairs: Given an external agent’s response, generate feedback that incorporates the user’s specific needs and preferences.
  • Chain-of-Thought formatting: Training pairs are formatted with explicit reasoning chains. Wei et al. tested three COT strategies — weak (unstructured), multi-step (enforced reasoning then answer), and strong (using DeepSeek-R1 with strict format constraints). Strong COT achieved the highest Memory(Self) score at 0.91 [4].
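
The Memory QA strategy above can be sketched as follows. The teacher is a stub standing in for GPT-4o or DeepSeek-R1, and the prompt wording and return format are assumptions for illustration:

```python
# Sketch of the Memory QA synthesis step: a teacher LLM turns sparse raw
# user text into (prompt, completion) training pairs for SFT.

def synthesize_memory_qa(raw_chunks, teacher, pairs_per_chunk=2):
    """Yield training pairs grounded in the user's raw data."""
    pairs = []
    for chunk in raw_chunks:
        prompt = (f"From this user data, write {pairs_per_chunk} "
                  f"question/answer pairs about the user:\n{chunk}")
        # The teacher is assumed to return a list of (question, answer)
        # tuples parsed from its own output.
        for question, answer in teacher(prompt):
            pairs.append({"prompt": question, "completion": answer})
    return pairs

# Stub teacher, for illustration only.
def stub_teacher(prompt):
    return [("What editor does the user prefer?", "Emacs, per their notes.")]
```

The same loop, with different teacher prompts, produces the context enhancement and context critique pairs; quality filtering then runs over the combined pool.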

4.3 Quality Filtering

A five-level filtering process removes low-quality synthetic data before training. This is critical because synthetic data amplifies errors if unfiltered [4].

4.4 Training

Training proceeds in two stages on a per-user basis, with each user’s data strictly isolated [4]:

Stage 1 — Supervised Fine-Tuning (SFT): The base model (Qwen2.5-7B-Instruct in Second Me) is fine-tuned using LoRA on the filtered synthetic data. The base model weights remain frozen; only the LoRA adapter weights are trained.

Stage 2 — Direct Preference Optimization (DPO): After SFT, the fine-tuned model generates responses to a subset of queries. These responses are scored against the user’s actual data to create preference pairs (better response vs. worse response). DPO training then refines the model to better align with user-specific preferences. DPO data constitutes approximately 20% of the total SFT training data volume [4].

The DPO stage does not introduce new knowledge; it refines the model’s understanding of the user’s priorities among entities and relationships it already knows from SFT [4].
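
Numerically, the DPO objective used in Stage 2 reduces to a logistic loss on log-probability margins: the policy is pushed to prefer the chosen response over the rejected one by more than the frozen SFT reference model does. A sketch in plain Python, with `beta` as the usual DPO temperature:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen (preferred)
    and rejected responses under the policy being trained and under the
    frozen SFT reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)): zero margin gives log(2); a policy that
    # prefers the chosen response more than the reference does gives less.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy exactly matches the reference, the margin is zero and the loss sits at log 2; training drives the margin positive, reweighting preferences without adding new knowledge, which matches the role described above.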

4.5 Evaluation

Wei et al. define four automated evaluation metrics [4]:

  Memory (Self): first-person interaction. Can the model answer questions about the user? Scored 0–1.
  Memory (Third-party): third-person interaction. Can the model represent the user to others? Scored 0–1.
  Context Enhance: can the model enrich a query with relevant user context? Scored 0–1 on a three-level scale.
  Context Critic: can the model critique external responses using user context? Scored 0–1 on a five-level scale.

Best results (Strong COT + SFT + DPO): Memory(Self) = 0.91, Context Enhance = 0.75, Context Critic = 0.85 [4].


5. Related Approaches

5.1 Adversarial Contrastive Distillation (ACD)

Kannan (2025) demonstrates a self-supervised pipeline for fine-tuning a model to adopt a specific user’s persona from chat history [5]. A teacher model (GPT-4o-mini) generates contrarian examples — responses that deliberately differ from how the user would respond. These contrastive pairs are used to fine-tune a student model (DistilBERT + LoRA) to distinguish the user’s authentic style from alternatives. On WhatsApp chat history, this achieved 85.69% accuracy at identifying the user’s responses vs. 66.91% for the base model [5].

The ACD approach is significant because it solves the data annotation problem for per-user fine-tuning: no human labeling is required. The teacher model generates both positive and negative examples automatically.
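
The data-construction step at the heart of ACD can be sketched as follows; the contrarian teacher is stubbed, and the pair format is an assumption for illustration:

```python
# Sketch of ACD pair construction: for each real user message, a teacher
# model writes a deliberately off-style alternative, yielding a labeled
# contrastive pair with no human annotation.

def build_contrastive_pairs(user_messages, contrarian_teacher):
    """Return labeled examples: 1 = authentic user text, 0 = contrarian."""
    pairs = []
    for msg in user_messages:
        fake = contrarian_teacher(msg)  # style-contrary rewrite of msg
        pairs.append({"text": msg, "label": 1})
        pairs.append({"text": fake, "label": 0})
    return pairs
```

A classifier (DistilBERT + LoRA in Kannan's setup) is then fine-tuned on these pairs to separate the user's authentic style from the alternatives.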

5.2 Persona-Plug (PPlug)

Liu et al. (2024) take an intermediate approach: rather than fine-tuning a separate LLM per user, they train a lightweight user embedder module that compresses the user’s entire history into a single dense embedding vector [3]. This vector is prepended to the Agent LLM’s input, conditioning its output on the user. The base LLM’s parameters are never modified.

This is not the full weights-as-state approach (the user is represented as a vector, not as model weights), but it sits on the continuum between RAG-style retrieval and full parameterization. It outperforms retrieval-based personalization on the LaMP benchmark [3].
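
The mechanics of the PPlug idea can be sketched with arrays; here mean pooling stands in for the learned user embedder, which in the actual system is a trained module:

```python
import numpy as np

def user_embedding(history_embeddings: np.ndarray) -> np.ndarray:
    """Compress an (n_items, d) history into a single (d,) user vector.

    Mean pooling is a placeholder for the trained embedder.
    """
    return history_embeddings.mean(axis=0)

def prepend_persona(input_embeddings: np.ndarray,
                    user_vec: np.ndarray) -> np.ndarray:
    """Condition the model by adding one soft 'persona token' up front.

    The base LLM's weights are never modified; only its input changes.
    """
    return np.vstack([user_vec[None, :], input_embeddings])
```

The cost profile is what places this on the continuum: one extra input position per query, versus zero for full parameterization and many for RAG.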

5.3 AI PERSONA: Life-Long Personalization

Wang et al. (2024) introduce the concept of life-long personalization — the User-Model must update continuously as the user’s preferences and context evolve [6]. Their framework uses a persona optimizer that, every k sessions, processes recent dialogue history and outputs revised persona fields. These are stored as a dictionary and prepended at every inference step [7].

This is prompt-level rather than weights-level, but it formalizes the problem that any weights-as-state system must solve: how to update the user model without catastrophic forgetting and without requiring full retraining.
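
The persona-optimizer loop can be sketched as follows; `revise_persona` stands in for the LLM call that rewrites the persona fields, and the cadence logic is the part a weights-level system would reuse to schedule adapter refreshes:

```python
# Sketch of life-long persona updating: every k sessions, recent dialogue
# is folded into a persona dictionary that is prepended at inference.

def maybe_update_persona(persona, session_log, session_count, k,
                         revise_persona):
    """Run the optimizer only on every k-th session; otherwise no-op."""
    if session_count % k == 0:
        persona = revise_persona(persona, session_log)
    return persona

def render_persona(persona: dict) -> str:
    # Prompt-level systems prepend this text to every query; a
    # weights-as-state system would instead retrain its adapter on the
    # same cadence.
    return "\n".join(f"{field}: {value}" for field, value in persona.items())
```
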

5.4 CharLoRA / CharacterBot

Wang et al. (2025) use low-rank adapter modules (CharLoRA) to encode both surface-level and deep persona knowledge into a model, supporting multi-task adaptation across different facets of a user’s personality [7]. This demonstrates that LoRA adapters can encode personality structure, not just factual preferences.


6. Open Technical Problems

6.1 Cold Start

A User-Model LLM requires sufficient user data to train. For new users, there is no data. The LPM framework acknowledges this and proposes falling back to L1 (text-based personalization) until enough data accumulates, or using role-play methods to generate synthetic seed data [1].

6.2 Catastrophic Forgetting

When the model is retrained on new user data, it risks forgetting previously learned information. This is the standard catastrophic forgetting problem in continual learning, but applied to a single user’s evolving identity. Existing work on continual learning (elastic weight consolidation, progressive networks) is relevant but not yet validated in the per-user LPM context [1].
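
Elastic weight consolidation, mentioned above, reduces to a quadratic penalty on weight movement, scaled per parameter by Fisher information (roughly, how much the old data cared about each weight). A sketch of the penalty term, not validated in the per-user LPM setting:

```python
import numpy as np

def ewc_penalty(weights: np.ndarray, old_weights: np.ndarray,
                fisher: np.ndarray, lam: float = 1.0) -> float:
    """EWC regularizer: (lam / 2) * sum_i F_i * (w_i - w_old_i)^2.

    Added to the task loss when retraining on new user data, it lets
    unimportant weights move freely while anchoring the weights the old
    user data depended on.
    """
    return 0.5 * lam * np.sum(fisher * (weights - old_weights) ** 2)
```
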

6.3 Contradiction Resolution

Users change their minds. A user who was vegetarian in January may not be in June. The model must learn to override old facts with new ones without destabilizing unrelated knowledge. Wei et al. note this as an open challenge; Second Me’s DPO stage provides some implicit handling by reweighting preferences, but it is not a systematic solution [1][4].

6.4 Scalability

Each user requires a per-user LoRA adapter. At 10–100 MB per adapter, serving millions of users requires terabytes of adapter storage and a multi-LoRA serving infrastructure capable of loading the correct adapter per request. This is an engineering problem with existing solutions (LoRAX, vLLM multi-LoRA, S-LoRA) but remains expensive at scale.
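
The serving-side problem can be reduced to its core: one frozen base model, millions of small adapters, and a bounded pool of adapters resident in fast memory. An LRU cache over adapter IDs is the minimal version of what LoRAX, S-LoRA, and vLLM multi-LoRA do with real weight loading; everything here is a sketch, not any of those systems' APIs:

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` per-user LoRA adapters resident."""

    def __init__(self, capacity: int, load_adapter):
        self.capacity = capacity
        self.load_adapter = load_adapter  # user_id -> adapter weights
        self.cache = OrderedDict()

    def get(self, user_id):
        if user_id in self.cache:
            self.cache.move_to_end(user_id)  # mark as recently used
            return self.cache[user_id]
        adapter = self.load_adapter(user_id)  # e.g. fetch a 10-100 MB blob
        self.cache[user_id] = adapter
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return adapter
```

The real systems add batched multi-adapter kernels and paged adapter memory on top, but the request-routing shape is the same: resolve user to adapter, load on miss, evict under pressure.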

6.5 Evaluation

Current benchmarks (LaMP, LOCOMO) do not adequately test whether a model has truly learned to be a user versus simply memorizing surface-level facts. Wang et al. (2024) argue that existing personalization benchmarks are insufficient because non-personalized LLMs achieve competitive performance on them, suggesting the benchmarks do not actually require deep user understanding [6].

BehaviorChain (Li et al., 2025) is the first benchmark specifically testing whether LLMs can simulate continuous human behavior (15,846 behaviors across 1,001 personas). Even state-of-the-art models show significant gaps [8].

6.6 Privacy

A model fine-tuned on user data is user data. It can be prompted to reveal facts about the user that the user might not want disclosed. The LPM framework mandates local/on-device deployment to address this [4], but adversarial extraction of memorized training data from fine-tuned models is a known vulnerability in the broader LLM security literature.


7. Current Status of the Field (March 2026)

The weights-as-state approach to user modeling is at an early research stage with one major open-source implementation (Second Me / Mindverse, GitHub: github.com/Mindverse/Second-Me). The key milestones:

  • June 2024: Shang et al. publish “AI-native Memory: A Pathway from LLMs Towards AGI,” formally defining the LPM concept and the L0/L1/L2 hierarchy [1].
  • December 2024: Wang et al. introduce AI PERSONA and the life-long personalization task formulation [6].
  • March 2025: Wei et al. publish “AI-native Memory 2.0: Second Me” with the complete automated training pipeline, evaluation metrics, and open-source deployment system [4].
  • 2025 (ongoing): The broader field of persona-based LLM systems is converging on the question of how to represent users — with approaches ranging from prompt-level persona descriptions, through learned user embeddings (Persona-Plug), to full parameterization (LPM/Second Me) [7].

The commercial landscape is dominated by RAG-based and prompt-injection-based memory systems (Mem0, Letta/MemGPT, LangMem, ChatGPT Memory, Claude Memory). These are simpler to deploy but face the fundamental limitations described in Section 2. The weights-as-state approach remains primarily academic, with Second Me as the most complete implementation.


8. Glossary

Agent LLM: The task-executing LLM that serves the user. It queries the User-Model LLM for personalization context.
AI-native memory: Memory encoded in model parameters rather than in external data stores; the model’s weights are the memory. Coined by Shang et al. (2024).
ACD: Adversarial Contrastive Distillation. A self-supervised method for training a model to adopt a user’s persona by generating contrarian examples and training on the contrast.
CharLoRA: A variant of LoRA adapters designed to encode both surface and deep persona knowledge, enabling multi-facet personality representation.
Cold start: The problem of having no user data with which to train a User-Model LLM for a new user.
Context enhancement: A deployment scenario where the User-Model LLM enriches a user’s query with relevant personal context before it reaches an external agent.
Context critique: A deployment scenario where the User-Model LLM evaluates an external agent’s response against the user’s known needs and preferences.
DPO: Direct Preference Optimization. A training method that aligns model outputs with user preferences using a classification loss on preference pairs, without requiring a separate reward model or RL loop.
L0 / L1 / L2: The three layers of the LPM memory hierarchy: L0 = raw data (RAG), L1 = natural language summaries, L2 = AI-native memory (model parameters).
LoRA: Low-Rank Adaptation. A parameter-efficient fine-tuning method that freezes base model weights and trains small low-rank matrices added to each layer. Enables per-user adapters at ~10-100 MB each.
LPM: Large Personal Model. A neural network fine-tuned on a specific user’s data to serve as that user’s memory and reasoning proxy. Each user has their own LPM. Defined by Shang et al. (2024).
Memory parameterization: The process of encoding a user’s knowledge, preferences, and behavioral patterns into model weights via fine-tuning, rather than storing them in external databases.
Memory QA: A deployment scenario where the User-Model LLM answers questions about the user, either to the user themselves (Self) or to third-party systems (Third-party).
PEFT: Parameter-Efficient Fine-Tuning. A family of techniques (LoRA, adapters, prefix tuning) that train only a small fraction of model parameters while keeping the base model frozen.
Persona optimizer: A component that periodically processes recent interaction history and updates the user’s persona representation (either as prompt text or adapter weights).
SFT: Supervised Fine-Tuning. Training on (input, desired output) pairs using a standard language modeling loss.
User-Model LLM: A fine-tuned LLM whose weights encode a specific user’s identity, preferences, reasoning patterns, and personal history. Serves as the persistent state for user modeling.
Weights-as-state: The architectural principle that model parameters (not external databases or text profiles) serve as the persistent representation of a user.

Bibliography

[1] Shang, J. et al. “AI-native Memory: A Pathway from LLMs Towards AGI.” arXiv:2406.18312, June 2024. [Verified]
https://arxiv.org/html/2406.18312v1
Defines the L0/L1/L2 hierarchy and the LPM concept; discusses the cold-start and catastrophic-forgetting challenges.

[2] Prabhakar, A.V. “AI-native Memory and the Rise of Context-Aware AI Agents.” Blog post, June 2025. [Verified]
https://ajithp.com/2025/06/30/ai-native-memory-persistent-agents-second-me/
Describes L2 encoding of memory into model parameters and the Second Me framework.

[3] Liu, J. et al. “LLMs + Persona-Plug = Personalized LLMs.” arXiv:2409.11901, September 2024. [Snippet]
https://arxiv.org/abs/2409.11901
Abstract describes user-specific conditioning via a lightweight plug-in user embedder module.

[4] Wei, J. et al. “AI-native Memory 2.0: Second Me.” arXiv:2503.08102, March 2025. [Verified]
https://arxiv.org/html/2503.08102v1
Complete training pipeline (data synthesis, five-level filtering, SFT + DPO on Qwen2.5-7B-Instruct), evaluation metrics, Strong COT scores, and hybrid architecture design.

[5] Kannan, A. “Self-Supervised Persona (Auto Fine-Tuning LLMs).” Medium, April 2025. [Snippet]
https://medium.com/@aadharshkannan/self-supervised-persona-auto-fine-tuning-llms-af30fa3ff192
Describes the ACD framework and the WhatsApp fine-tuning results (85.69% accuracy).

[6] Wang, T. et al. “AI PERSONA: Towards Life-long Personalization of LLMs.” arXiv:2412.13103, December 2024. [Snippet]
https://arxiv.org/abs/2412.13103
Abstract describes the life-long personalization task formulation and the persona optimizer.

[7] Emergent Mind. “Persona-Based LLM System.” Topic page, 2025. [Snippet]
https://www.emergentmind.com/topics/persona-based-language-model-system
Describes AI PERSONA recursive persona updates, CharLoRA, and Persona-DB architectures.

[8] Li, R. et al. “How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation.” Findings of ACL 2025. [Snippet]
https://aclanthology.org/2025.findings-acl.813/
Abstract describes the BehaviorChain benchmark (15,846 behaviors across 1,001 personas).
