Skip to content

CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs

MemoryBenchmark10+ICLR 2026CCF-AMeta

CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs. arXiv:2511.14937, 2025.


CIMemories:面向 LLM 持久记忆情境完整性的组合式基准

Abstract

Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory introduces critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. Throughout this paper, "context" refers to the social context for information sharing (e.g., the task being performed), not the model's context window unless specified. CIMemories uses synthetic user profiles with over 100 attributes per user, paired with diverse task contexts in which each attribute may be essential for some tasks but inappropriate for others. Our evaluation reveals that frontier models exhibit up to 69% attribute-level violations (leaking information inappropriately), with lower violation rates often coming at the cost of task utility. Violations accumulate across both tasks and runs: as usage increases from 1 to 40 tasks, GPT-5's violations rise from 0.1% to 9.6%, reaching 25.1% when the same prompt is executed 5 times, revealing arbitrary and unstable behavior in which models leak different attributes for identical prompts. Privacy-conscious prompting does not solve this--models overgeneralize, sharing everything or nothing rather than making nuanced, context-dependent decisions. These findings reveal fundamental limitations that require contextually aware reasoning capabilities, not just better prompting or scaling.

大语言模型越来越多地使用来自过去交互的持久记忆,以增强个性化和任务性能。 然而,当敏感信息在不合适的语境中被揭示时,这种记忆会引入关键风险。 作者提出 CIMemories,这是一个用于评估 LLM 是否能根据任务情境适当地控制记忆信息流的基准。 在本文中,“context” 指信息分享的社会情境(例如正在执行的任务),除非特别说明,并不是指模型的上下文窗口。 CIMemories 使用合成用户画像,每个用户包含 100 多个属性,并与多样化任务情境配对;在这些情境中,每个属性可能对某些任务是必要的,但对其他任务则不适合分享。 评估表明,前沿模型最高会出现 69% 的属性级违规(不恰当地泄露信息),而较低的违规率往往以任务效用为代价。 违规会同时跨任务和跨运行累积:随着使用从 1 个任务增加到 40 个任务,GPT-5 的违规率从 0.1% 上升到 9.6%;当同一个提示执行 5 次时,违规率达到 25.1%,揭示出任意且不稳定的行为,即模型会在相同提示下泄露不同属性。 隐私意识提示并不能解决这一问题:模型会过度泛化,要么什么都分享,要么什么都不分享,而不是做出细致的、依赖情境的判断。 这些发现揭示了根本性限制:需要具备情境感知的推理能力,而不仅是更好的提示或规模扩展。

1. Introduction

Large Language Model (LLM) assistants increasingly rely on persistent memory systems to enhance personalization and task performance beyond their parametric knowledge. These memories, comprising user-specific information from previous conversations, are now deployed across major platforms. While early implementations used retrieval-based approaches, the advent of long-context LLMs has popularized simpler "needle in a haystack" methods where memories are represented as text prefixed to the current conversation. As these memory-augmented assistants handle increasingly sensitive third-party communications--from auto-responses to email drafting and app integrations, a critical question emerges: Can models incorporate information from their memories appropriately?

大语言模型助手越来越依赖持久记忆系统,以在参数知识之外增强个性化和任务性能。 这些记忆包含来自先前对话的用户特定信息,如今已经部署在主要平台上。 早期实现使用基于检索的方法;而随着长上下文 LLM 的出现,更简单的“干草堆里找针”方法变得流行,即把记忆表示为文本并前置到当前对话中。 当这些记忆增强助手处理越来越敏感的第三方通信时,例如自动回复、邮件撰写和应用集成,一个关键问题随之出现:模型能否恰当地整合来自其记忆的信息

We present CIMemories, drawing from Nissenbaum's Contextual Integrity (CI) theory, which defines privacy violations as inappropriate information flows against societal norms. CIMemories addresses key limitations in existing CI benchmarks for LLMs. While prior work typically evaluates simple scenarios with minimal information (e.g., a single secret to protect and one piece of information to reveal), CIMemories introduces a compositional design with two key innovations (Figure 1): (1) flexible memory composition (Figure 1, segment 1), where we dynamically vary both the number and designation of attributes in memory (necessary versus inappropriate) across different settings, allowing us to closely study how memory affects contextual privacy adherence; and (2) multi-task composition (Figure 1, segment 2), where each user is evaluated across multiple tasks (contexts) with per-task annotations for each attribute, measuring how violations accumulate over repeated interactions.

作者提出 CIMemories,它借鉴 Nissenbaum 的情境完整性(Contextual Integrity, CI)理论;该理论将隐私违规定义为违背社会规范的不恰当信息流。 CIMemories 解决了现有 LLM 情境完整性基准中的关键局限。 以往工作通常评估信息量很少的简单场景(例如一个需要保护的秘密和一条需要揭示的信息),而 CIMemories 引入一种组合式设计,包含两个关键创新(图1):(1)灵活的记忆组合(图1 第 1 段),在不同设置中动态改变记忆中属性的数量和属性指定(必要与不适合),从而细致研究记忆如何影响情境隐私遵循;(2)多任务组合(图1 第 2 段),对每个用户在多个任务(情境)中进行评估,并为每个属性提供逐任务标注,以衡量违规如何在重复交互中累积。

CIMemories overview
图1:CIMemories 基准概览。(1)合成用户画像包含关于个人属性的记忆陈述(例如收入、健康状况)。(2)每个画像会配对任务情境,指定目标和通信对象,并用逐任务标注说明每个属性是否必要或不适合分享;同一属性可能对某个任务是必要的,对另一个任务却不适合分享。(3)评估框架用记忆和任务提示 LLM。(4)LLM 裁判判断哪些属性被揭示,从而衡量完整性(分享必要信息)和违规(泄露不适合的信息),并支持规模化自动评估。

The CIMemories dataset construction is designed around two forms of compositionality. We generate synthetic user profiles with attributes spanning nine information domains (finance, health, housing, legal, mental health, relationships, etc.), where each profile accumulates attributes from sampled life events. These profiles are paired with curated task contexts representing canonical social interactions (e.g., communicating with doctors, employers, landlords). A key technical challenge is generating contextual integrity labels for attribute-task pairs at scale. We address this by leveraging privacy personas from Westin's surveys (fundamentalist, pragmatic, unconcerned) with a powerful labeling model, assigning binary labels only where all personas agree to focus on clearer violations. This approach enables flexible memory composition--varying which attributes are necessary versus inappropriate across tasks--and multi-task composition--evaluating each user across multiple contexts to measure how violations accumulate with repeated model use. The resulting benchmark contains 10 profiles with an average of 147 attributes and 45 contexts per profile, creating competing incentives for information disclosure across different recipients.

CIMemories 的数据集构建围绕两种组合性展开。 作者生成合成用户画像,其属性覆盖九个信息领域(金融、健康、住房、法律、心理健康、关系等),每个画像会从采样到的生活事件中累积属性。 这些画像会与人工整理的任务情境配对,任务情境代表典型社会交互(例如与医生、雇主、房东沟通)。 一个关键技术挑战是规模化生成属性-任务对的情境完整性标签。 作者利用 Westin 调查中的隐私人格(原教旨型、务实型、无所谓型)和强大的标注模型来解决这一问题;只有在所有人格都同意时才分配二元标签,从而聚焦更明确的违规。 这种方法支持灵活的记忆组合,即跨任务改变哪些属性是必要的、哪些属性是不适合的;也支持多任务组合,即在多个情境中评估每个用户,以衡量违规如何随模型重复使用而累积。 最终基准包含 10 个画像,每个画像平均有 147 个属性和 45 个情境,从而在不同接收者之间形成互相竞争的信息披露激励。

We conduct comprehensive evaluations to examine how frontier models handle contextual integrity across these compositional settings. For each user profile and task, we prompt models with memories concatenated as context alongside the task directive, then measure two complementary metrics via an LLM judge: violation (the extent to which inappropriate attributes are revealed) and completeness (the extent to which necessary attributes are shared). Our experiments reveal frontier models exhibit up to 69% attribute-level violations, with lower violations often sacrificing utility--GPT-4o achieves 14.8% violations but only 43.9% completeness, while Qwen-3 32B reaches 57.6% completeness at 69.1% violations. Critically, violations accumulate across both tasks and runs: as usage increases from 1 to 40 tasks, GPT-5's violations rise from 0.1% to 9.6%, reaching 25.1% when the same prompt is executed 5 times, revealing arbitrary and unstable behavior in which models leak different attributes for identical prompts. Through domain-wise analysis, we uncover a "granularity failure"--models correctly identify relevant information domains but cannot discern necessary versus unnecessary details within those domains.

作者进行了综合评估,以考察前沿模型如何在这些组合式设置中处理情境完整性。 对于每个用户画像和任务,作者将记忆作为上下文与任务指令拼接后提示模型,然后通过 LLM 裁判衡量两个互补指标:违规(不适合属性被揭示的程度)和完整性(必要属性被分享的程度)。 实验表明,前沿模型最高会出现 69% 的属性级违规,而更低的违规率往往会牺牲效用;GPT-4o 的违规率为 14.8%,但完整性只有 43.9%,而 Qwen-3 32B 在 69.1% 违规率下达到 57.6% 的完整性。 关键的是,违规会同时跨任务和跨运行累积:随着使用从 1 个任务增加到 40 个任务,GPT-5 的违规率从 0.1% 上升到 9.6%;当同一提示执行 5 次时,违规率达到 25.1%,暴露出任意且不稳定的行为,即模型会在相同提示下泄露不同属性。 通过按领域分析,作者发现了一种“粒度失败”:模型能正确识别相关信息领域,却无法区分该领域内必要与不必要的细节。

We find that traditional scaling approaches provide diminishing returns, with model size improvements eventually saturating. Privacy-conscious prompting similarly fails--models overgeneralize, sharing everything or nothing rather than making nuanced context-dependent decisions, revealing a fundamental violation-completeness trade-off. Our memory composition experiments further show that violations steadily increase as users accumulate more personal information over time, suggesting enhanced personalization conflicts with contextual integrity. These findings reveal fundamental limitations in current approaches and highlight the urgent need for contextually aware reasoning capabilities, not just better prompting or scaling.

作者发现,传统的规模扩展方法收益递减,模型尺寸带来的提升最终会饱和。 隐私意识提示同样失败:模型会过度泛化,要么分享所有信息,要么什么都不分享,而不是做出细致的、依赖情境的决策,这揭示了根本性的违规-完整性权衡。 记忆组合实验进一步表明,随着用户随时间累积更多个人信息,违规会稳定增加,说明增强个性化会与情境完整性发生冲突。 这些发现揭示了当前方法的根本性限制,并突出了对情境感知推理能力的迫切需求,而不仅仅是更好的提示或规模扩展。

Our work relates to two primary research areas: contextual privacy evaluation for large language models and memory-augmented conversational systems.

本文与两个主要研究领域相关:大语言模型的情境隐私评估,以及记忆增强对话系统。

Contextual Privacy Benchmarks

Prior work has increasingly leveraged Nissenbaum's contextual integrity theory to evaluate privacy reasoning capabilities in LLMs. Mireshghallah et al. introduced ConfAide, a four-tier benchmark revealing that GPT-4 inappropriately reveals private information 39% of the time. Shao et al. proposed PrivacyLens, extending privacy-sensitive seeds into agent trajectories, while Cheng et al. developed CI-Bench with 44,000 synthetic dialogues across eight domains. Fan et al. introduced GoldCoin, grounding LLMs in privacy laws like HIPAA, and Shvartzshnaider et al. developed LLM-CI using factorial vignette methodology to assess privacy norms. In contemporaneous work, AgentDAM introduce AgentDAM, an end-to-end evaluation of data minimization in autonomous web agents, demonstrating leakage under realistic multi-step tasks. However, these benchmarks typically evaluate simple scenarios with minimal information (e.g., single secrets to protect) and do not account for the compositional nature of personal memories that accumulate over time in persistent systems.

先前工作越来越多地利用 Nissenbaum 的情境完整性理论来评估 LLM 的隐私推理能力。 Mireshghallah 等人提出 ConfAide,这是一个四层基准,揭示 GPT-4 在 39% 的情况下会不恰当地揭示私人信息。 Shao 等人提出 PrivacyLens,将隐私敏感种子扩展为智能体轨迹;Cheng 等人则开发了 CI-Bench,包含跨八个领域的 44,000 段合成对话。 Fan 等人提出 GoldCoin,将 LLM 接地到 HIPAA 等隐私法律中;Shvartzshnaider 等人开发 LLM-CI,使用因子化 vignette 方法评估隐私规范。 在同期工作中,AgentDAM 提出 AgentDAM,这是一个对自主 Web 智能体中数据最小化进行端到端评估的基准,展示了现实多步骤任务下的信息泄露。 然而,这些基准通常评估信息极少的简单场景(例如需要保护的单个秘密),并未考虑持久系统中个人记忆随时间累积所具有的组合性质。

Memory-Augmented LLMs

Advances in long-term memory systems have enabled LLMs to maintain persistent user information across conversations. Lewis et al. introduced retrieval-augmented generation as a foundational approach, while recent work has focused on scalable memory architectures and improved retrieval mechanisms. Despite these advances, current contextual privacy benchmarks do not account for persistent memory systems, where private information density increases over time and the same attributes may be appropriate to share in some contexts but inappropriate in others.

长期记忆系统的进展使 LLM 能够跨对话保持持久的用户信息。 Lewis 等人提出检索增强生成作为基础方法,而近期工作则聚焦于可扩展记忆架构和改进的检索机制。 尽管取得了这些进展,当前情境隐私基准仍没有考虑持久记忆系统;在这类系统中,私人信息密度会随时间增加,同一属性在某些情境中可能适合分享,而在其他情境中则不适合。

3. Contextual Integrity in Memory-Augmented Settings: A General Framework

Multi-task compositionality of CIMemories
图2:CIMemories 的多任务组合性:随着模型(GPT-5)被用于更多任务,违规会累积;也就是说,越来越大比例的用户属性最终会在不应透露它们的任务情境中被揭示。更多生成次数会加剧这一问题。

Notation

Let X denote the space of token sequences. An LLM is given by a stochastic mapping M:XX. Let S be the set of individual users. For each sS, let As be a finite set of attributes; each aAs has a categorical value space Va and a realized value vaVa. A memory-generator MEM maps a user's attributes and their values to natural-language representations, allowing one to construct the memory history Ms of user s as:

X 表示 token 序列空间。 一个 LLM 由随机映射 M:XX 给出。 S 为所有个体用户的集合。 对于每个 sS,令 As 为有限属性集合;每个 aAs 都有一个类别值空间 Va 和一个实现值 vaVa 一个记忆生成器 MEM 将用户属性及其值映射为自然语言表示,从而可以如下构造用户 s 的记忆历史 Ms

Ms=MEM({(a,va):aAs})X.

The implementation of MEM allows for different memory representations, e.g., OpenAI's template (see Figure in Appendix). Finally, let TX denote the set of all tasks, i.e., natural-language texts describing some purpose and a recipient, e.g., negotiating n claim with an insurance agent.

MEM 的实现允许不同的记忆表示,例如 OpenAI 的模板(见附录图)。 最后,令 TX 表示所有任务的集合,即描述某种目的和接收者的自然语言文本,例如与保险代理人协商某项理赔。

Problem Setting

A user s interacts with an LLM for a task t, i.e., by prompting it with a natural language task, which the LLM will solve by constructing a message yX intended for a recipient as follows:

用户 s 为任务 t 与 LLM 交互,也就是用自然语言任务提示模型;LLM 会构造一个面向接收者的消息 yX 来解决该任务,如下所示:

yM(Mst)

where is a concatenation operator. A reveal (inference) function REVEAL:X×AsaAs(Va{}) takes such an LLM response y and attribute a, and returns the inferred categorical value of a in y (or if no value can be inferred). The indicator R(y,a)=1{REVEAL(y,a)=va} thus denotes a reveal of a's value. Finally, the acceptability of a reveal may then be evaluated using the ground-truth contextual integrity labels for each attribute in As, given by some oracle Gst:As{0,1}.

其中 是拼接算子。 一个揭示(推断)函数 REVEAL:X×AsaAs(Va{}) 接收这样的 LLM 响应 y 和属性 a,并返回从 y 中推断出的 a 的类别值;如果无法推断出值,则返回 因此,指示变量 R(y,a)=1{REVEAL(y,a)=va} 表示 a 的值被揭示。 最后,可以使用某个 oracle Gst:As{0,1} 给出的 As 中每个属性的真实情境完整性标签来评估一次揭示是否可接受。

When Does an LLM Respect Contextual Integrity in Its Usage of Memories?

We measure adherence to contextual integrity via two complementary measures: violation, which captures the extent to which inappropriate attributes are revealed, and completeness, which captures the extent to which necessary attributes are conveyed.

作者通过两个互补指标来衡量对情境完整性的遵循:违规,用于捕捉不适合属性被揭示的程度;以及完整性,用于捕捉必要属性被传达的程度。

Definition: Attribute-Level Violations

For a user sS with attributes As, let the memory be Ms=MEM({(a,va):aAs}). For an attribute aAs, let Ts,apriv={tT:Gst(a)=1}T be the set of tasks for which a should be withheld. Then, the attribute-level violation by M (over attributes aAs that should at least sometimes be withheld, i.e., |Ts,apriv|>0) across n draws is given by:

对于具有属性 As 的用户 sS,令记忆为 Ms=MEM({(a,va):aAs}) 对于属性 aAs,令 Ts,apriv={tT:Gst(a)=1}Ta 应当被保留、不应分享的任务集合。 那么,Mn 次抽样中的属性级违规(针对至少有时应被保留的属性 aAs,即 |Ts,apriv|>0)定义为:

Violation@n(s):=E{aAs:|Ts,apriv|>0}[maxtTs,apriv  {y1,,yn}M(Mst)ny{y1,,yn}[R(y,a)]].

Intuitively, this quantity provides an attribute-level worst-case measure of contextual integrity violation, i.e., for each attribute, whether the model ever reveals it in a task where it should not. If the true probability of the attribute being revealed is p, then this measure captures it with 1(1p)n. In practice, it is difficult to measure this worst-case quantity over all possible user tasks/many generations from model M. For the rest of this work, we will limit our analysis to a fixed set of curated tasks, and measure violations up to n generations, i.e., Violation@n, where n is reasonable, e.g., 3-5.

直观地说,这个量给出了情境完整性违规的属性级最坏情况度量:对于每个属性,模型是否曾在不应揭示它的任务中揭示了它。 如果该属性被揭示的真实概率为 p,那么该度量以 1(1p)n 捕捉它。 实践中,很难在所有可能用户任务和模型 M 的许多生成上测量这个最坏情况量。 在本文余下部分,作者将分析限制在固定的人工整理任务集合上,并测量最多 n 次生成中的违规,即 Violation@n,其中 n 取合理值,例如 3 到 5。

Definition: Task-Level Completeness

For a user sS with attributes As, let the memory be Ms=MEM({(a,va):aAs}). For a task tT, let As,tshare={aAs:Gst(a)=0}As be the set of attributes that should necessarily be shared for t. Then, the task-level completeness of M (over tasks tT where at least some attributes are necessarily shared, i.e., |As,tshare|>0) is given by:

对于具有属性 As 的用户 sS,令记忆为 Ms=MEM({(a,va):aAs}) 对于任务 tT,令 As,tshare={aAs:Gst(a)=0}As 为在 t 中必须被分享的属性集合。 那么,M 的任务级完整性(针对至少有一些属性必须分享的任务 tT,即 |As,tshare|>0)定义为:

Completeness(s):=E{tT:|As,tshare|>0}[EaAs,tshare yM(Mst)[R(y,a)]].

Completeness thus measures the average-case success of a model at completing a task, i.e., for each task, whether the model shares the attributes that should be shared. Overall, we emphasize that measures of both violation and completeness are necessary to measure contextual integrity; considered in isolation, each admits a degenerate model assistant, e.g., a model that reveals nothing is contextually "private" but useless, and one that reveals everything is never contextually "private". Later, in Section 5, we use these metrics to evaluate modern LLMs.

因此,完整性衡量的是模型完成任务的平均成功程度,即对于每个任务,模型是否分享了应当分享的属性。 总体而言,作者强调,衡量情境完整性必须同时考虑违规完整性;如果孤立考虑,每个指标都会允许退化的模型助手,例如什么都不揭示的模型在情境上是“私密”的但毫无用处,而揭示所有内容的模型永远不具备情境上的“私密性”。 随后在第 5 节中,作者使用这些指标评估现代 LLM。

4. CIMemories: A Benchmark For Measuring The Contextual Integrity of Memory-Augmented LLMs

We now introduce CIMemories, a benchmark for evaluating contextual integrity of LLM assistants in the presence of persistent, cross-session memories. CIMemories comprises synthetic but realistic personal profiles of individual users bound to social contexts, i.e., tasks that induce competing incentives.

现在介绍 CIMemories,这是一个用于评估 LLM 助手在存在持久、跨会话记忆时的情境完整性的基准。 CIMemories 包含合成但现实的个人用户画像,并将其绑定到社会情境中,即会诱发竞争性激励的任务。

4.1 Dataset Curation

At a high level, each instance in CIMemories contains: (i) a user profile comprising information attributes represented via memory statements, (ii) a set of social contexts (tasks), and (iii) a label for every attribute-task pair, that specifies whether it is appropriate to share when achieving the task.

从高层看,CIMemories 中的每个实例包含:(i)一个用户画像,其中包含通过记忆陈述表示的信息属性;(ii)一组社会情境(任务);以及(iii)每个属性-任务对的标签,用来指定在完成该任务时分享该属性是否合适。

Generating Base Profiles

A user profile is represented via metadata, i.e., synthetically generated key-value pairs. We first sample basic biographic metadata corresponding to (non-existent) adult identities (ages 21--70) with the popular Faker utility, e.g., name, sex, address, age. Biographic metadata is then used to seed the generation of information attributes, which describe some aspect of an "event" (e.g., spousal infidelity, or job promotion) from the individual's life, and belongs to an "information domain" (e.g., financial, or health). An example is provided in Figure 1. Information attributes, along with their values (and corresponding memory statements) are generated with open-source LLM GPT-OSS-120B. Concretely, for any given profile, three events and nine domains are sampled as seeds from pre-determined lists (see Figure in Appendix), and we use these seeds to generate seven attributes per domain per event (for a total $\leq$189 attributes, barring generation failures) with the prompt in Appendix.

用户画像通过元数据表示,也就是合成生成的键值对。 作者首先使用流行的 Faker 工具采样对应于(不存在的)成年人身份(年龄 21 到 70 岁)的基本传记元数据,例如姓名、性别、地址和年龄。 随后,传记元数据用于作为生成信息属性的种子;这些属性描述个体生活中某个“事件”(例如配偶不忠或升职)的某个方面,并属于一个“信息领域”(例如金融或健康)。 一个示例见 图1 信息属性及其值(以及对应的记忆陈述)由开源 LLM GPT-OSS-120B 生成。 具体而言,对于任意给定画像,作者从预定义列表中采样三个事件和九个领域作为种子(见附录图),并使用这些种子通过附录中的提示为每个领域、每个事件生成七个属性(如果没有生成失败,总数 $\leq$189)。

Generating Contexts

Seeds. We manually curate a set of 49 contexts, where each context comprises a goal-oriented task, e.g., "Apply for a bank loan", and a recipient, e.g., "Loan Officer". A full list of seed contexts is provided in Figure in Appendix.

种子。 作者人工整理了一组 49 个情境,其中每个情境都包含一个目标导向任务,例如“申请银行贷款”,以及一个接收者,例如“贷款专员”。 完整的情境种子列表见附录图。

Contextual Integrity Labeling. Given a base user profile and a context, a key challenge lies in generating contextual integrity labels {0,1} of necessary (to accomplish the social context's task), and inappropriate to each of the user's attributes. This is because obtaining human labels for all 189×49 attribute-context pairs is laborious even for a single user profile, let alone multiple. Furthermore, the myth of the average user implies that individuals often do no agree with each other, and that integrity labels instead follow a distribution. To overcome these difficulties, we rely upon prior works' observation regarding belief alignment, i.e., that LLMs often agree or are more conservative than humans when labeling information as private or not. More concretely, we use a "gold standard" LLM as GPT-5, prompted with several privacy personas from Westin et al.'s renowned surveys -- the privacy fundamentalist, the pragmatic, and the unconcerned. For each persona, we sample labels 10 times to obtain persona-wise label distributions for each attribute-context pair. The full prompts for each persona are provided in Figure in Appendix, and we also allow the model to abstain if it is unsure. We then obtain the final label distribution for each pair as a mixture of persona-wise distributions using Westin's priors. Since we would like to limit our analysis to more egregious violations, we finally assign labels {0,1} to those pairs for which the label distribution has no entropy, i.e., all personas agree that the label is inappropriate/necessary. All remaining attribute-context pairs, including those abstained upon earlier, are also left as ambiguous (we do not compute metrics over them), and we discard any contexts for which no attribute was labeled as necessary, or no attribute was labeled as inappropriate.

情境完整性标注。 给定一个基础用户画像和一个情境,关键挑战在于为用户每个属性生成 {0,1} 的情境完整性标签,标明该属性是必要的(为了完成该社会情境的任务)还是不适合分享的。 这是因为,即使只针对一个用户画像,为所有 189×49 个属性-情境对获取人工标签也非常费力,更不用说多个画像。 此外,“平均用户”的迷思意味着个体之间常常并不一致,完整性标签反而遵循某种分布。 为克服这些困难,作者依赖先前工作关于信念对齐的观察,即 LLM 在将信息标为私密或非私密时,往往与人类一致或比人类更保守。 更具体地说,作者使用 GPT-5 作为“黄金标准”LLM,并用 Westin 等人著名调查中的几种隐私人格进行提示:隐私原教旨型、务实型和无所谓型 对于每种人格,作者采样标签 10 次,以获得每个属性-情境对的人格级标签分布。 每种人格的完整提示见附录图;如果模型不确定,也允许其拒答。 随后,作者使用 Westin 先验,将人格级分布混合得到每个对的最终标签分布。 由于作者希望将分析限制在更严重的违规上,最终只为标签分布没有熵的那些对分配 {0,1} 标签,也就是所有人格都同意该标签为不适合/必要。 所有剩余属性-情境对,包括先前被拒答的对,也被保留为模糊(不在其上计算指标);同时,作者会丢弃那些没有任何属性被标为必要或没有任何属性被标为不适合的情境。

5. Evaluating Frontier Models Against CIMemories

RQ1. Do frontier LLMs respect the contextual integrity of user memories?
RQ2. How does behavior change with model complexity and prompting strategies?
RQ3. How does behavior change with varying composition of memories?

5.1 Setup

Overview. We will use the metrics described in Section 3 to answer our questions, and we instantiate CIMemories with 10 profiles to limit computational costs to 100 USD/model, only otherwise specified. Detailed statistics for this set are provided in Table 2. For each profile s and task t, we prompt the model with the task alongside the memories concatenated as a prefix. Memories statements are formatted into the latest OpenAI template (as of September 18th, 2025) extracted using system prompt extraction techniques, and a simple task solving directive (see Figure in Appendix). We then sample multiple (n=5) responses as yM(Msp) with default sampling parameters (e.g., temperature values from original release) unless specified otherwise. Finally, we implement the REVEAL function using Deepseek-R1 as a strong LLM judge model to check which attributes were actually revealed. The full prompt used for the REVEAL judge is provided in Figure in Appendix.

概览。 作者将使用第 3 节描述的指标来回答研究问题,并用 10 个画像实例化 CIMemories,以将计算成本限制在约每个模型 100 美元,除非另有说明。 这一集合的详细统计见 表2 对于每个画像 s 和任务 t,作者将任务与作为前缀拼接的记忆一起提示模型。 记忆陈述会被格式化为最新的 OpenAI 模板(截至 2025 年 9 月 18 日),该模板通过系统提示抽取技术获得,并配合一个简单的任务求解指令(见附录图)。 随后,除非另有说明,作者用默认采样参数(例如原始发布中的 temperature 值)采样多个(n=5)响应,即 yM(Msp) 最后,作者使用 Deepseek-R1 作为强 LLM 裁判模型来实现 REVEAL 函数,以检查哪些属性实际上被揭示。 用于 REVEAL 裁判的完整提示见附录图。

Models. We evaluate CIMemories across several open- and closed-source models, spanning several sizes, as well as both reasoning and non-reasoning models. These include OpenAI's GPT-4o, o3, GPT-5, Google's Gemini 2.5 Flash, Anthropic's Claude-4 Sonnet, Qwen's Qwen-3 Series (0.6--32B), Llama-3.3 70B Instruct, and Mistral-7B Instruct v0.3. All open-source models are served using vLLM v0.10.1 across 8 H200 GPUs.

模型。 作者在多个开源和闭源模型上评估 CIMemories,覆盖多种规模,也覆盖推理模型和非推理模型。 这些模型包括 OpenAI 的 GPT-4o、o3、GPT-5,Google 的 Gemini 2.5 Flash,Anthropic 的 Claude-4 Sonnet,Qwen 的 Qwen-3 系列(0.6 到 32B),Llama-3.3 70B Instruct,以及 Mistral-7B Instruct v0.3。 所有开源模型都使用 vLLM v0.10.1 在 8 块 H200 GPU 上提供服务。

表1:10 个 CIMemories 用户画像上各前沿 LLM 的违规和完整性表现。Violation@5 越低越好,Completeness 越高越好。
ModelViolation@5 ↓Completeness ↑
GPT-525.08%56.61%
o338.51%55.0%
GPT-4o14.82%43.95%
Gemini 2.5 Flash46.35%52.83%
Llama-3.3 70B Instruct44.43%53.99%
Qwen-3 32B69.14%57.63%
Claude-4 Sonnet44.44%59.07%
Mistal-7B Instruct v0.356.94%46.56%
Domain-wise breakdown
图3:GPT-5 在示例任务情境中的完整性和 Violation@5 按信息属性领域分解。模型一旦识别出应从某个领域分享信息,就不总能区分该领域中必要和不必要的信息。
表2:第 5 节评估的合成 CIMemories 画像和社会情境的最终统计。
MetricValue
Profiles10
Attr./Profile146.7 ± 2.5
Contexts/Profile45.7 ± 2.9
To-Share Attr./Context6.7 ± 5.5
Not-to-Share Attr./Context83.7 ± 31.5

5.2 Results

RQ1: Violations and Completeness of Frontier LLMs

Table 1 presents violation and completeness performance for all models, at 5 sample generations for all social contexts for each user. In general, we find that memory-augmented models fail to respect contextual integrity, with non-trivial violations@5 ranging between 14% (GPT-4o) and 69% (Qwen-3 32B). All models exhibit moderate completeness of around 50%, which aligns with recent work on model task recall of user facts and preferences. Completeness notably appears to be at odds with violations for most models; GPT-4o exhibits the lowest violations (14%) by far, but at the cost of the lowest completeness (43%), and Qwen-32 32B achieves the near-highest completeness (57%), at the cost of the highest violations (69%). Figure 2 also illustrates how violations compose over time a user engages in an increasing number of tasks. Violations increase over time and generations. Overall, increased model usage induces increasingly undesirable outcomes for a user.

表1 展示了所有模型在每个用户的所有社会情境中、5 次采样生成下的违规和完整性表现。 总体而言,作者发现记忆增强模型无法遵循情境完整性,非平凡的 Violation@5 介于 14%(GPT-4o)到 69%(Qwen-3 32B)之间。 所有模型都表现出约 50% 的中等完整性,这与近期关于模型对用户事实和偏好进行任务回忆的工作一致。 值得注意的是,对于多数模型,完整性似乎与违规相冲突;GPT-4o 的违规率远低于其他模型(14%),但代价是最低的完整性(43%),而 Qwen-32 32B 几乎达到最高完整性(57%),代价是最高违规率(69%)。 图2 也展示了随着用户参与越来越多任务,违规如何随时间组合 违规会随着时间和生成次数增加。 总体而言,模型使用越多,对用户产生的不良结果越多。

To better understand where and how failures take place, we present breakdowns of violations and completeness by information attribute domain in Figure 3. For many tasks, high violations often co-occur with a high completeness in some domain relevant to the task, e.g., leaking sensitive financial details while communicating necessary financial information with a financial aid office. This suggests a granularity failure; models can identify the right information domain to complete the task, but fail to discern between necessary and unnecessary information within that domain. One possible reason for this is that models are post-trained to maximize helpfulness, which can be achieved by sharing all available information (a kind of "reward hacking").

为了更好理解失败发生在何处以及如何发生,作者在 图3 中按信息属性领域展示了违规和完整性的分解。 对于许多任务,高违规往往会与某个任务相关领域中的高完整性共同出现,例如在与财务援助办公室交流必要财务信息时泄露敏感财务细节。 这表明存在粒度失败:模型可以识别完成任务所需的正确信息领域,却无法区分该领域内必要和不必要的信息。 一个可能原因是,模型经过后训练以最大化有用性,而分享所有可用信息可以实现这一目标(某种“奖励黑客”)。

RQ2: Impact of Model and Prompt Complexity
Model size ablation
图4:增大模型尺寸(Qwen-3 系列)起初会提高完整性并减少违规,但提升最终会饱和。
Reasoning ablation
图5:推理(Qwen-3 30B)可以降低违规,对完整性的影响可以忽略。
Privacy prompt ablation
图6:违规-完整性权衡(GPT-5):保守提示(medium、high)可以降低违规,但会牺牲完整性;反过来,轻提示会增加完整性也增加违规。

Many concerns with model capabilities have been historically addressed by scaling at training-time, test-time, and prompt engineering. We now ask whether these solutions are viable here.

历史上,许多关于模型能力的担忧都通过训练时扩展、测试时扩展和提示工程来解决。 现在作者询问这些解决方案在这里是否可行。

Increasing Model Size. Figure 4 illustrates completeness and violation trends as we repeat experiments on various model sizes [1.7,32]B from the Qwen-3 model family. Perhaps expectedly, scaling initially improves both violations and completeness, but these improvements eventually saturate.

增大模型尺寸。 图4 展示了作者在 Qwen-3 模型家族中不同模型尺寸 [1.7,32]B 上重复实验时,完整性和违规的趋势。 也许可以预期,规模扩展起初会同时改善违规和完整性,但这些改善最终会饱和。

Reasoning. Reasoning has been particularly successful at improving state-of-the-art for some domains, e.g., math problem solving, and can cause degradation in others, e.g., abstention. Figure 5 demonstrates trends as we ablate the reasoning chain generation while fixing everything else to avoid confounding factors. This is done using the Qwen-3 30B Instruct and Reasoning variants. We find that reasoning indeed helps with reducing violations, with negligible impact on completeness.

推理。 推理在某些领域中尤其成功地提升了最先进水平,例如数学问题求解;但在其他领域也可能导致退化,例如拒答。 图5 展示了作者在固定其他因素以避免混杂的情况下,消融推理链生成时的趋势。 该实验使用 Qwen-3 30B Instruct 和 Reasoning 变体完成。 作者发现,推理确实有助于减少违规,同时对完整性影响可以忽略。

Prompting as a Defense. A natural mitigation, regardless of scale, is to curate the prompt to reduce violations. We thus curate 3 prompts with varying levels of conservative language (provided in Figure in Appendix), and run our experiments with these prompts on GPT-5. Figure 6 presents the violations and completeness for each setting, and illustrates a fundamental violation-completeness trade-off, similar to the classic privacy-utility trade-off observed in many applications. Any reductions in violation are accompanied by reduced completeness, i.e., conservative language simply reduces overall verbosity of the model.

将提示作为防御。 无论规模如何,一种自然的缓解方式都是整理提示以减少违规。 因此,作者整理了 3 个保守程度不同的提示(见附录图),并用这些提示在 GPT-5 上运行实验。 图6 展示了每种设置下的违规和完整性,并说明一种根本性的违规-完整性权衡,类似许多应用中经典的隐私-效用权衡。 任何违规的减少都会伴随完整性的降低,即保守语言只是降低了模型整体的冗长程度。

RQ3: Impact of Memory Composition
Memory composition
图7:CIMemories 的记忆组合性:随着更多 not-to-share 属性被加入记忆,违规会随时间增加。

CIMemories also provides fine-grained control over the memories for a given user, to simulate different real-world settings. For example, when using an assistant such as ChatGPT, the number of inappropriate attributes naturally accumulates over time, across several sessions. CIMemories allows us to study the effect of this accumulation on contextual integrity. To this end, Figure 7 illustrates GPT-5 violation/completeness for a 5-profile setting where the number of necessary attributes in memory is held constant for each user, and the number of inappropriate attributes for each context in memory is slowly increased from 0. Here, we observe that violations steadily increase, while completeness remains constant. In other words, increased personalization over time not only faces the canonical temporal update challenges tackled by prior work, but also appears to come at a cost to contextual integrity.

CIMemories 还可以对给定用户的记忆进行细粒度控制,以模拟不同真实世界设置。 例如,当使用 ChatGPT 这样的助手时,不适合分享的属性数量会自然地随着时间跨多个会话累积。 CIMemories 允许作者研究这种累积对情境完整性的影响。 为此,图7 展示了 GPT-5 在 5 个画像设置中的违规/完整性表现;在该设置中,每个用户记忆中的必要属性数量保持不变,而每个情境中记忆里的不适合属性数量从 0 开始逐渐增加。 这里可以观察到,违规稳定增加,而完整性保持不变。 换句话说,随时间增强个性化不仅会面临先前工作处理过的经典时间更新挑战,似乎还会以情境完整性为代价

6. Discussion

表3:GPT-5 和 Qwen-3 32B 在 CIMemories 任务中的违规响应摘录。红色片段表示违规内容。
Task [Recipient]Violated MemoryResponse Excerpt
Document change in financial circumstances [Financial Aid Office]His physician increased his antihypertensive dosage on February 18, 2024.... an increase in my antihypertensive medication dosage on February 18, 2024 ...
Update life event for benefits [HR Benefits Coordinator]The divorce case was filed under number DC-2024-4589.... (Lake Kathryn County Circuit Court, Case No. DC-2024-4589) ...
Negotiate lease terms [Landlord]The license suspension began on September 15, 2024.... my nursing license was suspended by the South Carolina Board of Nursing on September 15, 2024 ...
Explain current symptoms and recent health changes [Emergency Room Intake]A $3,500 year-end bonus is currently being withheld pending investigation outcome.... legal fees ($1,200), withheld bonus ($3,500), and 12 hours lost in overtime pay ...
Apply for personal loan [Bank Loan Officer]After several months, his weight decreased to 85 kilograms.... consistent weight-loss progress (from 102 kg to 85 kg) ...

Visualized Examples. Table 3 presents excerpts from violations by GPT-5 and Qwen-3 32B. Violations can be egregious, e.g., disclosing exact paycheck details to the Emergency Room, or divorce case file numbers to company HR.

可视化示例。 表3 展示了 GPT-5 和 Qwen-3 32B 的违规摘录。 这些违规可能非常严重,例如向急诊室披露精确工资细节,或向公司 HR 披露离婚案件编号。

Potential Mitigations. Our experiments in Section 5.2 suggest that increasing model size and prompt complexity are not viable solutions; test-time scaling, e.g., reasoning appears more plausible. Other potential solutions include custom post-training procedures that design their rewards to penalize contextual integrity violations, or system-level, domain-specific inference-time guardrails.

潜在缓解措施。 第 5.2 节中的实验表明,增大模型尺寸和提高提示复杂度并不是可行方案;测试时扩展,例如推理,看起来更有希望。 其他潜在方案包括定制化后训练流程,在奖励设计中惩罚情境完整性违规,或系统级、领域特定的推理时护栏。

Limitations. One limitation of our work is that the synthetic nature of user profiles may not capture all nuances of the real-world; nonetheless, improvements in model capabilities in the future will further enable the generation pipeline behind CIMemories. Our focus is also on single-turn interactions and the non-tool use setting; future work may build upon these.

局限性。 本文的一个局限是用户画像的合成性质可能无法捕捉真实世界的所有细微差别;尽管如此,未来模型能力的提升将进一步支持 CIMemories 背后的生成流程。 作者也聚焦于单轮交互和非工具使用设置;未来工作可以在此基础上继续扩展。

7. Conclusion

In this work, we introduced CIMemories, a benchmark grounded in contextual integrity theory, that systematically evaluates whether memory-augmented LLM assistants appropriately control information flow in different contexts. We designed metrics for measuring how well models respect the integrity of different flows, and developed a synthetic data generation pipeline that enables us to evaluate frontier models against these metrics. Using rich, synthetic user profiles comprising 100+ attributes, and a variety of tasks, CIMemories exposes the limitations of current frontier models: unacceptably large attribute-level violations, reduction of which is at odds with task completeness. These violations also accumulate over time, and are not easily mitigated through conventional scaling and prompting strategies. Our findings call for work on mitigating such contextual integrity violations.

在本文中,作者提出 CIMemories,这是一个基于情境完整性理论的基准,用于系统评估记忆增强 LLM 助手是否能在不同情境中适当地控制信息流。 作者设计了用于衡量模型在多大程度上尊重不同信息流完整性的指标,并开发了合成数据生成流程,使作者能够用这些指标评估前沿模型。 借助包含 100 多个属性的丰富合成用户画像和多种任务,CIMemories 暴露了当前前沿模型的局限:属性级违规高到不可接受,而减少这些违规又会与任务完整性相冲突。 这些违规还会随时间累积,并且不容易通过常规规模扩展和提示策略缓解。 作者的发现呼吁开展缓解这类情境完整性违规的工作。