Paper Title

Privacy-Preserving Reinforcement Learning Beyond Expectation

Authors

Arezoo Rajabi, Bhaskar Ramasubramanian, Abdullah Al Maruf, Radha Poovendran

Abstract

Cyber and cyber-physical systems equipped with machine learning algorithms, such as autonomous cars, share environments with humans. In such a setting, it is important to align system (or agent) behaviors with the preferences of one or more human users. We consider the case when an agent has to learn behaviors in an unknown environment. Our goal is to capture two defining characteristics of humans: i) a tendency to assess and quantify risk, and ii) a desire to keep decision making hidden from external parties. We incorporate cumulative prospect theory (CPT) into the objective of a reinforcement learning (RL) problem for the former. For the latter, we use differential privacy. We design an algorithm to enable an RL agent to learn policies to maximize a CPT-based objective in a privacy-preserving manner and establish guarantees on the privacy of value functions learned by the algorithm when rewards are sufficiently close. This is accomplished by adding calibrated noise using a Gaussian process mechanism at each step. Through empirical evaluations, we highlight a privacy-utility tradeoff and demonstrate that the RL agent is able to learn behaviors that are aligned with those of a human user in the same environment in a privacy-preserving manner.
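
To make the two ingredients of the abstract concrete, the following is a minimal, illustrative sketch: it estimates a CPT-style value of sampled returns using a probability-weighting function, then releases the estimate with calibrated Gaussian noise. This is not the paper's algorithm; the Tversky-Kahneman weighting exponent, the sensitivity bound, the (epsilon, delta) parameters, and the helper names (weight, cpt_estimate, privatize) are assumptions for demonstration, and the plain Gaussian mechanism is used here as a stand-in for the paper's Gaussian process mechanism.

```python
# Illustrative sketch only: CPT-style value estimation + Gaussian-mechanism noise.
# Parameters (eta, sensitivity, epsilon, delta) are placeholders, not from the paper.
import numpy as np

def weight(p, eta=0.88):
    # Tversky-Kahneman probability-weighting function; eta is an assumed value.
    return p**eta / (p**eta + (1 - p)**eta) ** (1 / eta)

def cpt_estimate(returns, eta=0.88):
    # Quantile-based estimate of the CPT value for non-negative (gains-only) returns:
    # sort the samples and reweight tail probabilities with weight(.).
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    # Weighted sum of order statistics: x_(i) * [w((n-i+1)/n) - w((n-i)/n)].
    probs_hi = weight((n - np.arange(n)) / n, eta)
    probs_lo = weight((n - np.arange(n) - 1) / n, eta)
    return float(np.sum(x * (probs_hi - probs_lo)))

def privatize(value, sensitivity, epsilon, delta, rng):
    # Standard Gaussian mechanism: noise scaled to an assumed sensitivity so the
    # released value satisfies (epsilon, delta)-differential privacy.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=0.5, size=1000).clip(min=0.0)  # toy gains
v = cpt_estimate(returns)
v_private = privatize(v, sensitivity=0.1, epsilon=1.0, delta=1e-5, rng=rng)
print(f"CPT estimate: {v:.3f}, privatized: {v_private:.3f}")
```

Larger noise (smaller epsilon) gives stronger privacy but a less accurate value estimate, which is the privacy-utility tradeoff the abstract refers to.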
