Paper Title

Noise-Aware Statistical Inference with Differentially Private Synthetic Data

Authors

Ossi Räisä, Joonas Jälkö, Samuel Kaski, Antti Honkela

Abstract

While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI) with synthetic data generation using noise-aware (NA) Bayesian modeling into a pipeline, NA+MI, that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation using the values of marginal queries, we develop a novel noise-aware synthetic data generation algorithm, NAPSU-MQ, using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.
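
The MI step of such a pipeline can be illustrated with a minimal sketch. The abstract does not say which combining rules are used, so this sketch assumes the rules commonly cited for fully synthetic data (Raghunathan, Reiter and Rubin, 2003): the same analysis is run on each of m synthetic datasets, and the per-dataset point estimates and variance estimates are pooled. The function names `combine_synthetic_estimates` and `normal_interval` are hypothetical, and the normal-approximation interval is a simplification rather than the paper's procedure.

```python
from statistics import NormalDist, mean, variance


def combine_synthetic_estimates(point_estimates, variance_estimates):
    """Pool per-dataset results from m synthetic datasets.

    Assumed rule (fully synthetic data): T = (1 + 1/m) * b - v_bar,
    where b is the between-dataset variance of the point estimates
    and v_bar is the mean of the per-dataset variance estimates.
    """
    m = len(point_estimates)
    if m < 2 or len(variance_estimates) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")
    q_bar = mean(point_estimates)       # pooled point estimate
    b = variance(point_estimates)       # sample variance, divisor m - 1
    v_bar = mean(variance_estimates)    # average within-dataset variance
    total_var = (1 + 1 / m) * b - v_bar
    return q_bar, total_var


def normal_interval(q_bar, total_var, level=0.95):
    """Normal-approximation confidence interval (a simplification)."""
    if total_var <= 0:
        # The assumed variance estimate can be non-positive for small m;
        # this sketch refuses instead of applying a correction.
        raise ValueError("non-positive variance estimate")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * total_var ** 0.5
    return q_bar - half_width, q_bar + half_width


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    q_bar, total_var = combine_synthetic_estimates([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
    print(q_bar, total_var, normal_interval(q_bar, total_var))
```

The between-dataset term `b` is what lets the pooled interval widen when the synthetic datasets disagree with each other. In a noise-aware generator that disagreement would also reflect the uncertainty from DP noise, which analysing a single synthetic dataset as if it were real cannot capture.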
