Paper Title

Textwash -- automated open-source text anonymisation

Authors

Bennett Kleinberg, Toby Davies, Maximilian Mozes

Abstract

The increased use of text data in social science research has benefited from easy-to-access data (e.g., Twitter). That trend comes at the cost of research requiring sensitive but hard-to-share data (e.g., interview data, police reports, electronic health records). We introduce a solution to that stalemate with the open-source text anonymisation software _Textwash_. This paper presents the empirical evaluation of the tool using the TILD criteria: a technical evaluation (how accurate is the tool?), an information loss evaluation (how much information is lost in the anonymisation process?) and a de-anonymisation test (can humans identify individuals from anonymised text data?). The findings suggest that Textwash performs similarly to state-of-the-art entity recognition models and introduces a negligible information loss of 0.84%. For the de-anonymisation test, we tasked humans to identify individuals by name from a dataset of crowdsourced person descriptions of very famous, semi-famous and non-existing individuals. The de-anonymisation rate ranged from 1.01-2.01% for the realistic use cases of the tool. We replicated the findings in a second study and concluded that Textwash succeeds in removing potentially sensitive information that renders detailed person descriptions practically anonymous.
