Textwash -- automated open-source text anonymisation

Paper title

Textwash -- automated open-source text anonymisation

Authors

Kleinberg, Bennett; Davies, Toby; Mozes, Maximilian

Abstract

The increased use of text data in social science research has benefited from easy-to-access data (e.g., Twitter). That trend comes at the cost of research requiring sensitive but hard-to-share data (e.g., interview data, police reports, electronic health records). We introduce a solution to that stalemate with the open-source text anonymisation software _Textwash_. This paper presents the empirical evaluation of the tool using the TILD criteria: a technical evaluation (how accurate is the tool?), an information loss evaluation (how much information is lost in the anonymisation process?) and a de-anonymisation test (can humans identify individuals from anonymised text data?). The findings suggest that Textwash performs similarly to state-of-the-art entity recognition models and introduces a negligible information loss of 0.84%. For the de-anonymisation test, we tasked humans with identifying individuals by name from a dataset of crowdsourced person descriptions of very famous, semi-famous and non-existent individuals. The de-anonymisation rate ranged from 1.01% to 2.01% for the realistic use cases of the tool. We replicated the findings in a second study and conclude that Textwash succeeds in removing potentially sensitive information, rendering detailed person descriptions practically anonymous.


Paper title