Paper Title

Can pre-trained Transformers be used in detecting complex sensitive sentences? -- A Monsanto case study

Paper Authors

Roelien C. Timmer, David Liebowitz, Surya Nepal, Salil S. Kanhere

Paper Abstract

Every organisation releases information in a variety of forms, ranging from annual reports to legal proceedings. Such documents may contain sensitive information, and releasing them openly may lead to the leakage of confidential information. Detecting sentences that contain sensitive information in documents can help organisations prevent the leakage of valuable confidential information. This is especially challenging when such sentences contain a substantial amount of information or are paraphrased versions of known sensitive content. Current approaches to sensitive information detection in such complex settings rely on keyword matching or standard machine learning models. In this paper, we explore whether pre-trained transformer models are well suited to detecting complex sensitive information. Pre-trained transformers are typically trained on an enormous amount of text and therefore readily learn grammar, structure and other linguistic features, making them particularly attractive for this task. Through our experiments on the Monsanto trial dataset, we observe that the fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model performs better than traditional models. We experimented with four different categories of documents in the Monsanto dataset and observed that BERT achieves better F2 scores by 24.13% to 65.79% for GHOST, 30.14% to 54.88% for TOXIC, 39.22% for CHEMI, and 53.57% for REGUL compared to existing sensitive information detection models.
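
The abstract describes fine-tuning BERT as a binary sentence classifier and evaluating it with the F2 score (an F-beta measure with beta=2, which weights recall over precision). Below is a minimal sketch of that setup, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the toy sentences, labels, and hyperparameters are invented for illustration and are not the paper's data or configuration.

```python
# A minimal sketch, NOT the authors' pipeline: fine-tuning BERT for binary
# sensitive-sentence classification. Checkpoint, hyperparameters, and the
# toy sentences below are illustrative assumptions.
import torch
from sklearn.metrics import fbeta_score
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 2 classes: sensitive vs. non-sensitive
)

# Hypothetical labelled sentences (1 = sensitive, 0 = non-sensitive).
sentences = [
    "The internal study results must not be shared outside the company.",
    "The quarterly meeting is scheduled for Tuesday at noon.",
]
labels = torch.tensor([1, 0])

batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

# A few gradient steps on the toy batch, standing in for full fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluate with the F2 score, the recall-weighted metric the abstract reports.
model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print("F2 score:", fbeta_score(labels.numpy(), preds.numpy(), beta=2))
```

In a real run, the toy batch would be replaced by a labelled training split served through a DataLoader, with F2 computed on a held-out test split; the F2 choice reflects that missing a sensitive sentence (a false negative) is costlier here than flagging a harmless one.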
