Paper Title
The Enron Corpus: Where the Email Bodies are Buried?
Authors
Abstract
To probe the largest public-domain email database for indicators of fraud, we apply machine learning and accomplish four investigative tasks. First, we identify persons of interest (POI) using financial records and email, and report a peak accuracy of 95.7%. Second, we search for publicly exposed personally identifiable information (PII) and discover 50,000 previously unreported instances. Third, we automatically flag legally responsive emails, as scored by human experts in the California electricity blackout lawsuit, and report a peak accuracy of 99%. Finally, we track three years of primary topics and sentiment across more than 10,000 unique people before, during, and after the onset of the corporate crisis. Where possible, we compare accuracy against execution time for 51 algorithms and report human-interpretable business rules that can scale to vast datasets.