Batchwise Probabilistic Incremental Data Cleaning

Paper Title

Batchwise Probabilistic Incremental Data Cleaning

Authors

Oliveira, Paulo H., Kaster, Daniel S., Traina-Jr., Caetano, Ilyas, Ihab F.

Abstract

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to answer analytical questions. Hence, there is a need for more effective and efficient data cleaning solutions, which, not surprisingly, is rife with theoretical and engineering problems. This report addresses the problem of performing holistic data cleaning incrementally, given a fixed rule set and an evolving categorical relational dataset acquired in sequential batches. To the best of our knowledge, our contributions compose the first incremental framework that cleans data (i) independently of user interventions, (ii) without requiring knowledge about the incoming dataset, such as the number of classes per attribute, and (iii) holistically, enabling multiple error types to be repaired simultaneously, and thus avoiding conflicting repairs. Extensive experiments show that our approach outperforms the competitors with respect to repair quality, execution time, and memory consumption.
