On the Use of Interpretable Machine Learning for the Management of Data Quality

Paper title

On the Use of Interpretable Machine Learning for the Management of Data Quality

Authors

Anna Karanika, Panagiotis Oikonomou, Kostas Kolomvatsos, Christos Anagnostopoulos

Abstract

Data quality is a significant issue for any application that relies on analytics to support decision making. It becomes especially important in the Internet of Things (IoT), where numerous devices interact to exchange and process data. IoT devices are connected to Edge Computing (EC) nodes to report the collected data; thus, data quality has to be secured not only at the IoT devices but also at the edge of the network. In this paper, we focus on this specific problem and propose the use of interpretable machine learning to deliver the features on which any data processing activity should be based. Our aim is to secure data quality at least for those features that are detected as significant in the collected datasets. Note that the selected features exhibit the highest correlation with the remaining features in every dataset; thus, they can be adopted for dimensionality reduction. We focus on multiple methodologies for achieving interpretability in our learning models and adopt an ensemble scheme for the final decision. Our scheme is capable of retrieving the final result in a timely manner and efficiently selecting the appropriate features. We evaluate our model through extensive simulations and present numerical results. Our aim is to reveal its performance under various experimental scenarios that we create by varying a set of parameters adopted in our mechanism.
