Title
Reinforced Approximate Exploratory Data Analysis
Authors
Abstract
Exploratory data analysis (EDA) is a sequential decision-making process in which analysts choose subsequent queries, based on previous queries and their results, that might lead to interesting insights. Data processing systems often execute queries on samples to produce results with low latency. Different downsampling strategies preserve different statistics of the data and yield different magnitudes of latency reduction. The optimal choice of sampling strategy often depends on the particular context of the analysis flow and the hidden intent of the analyst. In this paper, we are the first to consider the impact of sampling in interactive data exploration settings, where it introduces approximation errors. We propose a Deep Reinforcement Learning (DRL) based framework that optimizes sample selection so as to keep the analysis and insight-generation flow intact. Evaluations on 3 real datasets show that, compared to baseline methods, our technique preserves the original insight-generation flow while improving interaction latency.
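The abstract's framing of per-query sampler selection as a sequential decision problem can be illustrated with a toy sketch. The strategies, their latency/insight-preservation numbers, the reward shape, and the tabular epsilon-greedy agent below are all illustrative assumptions standing in for the paper's actual DRL framework, not the authors' method:

```python
import random

# Hypothetical sampling strategies: (normalized latency, probability that an
# insight computed on the sample matches the full-data result). These numbers
# are illustrative assumptions, not taken from the paper.
STRATEGIES = {
    "full_data":   (1.00, 1.00),
    "uniform_1pc": (0.05, 0.70),
    "stratified":  (0.15, 0.90),
}

def expected_reward(latency, keep_prob):
    # Favor low latency; penalize the expected loss of an insight.
    return (1.0 - latency) - 2.0 * (1.0 - keep_prob)

def train(steps=5, episodes=2000, eps=0.1, seed=0):
    """Tabular epsilon-greedy stand-in for a DRL sample-selection agent.

    State is simply the position in the query sequence; each episode plays
    one analysis session and updates a running-mean Q value per (state,
    strategy) pair.
    """
    rng = random.Random(seed)
    actions = list(STRATEGIES)
    q = {(s, a): 0.0 for s in range(steps) for a in actions}
    n = {(s, a): 0 for s in range(steps) for a in actions}
    for _ in range(episodes):
        for s in range(steps):  # one episode = one query sequence
            if rng.random() < eps:
                a = rng.choice(actions)          # explore
            else:
                a = max(actions, key=lambda x: q[(s, x)])  # exploit
            r = expected_reward(*STRATEGIES[a])
            n[(s, a)] += 1
            q[(s, a)] += (r - q[(s, a)]) / n[(s, a)]  # running mean
    # Greedy policy: best sampling strategy for each query position.
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in range(steps)}

if __name__ == "__main__":
    print(train())
```

Under these made-up trade-offs the agent settles on the stratified sampler at every step (expected rewards: 0.0 for full data, 0.35 for uniform, 0.65 for stratified); the point is only the mechanism, i.e. learning which sampler best balances latency against preserving the insight flow.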