Paper title
Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams
Paper authors
Paper abstract
Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics built on a novel combination of two ideas: a "sketch of sketches", which avoids the overhead of monitoring exponentially many subpopulations, and universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while ensuring interactive estimation times.
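The abstract does not spell out Hydra's data structures, so the following is only a toy illustration of the general "sketch of sketches" idea, not the authors' implementation. It assumes an outer hash-indexed grid whose cells each hold a small inner sketch, so that memory stays fixed no matter how many subpopulations appear. The inner sketch here is a plain count-min sketch standing in for a universal sketch, and all names (`CountMin`, `SketchOfSketches`, `subpopulations`) and sizes are hypothetical.

```python
import hashlib
from itertools import combinations


def _h(key: str, seed: int, width: int) -> int:
    """Deterministic seeded hash of a string into [0, width)."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % width


class CountMin:
    """Inner sketch: approximate counts of metric values (never underestimates)."""

    def __init__(self, rows: int = 3, width: int = 32):
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]

    def add(self, item: str, count: int = 1) -> None:
        for r in range(self.rows):
            self.table[r][_h(item, r, self.width)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[r][_h(item, r, self.width)] for r in range(self.rows))


class SketchOfSketches:
    """Outer grid of inner sketches; subpopulations are hashed to cells."""

    def __init__(self, rows: int = 3, width: int = 128):
        self.rows, self.width = rows, width
        self.cells = [[CountMin() for _ in range(width)] for _ in range(rows)]

    @staticmethod
    def _key(pairs) -> str:
        return "|".join(f"{d}={v}" for d, v in sorted(pairs))

    @classmethod
    def subpopulations(cls, record: dict):
        """Yield a key for every non-empty subset of the record's dimensions."""
        items = sorted(record.items())
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                yield cls._key(combo)

    def add(self, record: dict, metric: str) -> None:
        # One update per subpopulation the record belongs to, per outer row.
        for key in self.subpopulations(record):
            for r in range(self.rows):
                self.cells[r][_h(key, 100 + r, self.width)].add(metric)

    def estimate(self, subpop: dict, metric: str) -> int:
        # Colliding subpopulations can only inflate counts, so take the minimum.
        key = self._key(subpop.items())
        return min(
            self.cells[r][_h(key, 100 + r, self.width)].estimate(metric)
            for r in range(self.rows)
        )


if __name__ == "__main__":
    sketch = SketchOfSketches()
    sketch.add({"city": "NYC", "isp": "A"}, "buffering")
    sketch.add({"city": "NYC", "isp": "B"}, "buffering")
    sketch.add({"city": "LA", "isp": "A"}, "ok")
    print(sketch.estimate({"city": "NYC"}, "buffering"))
```

In this toy version each record still triggers one update per subpopulation it belongs to (2^D - 1 for D dimensions); what stays bounded is the memory, since no per-subpopulation state is kept. Estimates can overshoot the true count when subpopulations or metric values collide, but they never undershoot it.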