论文标题
快速近似聚集,并具有分布敏感的间隔保证
Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees
论文作者
论文摘要
汇总数据是数据分析,数据探索和OLAP的基础。近似查询处理(AQP)技术通常用于使用样品加速骨料的计算,该样品广泛使用置信区间(CIS)来量化相关误差。实践中使用的CI分为两类:紧密但不正确的技术,即它们产生紧密的间隔,但仅提供渐近保证,使其不可靠或正确但不正确但不紧密的技术,即它们提供了严格的保证,但是它们过于保守,导致了置信区间,导致置信区间太松了,无法有用。在本文中,我们开发了一种比传统方法既正确又更紧密的CI技术。从保守的独联体开始,我们发现了他们经常面临的两个问题:悲观的质量分配(PMA)和幻影异常值敏感性(PHOS)。通过开发一种新颖的范围修整技术,用于消除PHO并将其与无PMA的已知CI技术配对,我们开发了一种计算具有强大保证的CIS的技术,该技术需要更少的样品才能获得相同的宽度。我们在采样优化的内存列存储下实施了我们的技术,并展示了如何加速在真实数据集中涉及聚集体的查询,而在传统的aqp-with-with-with-withEsees上的加速度高达124倍,而超过1000 x的查询则超过1000x。
Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how to accelerate queries involving aggregates on a real dataset with speedups of up to 124x over traditional AQP-with-guarantees and more than 1000x over exact methods.