Paper Title
Benchmarking Resource Usage of Underlying Datatypes of Apache Spark
Paper Authors
Abstract
The purpose of this paper is to examine how the resource usage of an analytic is affected by the different underlying datatypes of Spark analytics: Resilient Distributed Datasets (RDDs), Datasets, and DataFrames. Resource usage is explored as a viable, and preferable, alternative to the common practice of benchmarking big data analytics by execution time. The run time of an analytic is shown not to be a reproducible metric, since many factors external to the job can affect execution time. Instead, metrics readily available through Spark, including peak execution memory, are used to benchmark the resource usage of these datatypes across common Spark workloads such as counting, caching, repartitioning, and KMeans.
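The abstract's central claim, that wall-clock time is noisy while a resource metric like peak memory is reproducible for the same workload, can be illustrated without a Spark cluster. The following is a minimal, Spark-free sketch using Python's standard-library `tracemalloc`; the workload function is a hypothetical stand-in for a Spark job, not code from the paper.

```python
import time
import tracemalloc


def count_workload(n: int) -> int:
    """Hypothetical stand-in for a Spark count-style job:
    materializes n records, then counts them."""
    data = list(range(n))
    return sum(1 for _ in data)


def measure(n: int) -> tuple[float, int]:
    """Return (wall-clock seconds, peak bytes allocated) for one run."""
    tracemalloc.start()
    t0 = time.perf_counter()
    count_workload(n)
    elapsed = time.perf_counter() - t0
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return elapsed, peak


count_workload(1_000)  # warm-up so interpreter caches don't skew run 1

time1, peak1 = measure(100_000)
time2, peak2 = measure(100_000)

# Wall-clock time drifts run to run (scheduler, caches, load),
# while peak allocation for an identical workload stays stable --
# the same contrast the paper draws with Spark's peakExecutionMemory.
print(f"run 1: {time1:.4f}s, peak {peak1} bytes")
print(f"run 2: {time2:.4f}s, peak {peak2} bytes")
```

In Spark itself, the analogous figure is exposed per task as `peakExecutionMemory` in the task metrics available through the Spark UI and its REST API.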