Paper Title
Benchmarking Resource Usage of Underlying Datatypes of Apache Spark
Paper Authors
Abstract
The purpose of this paper is to examine how the resource usage of an analytic is affected by the different underlying datatypes of Spark analytics: Resilient Distributed Datasets (RDDs), Datasets, and DataFrames. Resource usage is explored as a viable, and preferable, alternative to the common practice of benchmarking big data analytics by execution time. The run time of an analytic is shown not to be a reproducible metric, since many factors external to the job can affect execution time. Instead, metrics readily available through Spark, including peak execution memory, are used to benchmark the resource usage of these datatypes across common Spark workloads such as counting, caching, repartitioning, and KMeans.
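The abstract's central claim, that wall-clock time is noisy while a resource metric like peak memory is reproducible for the same workload, can be illustrated without a Spark cluster. The following is a minimal, Spark-free sketch using Python's standard-library `tracemalloc`; the workload function is a hypothetical stand-in for a Spark job, not code from the paper.

```python
import time
import tracemalloc


def count_workload(n: int) -> int:
    """Hypothetical stand-in for a Spark count-style job:
    materializes n records, then counts them."""
    data = list(range(n))
    return sum(1 for _ in data)


def measure(n: int) -> tuple[float, int]:
    """Return (wall-clock seconds, peak bytes allocated) for one run."""
    tracemalloc.start()
    t0 = time.perf_counter()
    count_workload(n)
    elapsed = time.perf_counter() - t0
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return elapsed, peak


count_workload(1_000)  # warm-up so interpreter caches don't skew run 1

time1, peak1 = measure(100_000)
time2, peak2 = measure(100_000)

# Wall-clock time drifts run to run (scheduler, caches, load),
# while peak allocation for an identical workload stays stable --
# the same contrast the paper draws with Spark's peakExecutionMemory.
print(f"run 1: {time1:.4f}s, peak {peak1} bytes")
print(f"run 2: {time2:.4f}s, peak {peak2} bytes")
```

In Spark itself, the analogous figure is exposed per task as `peakExecutionMemory` in the task metrics available through the Spark UI and its REST API.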