Paper title
On the Energy Consumption of Different Dataframe Processing Libraries -- An Exploratory Study
Paper authors
Paper abstract
Background: The energy consumption of machine learning and its impact on the environment have made energy-efficient ML an emerging area of research. However, most of the attention remains focused on model creation and on the training and inference phases. Data-oriented stages such as preprocessing, cleaning, and exploratory analysis form a critical part of the machine learning workflow, yet the energy efficiency of these stages has received little attention from researchers. Aim: Our study aims to explore the energy consumption of different dataframe processing libraries as a first step towards studying the energy efficiency of the data-oriented stages of the machine learning pipeline. Method: We measure the energy consumption of 3 popular libraries used to work with dataframes, namely Pandas, Vaex, and Dask, for 21 different operations grouped under 4 categories on 2 datasets. Results: The results of our analysis show that, for a given dataframe processing operation, the choice of library can indeed influence the energy consumption, with some libraries consuming 202 times less energy than others. Conclusion: The results of our study indicate that there is potential for optimizing the energy consumption of the data-oriented stages of the machine learning pipeline and that further research is needed in this direction.