High Performance Data Engineering Everywhere

Paper title

High Performance Data Engineering Everywhere

Authors

Widanage, Chathura; Perera, Niranda; Abeykoon, Vibhatha; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Maithree, Hasara; Wickramasinghe, Pulasthi; Uyar, Ahmet; Gunduz, Gurhan; Fox, Geoffrey

Abstract

The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for data pre- and post-processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.
