Paper Title
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Paper Authors
Paper Abstract
Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large-scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, which can potentially become a performance bottleneck. TensorFlow, one of the most popular Deep Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend the TensorFlow Profiler and introduce tf-Darshan, both a profiler and a tracer, which performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow Profiler, and we visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan through two case studies: ImageNet image classification and malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast-tier storage. We also show that Darshan has the potential to be used as a runtime library for profiling and for providing information for future optimization.