Paper Title

Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool

Authors

Ha, Sungsoo; Jeong, Wonyong; Matyasfalvi, Gyorgy; Xie, Cong; Huck, Kevin; Choi, Jong Youl; Malik, Abid; Tang, Li; Van Dam, Hubertus; Pouchard, Line; Xu, Wei; Yoo, Shinjae; D'Imperio, Nicholas; Van Dam, Kerstin Kleese

Abstract

Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework, and we demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.
