Paper title

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Authors

Pachajoa, Carlos; Pacher, Christina; Levonyak, Markus; Gansterer, Wilfried N.

Abstract

As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after a node failure, the solver must restart from the last iteration for which redundant information was stored, which increases the recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the modifications necessary to convert ESR into ESRP and perform an experimental evaluation. We compare ESRP experimentally with the previously existing ESR and with application-level in-memory CR. Our results confirm that the overhead of ESR is reduced significantly, both in the failure-free case and when node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
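
The trade-off between storage interval and recovery cost described in the abstract can be illustrated with a small sketch. The code below is not the authors' ESRP; it shows the explicit, application-level in-memory CR baseline that the abstract compares against, applied to a Jacobi-preconditioned CG solver in pure Python. All names (`pcg_with_checkpoints`, `laplacian_1d`, `interval`, `fail_at`) are hypothetical, and the simulated failure simply discards the whole solver state once, whereas a real node failure would lose only the data held on the failed node. ESRP, as the abstract states, would avoid storing full copies and instead reconstruct the state from redundancy inherent in PCG.

```python
def laplacian_1d(n):
    """Dense 1D Laplacian (2 on the diagonal, -1 off-diagonal), an SPD test matrix."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg_with_checkpoints(A, b, interval=4, fail_at=None, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned CG with periodic in-memory checkpoints.

    interval: number of iterations between checkpoints (assumed parameter).
    fail_at:  iteration at which one simulated failure wipes the state.
    Returns (x, iterations_executed, iterations_redone).
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]    # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                     # r = b - A x for x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    k = 0
    checkpoint = (k, list(x), list(r), list(p), rz) # explicit copy of the state
    executed, redone, failed = 0, 0, False

    while k < max_iter and dot(r, r) ** 0.5 > tol:
        if fail_at is not None and k == fail_at and not failed:
            # Simulated failure: roll back to the last checkpoint.
            failed = True
            redone = k - checkpoint[0]              # iterations to repeat
            k, x, r, p, rz = (checkpoint[0], list(checkpoint[1]),
                              list(checkpoint[2]), list(checkpoint[3]),
                              checkpoint[4])
            continue
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        k += 1
        executed += 1
        if k % interval == 0:                       # periodic storage
            checkpoint = (k, list(x), list(r), list(p), rz)
    return x, executed, redone

if __name__ == "__main__":
    A, b = laplacian_1d(20), [1.0] * 20
    _, it_ok, _ = pcg_with_checkpoints(A, b, interval=4)
    _, it_fail, redone = pcg_with_checkpoints(A, b, interval=4, fail_at=5)
    print("failure-free iterations:", it_ok)
    print("iterations with one failure:", it_fail, "redone:", redone)
```

A larger `interval` means fewer state copies during failure-free execution, but more iterations repeated after a failure, which is the same tension the abstract describes for ESRP.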
