Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Paper Title

Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Authors

Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras

Abstract

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a remote faulty process indistinguishable from a slow process. In the case of a slow process without a fault, fault correction is undesirable, as it can trigger new faults that could be prevented with fault tolerance, a more proactive form of system maintenance. But in the case of an actual faulty process, fault tolerance alone, without eventually correcting persistent faults, can make systems underperform. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several energy, transport and health applications. This paper contributes a novel and general-purpose modeling of fault scenarios during system runtime. These scenarios are used to accurately measure and predict inconsistencies generated by the undesirable outcomes of fault correction and fault tolerance, as a means to improve the self-healing of large-scale decentralized systems at the design phase. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds in a prototyped decentralized network of 3000 nodes. Almost 9 million measurements of inconsistencies were collected in a network where each node monitors the health status of another node, while both can defect. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network data aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase.
