Paper Title
Failures and Fixes: A Study of Software System Incident Response
Paper Authors
Paper Abstract
This paper presents the results of a research study on software system failures, with the goal of understanding how we might better evolve, maintain, and support software systems in production. We qualitatively analyzed thirty incidents: fifteen collected through in-depth interviews with engineers, and fifteen sampled from publicly published incident reports (generally produced as part of postmortem reviews). Our analysis focused on understanding and categorizing how failures occurred, and how they were detected, investigated, and mitigated. We also captured analytic insights related to the current state of the practice and associated challenges in the form of 11 key observations. For example, we observed that failures can cascade through a system, leading to major outages, and that engineers often do not understand the scaling limits of the systems they support until those limits are exceeded. We argue that the challenges we have identified can lead to improvements in how systems are engineered and supported.