Paper Title
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
Authors
Abstract
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical and complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically, RCA investigation leverages data sources such as application error logs or service call traces. However, a rich goldmine of root cause information is also hidden in the natural language documentation of past incident investigations written by domain experts. This is generally termed Problem Review Board (PRB) data, which constitutes a core component of IT Incident Management. However, owing to the raw, unstructured nature of PRBs, such root cause knowledge is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely available data source to build an Incident Causation Analysis (ICA) engine, using SoTA neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple yet effective retrieval-based RCA for new incidents: given an incident symptom, an Information Retrieval system searches and ranks past incidents and detects likely root causes from them. In this work, we present ICA and the downstream Incident Search and retrieval-based RCA pipeline, built at Salesforce over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through various quantitative benchmarks, qualitative analysis, domain expert validation, and real incident case studies after deployment.
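As a rough illustration of the retrieval-based RCA idea described in the abstract (given a symptom, rank past incidents and surface their documented root causes), the following is a minimal sketch only. It swaps in plain TF-IDF with cosine similarity for the neural NLP techniques and Causal Knowledge Graph the abstract mentions, and the incident records, function names, and `top_k` parameter are all hypothetical, not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy records: (symptom text, documented root cause).
# These are illustrative only and do not come from the paper's PRB data.
PAST_INCIDENTS = [
    ("login requests timing out for users", "database connection pool exhausted"),
    ("email delivery delayed in queue", "message broker disk full"),
    ("login page returns errors after deploy", "bad configuration in auth service"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(tokens, idf):
    """Build a sparse TF-IDF vector, ignoring tokens unseen at index time."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return idf, [tfidf(toks, idf) for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_root_causes(symptom, incidents, top_k=2):
    """Return the top_k (score, root cause) pairs for a new incident symptom."""
    idf, vectors = build_index([s for s, _ in incidents])
    query = tfidf(tokenize(symptom), idf)
    scored = [(cosine(query, vec), cause)
              for vec, (_, cause) in zip(vectors, incidents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for score, cause in rank_root_causes(
            "users cannot login, requests timing out", PAST_INCIDENTS):
        print(f"{score:.3f}  {cause}")
```

A lexical baseline like this only matches on shared words. The pipeline described in the abstract instead extracts targeted information from PRB documents into a structured graph, which presumably allows matching beyond surface vocabulary.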