CAUSNET：基于世代订购的搜索，通过动态编程和父套件设置约束，以搜索最佳贝叶斯网络

论文标题

CAUSNET：基于世代订购的搜索，通过动态编程和父套件设置约束，以搜索最佳贝叶斯网络

CausNet : Generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraints

论文作者

Sharma, Nand, Millstein, Joshua

论文摘要

使用详尽的搜索找到全球最佳的贝叶斯网络是超出指数复杂性的问题，它严重限制了它可以使用的变量数量。我们实现了一种基于动态编程的算法，具有内置维度降低和父集识别。这大大降低了搜索空间，可以应用于大维数据。我们使用所谓的基于世代订购的搜索对最佳网络的搜索，这是一种新颖的方式，可以有效地搜索可能的网络空间。该算法支持连续和分类数据，以及分类和生存结果。我们证明了算法对合成和真实数据的功效。在模拟中，我们的算法的性能优于目前广泛使用的三种最新算法。然后，我们将其应用于具有513个基因和生存结果的卵巢癌基因表达数据集。我们的算法能够找到一个最佳网络，描述了由6个基因组成的疾病途径，该基因在基本计算机上几分钟内导致结果节点。我们基于世代订购的最佳网络的搜索是找到最佳贝叶斯网络的高效和高度可扩展的方法，可以应用于1000个变量。使用特定的参数 - 相关性，FDR截止和内度 - 可以增加或减少网络的节点和密度的数量。两个评分期权BIC和BGE的可用性以及生存结果和混合数据类型的实施使我们的算法非常适合许多类型的高维生物医学数据，以找到疾病途径。

Finding a globally optimal Bayesian Network using exhaustive search is a problem with super-exponential complexity, which severely restricts the number of variables that it can work for. We implement a dynamic programming based algorithm with built-in dimensionality reduction and parent set identification. This reduces the search space drastically and can be applied to large-dimensional data. We use what we call generational orderings based search for optimal networks, which is a novel way to efficiently search the space of possible networks given the possible parent sets. The algorithm supports both continuous and categorical data, and categorical as well as survival outcomes. We demonstrate the efficacy of our algorithm on both synthetic and real data. In simulations, our algorithm performs better than three state-of-art algorithms that are currently used extensively. We then apply it to an Ovarian Cancer gene expression dataset with 513 genes and a survival outcome. Our algorithm is able to find an optimal network describing the disease pathway consisting of 6 genes leading to the outcome node in a few minutes on a basic computer. Our generational orderings based search for optimal networks, is both efficient and highly scalable approach to finding optimal Bayesian Networks, that can be applied to 1000s of variables. Using specifiable parameters - correlation, FDR cutoffs, and in-degree - one can increase or decrease the number of nodes and density of the networks. Availability of two scoring option-BIC and Bge-and implementation of survival outcomes and mixed data types makes our algorithm very suitable for many types of high dimensional biomedical data to find disease pathways.

下载PDF全文

下载文献需遵守相关版权规定

论文标题