Paper Title
Random Partitioning Forest for Point-Wise and Collective Anomaly Detection -- Application to Intrusion Detection
Paper Authors
Paper Abstract
In this paper, we propose DiFF-RF, an ensemble approach composed of random partitioning binary trees, designed to detect point-wise and collective (as well as contextual) anomalies. Thanks to a distance-based paradigm used at the leaves of the trees, this semi-supervised approach addresses a drawback that has been identified in the isolation forest (IF) algorithm. Moreover, taking into account the frequencies of visits in the leaves of the random trees significantly improves the performance of DiFF-RF in the presence of collective anomalies. DiFF-RF is fairly easy to train, and excellent performance can be obtained by using a simple semi-supervised procedure to set up the extra hyper-parameter that is introduced. We first evaluate DiFF-RF on a synthetic data set to i) verify that the limitation of the IF algorithm is overcome, ii) demonstrate how collective anomalies are actually detected, and iii) analyze the effect of the meta-parameters it involves. We then assess the DiFF-RF algorithm on a large set of datasets from the UCI repository, as well as on two benchmarks related to intrusion detection applications. Our experiments show that DiFF-RF almost systematically outperforms the IF algorithm and also challenges the one-class SVM baseline and a deep learning variational auto-encoder architecture. Furthermore, our experiments show that DiFF-RF works well in the presence of small-scale learning data, a setting that is difficult for deep neural architectures. Finally, DiFF-RF is computationally efficient and can be easily parallelized on multi-core architectures.
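To make the two ingredients highlighted in the abstract more concrete (random partitioning binary trees, and leaf scores that combine a distance term with a visit-frequency term), the following is a minimal Python sketch of the general idea. It is not the authors' implementation: the tree-construction heuristics, the exponential distance kernel, the alpha weight, the way the frequency term is folded in, and the score aggregation are all simplifying assumptions made for illustration only.

# Illustrative sketch of a DiFF-RF-like scorer: an ensemble of randomly
# partitioned binary trees whose leaves store a centroid (for a distance
# term) and a training visit count (for a frequency term). Details are
# assumptions, not the paper's exact algorithm.
import numpy as np

def build_tree(X, depth, max_depth, rng):
    # Stop splitting when few points remain or max depth is reached;
    # keep the leaf centroid and the number of training points that reached it.
    if depth >= max_depth or len(X) <= 2:
        return {"leaf": True, "centroid": X.mean(axis=0), "freq": len(X)}
    d = rng.integers(X.shape[1])                      # random split dimension
    lo, hi = X[:, d].min(), X[:, d].max()
    if lo == hi:
        return {"leaf": True, "centroid": X.mean(axis=0), "freq": len(X)}
    s = rng.uniform(lo, hi)                           # random split value
    left, right = X[X[:, d] <= s], X[X[:, d] > s]
    return {"leaf": False, "dim": d, "split": s,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def leaf_of(tree, x):
    # Route a test point to its leaf in one tree.
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["dim"]] <= tree["split"] else tree["right"]
    return tree

def diff_rf_like_score(forest, x, n_train, alpha=1.0):
    # Per-tree score: similarity to the leaf centroid (point-wise term)
    # weighted by the leaf's relative training visit frequency (collective
    # term). alpha is a hypothetical hyper-parameter weighting the distance.
    per_tree = []
    for tree in forest:
        leaf = leaf_of(tree, x)
        sim = np.exp(-alpha * np.sum((x - leaf["centroid"]) ** 2))
        freq = leaf["freq"] / n_train
        per_tree.append(sim * freq)
    return -np.mean(per_tree)     # higher value -> more anomalous

# Toy usage: fit on "normal" Gaussian data, then score a typical point
# and a far outlier.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
forest = [build_tree(X_train, 0, max_depth=8, rng=rng) for _ in range(25)]
print(diff_rf_like_score(forest, rng.normal(size=4), len(X_train)))
print(diff_rf_like_score(forest, np.full(4, 6.0), len(X_train)))

In this sketch the distance term handles point-wise anomalies (points landing far from the centroid of their leaf), while the visit-frequency term is a stand-in for the mechanism the abstract credits for collective anomaly detection, where unusually frequent visits to otherwise normal regions become suspicious; the actual frequency statistic used in DiFF-RF may differ.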