Detection of Malicious Websites Using Machine Learning Techniques

Paper Title

Detection of Malicious Websites Using Machine Learning Techniques

Authors

Oshingbesan, Adebayo; Ekoh, Courage; Okobi, Chukwuemeka; Munezero, Aime; Richard, Kagame

Abstract

In detecting malicious websites, a common approach is the use of blacklists, which are not exhaustive and cannot generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and to understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different datasets and then carried out a cross-dataset analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently well across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model that predicts every link as malicious, across all metrics and datasets. Also, we found no evidence that any subset of lexical features generalizes across models or datasets. This research should be relevant to cybersecurity professionals and academic researchers, as it could form the basis for real-life detection systems or further research work.
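As a rough illustration of the two ideas the abstract relies on, the sketch below computes a handful of lexical features from a URL string and scores the "predict every link as malicious" baseline. The feature set, the function names `lexical_features` and `always_malicious_baseline`, and the label convention (1 = malicious, 0 = benign) are assumptions made for this example; the abstract does not list the features or metrics the authors actually used.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone.

    Hypothetical feature set: the paper's actual features are not given here.
    """
    # Add a scheme when missing so urlparse can separate host from path.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


def always_malicious_baseline(labels: list) -> dict:
    """Score a baseline that predicts every link as malicious (label 1)."""
    total = len(labels)
    positives = sum(labels)
    # Every prediction is positive, so accuracy equals precision.
    precision = positives / total if total else 0.0
    # All true positives are caught whenever at least one exists.
    recall = 1.0 if positives else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": precision, "precision": precision, "recall": recall, "f1": f1}
```

Feature dictionaries like these could be turned into numeric vectors and passed to any of the classifiers named in the abstract. The baseline shows why recall alone is a weak yardstick: it reaches perfect recall while its precision is only the fraction of malicious links in the data.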

Paper Title