DNS Typo-squatting Domain Detection: A Data Analytics & Machine Learning Based Approach

Paper Title

DNS Typo-squatting Domain Detection: A Data Analytics & Machine Learning Based Approach

Authors

Abdallah Moubayed, MohammadNoor Injadat, Abdallah Shami, Hanan Lutfiyya

Abstract

The Domain Name System (DNS) is a crucial component of current IP-based networks, as it is the standard mechanism for name-to-IP resolution. However, because it lacks data integrity and origin authentication processes, it is vulnerable to a variety of attacks. One such attack is typosquatting. Detecting this attack is particularly important, as it can threaten corporate secrets and can be used to steal information or commit fraud. In this paper, a machine learning-based approach is proposed to tackle the typosquatting vulnerability. To that end, exploratory data analytics is first used to better understand the trends observed in eight features extracted from domain names. Furthermore, a majority voting-based ensemble learning classifier built from five classification algorithms is proposed that can detect suspicious domains with high accuracy. Moreover, the observed trends are validated by studying the same features in an unlabeled dataset using the K-means clustering algorithm and by applying the developed ensemble learning classifier. Results show that legitimate domains have shorter domain names and fewer unique characters. Moreover, the developed ensemble learning classifier performs better in terms of accuracy, precision, and F-score. Furthermore, similar trends are observed when clustering is used. However, the number of domains that clustering identifies as potentially suspicious is high. Hence, the ensemble learning classifier is applied to the same data, and the number of domains identified as potentially suspicious drops by almost a factor of five while the same trends in the features' statistics are maintained.
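
As a rough illustration of two ideas mentioned in the abstract, the sketch below computes the two features it names (domain name length and number of unique characters) and combines five binary predictions by majority vote. It is not the authors' implementation: the function names `domain_features` and `majority_vote`, the choice to count the full name including the TLD and dots, and the example votes are all assumptions made for illustration. The remaining six features and the five base classifiers are not specified in the abstract and are not reproduced here.

```python
from collections import Counter


def domain_features(domain: str) -> dict:
    """Compute two of the domain-name features named in the abstract.

    Assumption: the whole name (including TLD and dots) is counted,
    after lowercasing and dropping a trailing root dot.
    """
    name = domain.lower().rstrip(".")
    return {
        "length": len(name),             # domain name length
        "unique_chars": len(set(name)),  # number of distinct characters
    }


def majority_vote(votes: list[int]) -> int:
    """Hard voting: return the label chosen by most base classifiers.

    With five binary voters (0 = legitimate, 1 = suspicious) a tie
    cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


features = domain_features("example.com")

# Hypothetical predictions from five base classifiers (1 = suspicious).
votes = [1, 0, 1, 1, 0]
label = majority_vote(votes)
```

In practice the votes would come from trained classifiers applied to the full feature vector; the hard-coded list above only stands in for their outputs.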
