Malware Classification Using Static Disassembly and Machine Learning

Paper Title

Malware Classification Using Static Disassembly and Machine Learning

Authors

Zhenshuo Chen, Eoin Brophy, Tomas Ward

Abstract

Network and system security are now critical issues. Because malware is proliferating rapidly, traditional analysis methods struggle to keep up with the enormous number of samples. In this paper, we propose four small-scale, easy-to-extract features for classifying malware families: the sizes of Windows PE sections, the permissions of those sections, content complexity, and import libraries. We use automatic machine learning to search for the best model and hyper-parameters for each feature and for their combinations. Compared with detailed behavior-related features such as API sequences, the proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but it is vulnerable to code obfuscation and encryption. The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams achieve only 57.96% accuracy and require a relatively high-dimensional feature set (5000 dimensions). In contrast, the proposed features combined with a classical machine learning algorithm (Random Forest) achieve 99.40% accuracy with a much smaller feature vector (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.
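To make the API 4-gram baseline mentioned in the abstract more concrete, here is a minimal sketch of how such a feature vector could be built. It is not the authors' implementation: the function names (`api_ngrams`, `build_vocabulary`, `vectorize`) are hypothetical, the input is assumed to be an ordered list of API names already recovered from a disassembly, and keeping the 5000 most frequent 4-grams is only an assumed way of arriving at a 5000-dimensional vector.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

NGram = Tuple[str, ...]


def api_ngrams(calls: Sequence[str], n: int = 4) -> List[NGram]:
    """Return every contiguous n-gram of an API call sequence."""
    # For sequences shorter than n the range is empty, so the result is [].
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def build_vocabulary(samples: Iterable[Sequence[str]],
                     n: int = 4,
                     max_features: int = 5000) -> List[NGram]:
    """Keep the most frequent n-grams across all samples.

    Assumption: the vocabulary is chosen by corpus frequency; the abstract
    only states the final dimensionality, not the selection rule.
    """
    counts = Counter()
    for calls in samples:
        counts.update(api_ngrams(calls, n))
    return [gram for gram, _ in counts.most_common(max_features)]


def vectorize(calls: Sequence[str],
              vocabulary: Sequence[NGram],
              n: int = 4) -> List[int]:
    """Map one sample to a count vector aligned with the vocabulary."""
    counts = Counter(api_ngrams(calls, n))
    return [counts[gram] for gram in vocabulary]
```

A vector produced this way has one entry per vocabulary n-gram, which illustrates why the abstract contrasts the 5000-dimensional 4-gram representation with the 40-dimensional set of proposed features.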
