Paper Title
VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection
Authors
Abstract
This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (VulDeePecker, Draper, ReVeal and µVulDeePecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training-data size and number of model parameters.
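The pipeline the abstract describes, a RoBERTa-style encoder topped with a classification head over tokenised functions, can be sketched with the Hugging Face `transformers` library. This is a minimal illustrative sketch, not the authors' implementation: the vocabulary size, model dimensions, and dummy inputs below are assumptions chosen only to make the example small and runnable.

```python
# Sketch (not the authors' code): a RoBERTa-style encoder with a
# classification head for binary vulnerability detection, as described
# in the abstract. All sizes here are illustrative assumptions; the
# real VulBERTa model is pre-trained on C/C++ code with a custom
# tokenisation pipeline before the classifier is trained.
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

config = RobertaConfig(
    vocab_size=50_000,      # assumed size of a custom code-token vocabulary
    hidden_size=64,         # tiny dimensions for a quick demo
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,           # binary task: vulnerable vs. not vulnerable
)
model = RobertaForSequenceClassification(config)  # randomly initialised

# Dummy batch of token IDs standing in for four tokenised C/C++ functions.
input_ids = torch.randint(0, config.vocab_size, (4, 32))
logits = model(input_ids=input_ids).logits
print(logits.shape)  # one score per class for each function
```

In practice the encoder weights would come from the pre-training stage rather than random initialisation, and the classifier would be fine-tuned on a labelled vulnerability dataset; the multi-class variants in the evaluation simply use a larger `num_labels`.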