Paper Title

Should You Mask 15% in Masked Language Modeling?

Paper Authors

Alexander Wettig, Tianyu Gao, Zexuan Zhong, Danqi Chen

Paper Abstract

Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT's 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.
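To make the masking-rate and 80-10-10 discussion concrete, below is a minimal sketch (not the authors' implementation) of uniform token masking with a configurable rate, plus BERT's 80-10-10 corruption rule applied to the selected positions. The token ids, `MASK_ID`, and `VOCAB_SIZE` are hypothetical placeholders for illustration; real values depend on the tokenizer.

```python
import random

# Hypothetical special-token id and vocabulary size for illustration only;
# actual values depend on the tokenizer (e.g. BERT's WordPiece vocabulary).
MASK_ID = 103
VOCAB_SIZE = 30522

def mask_tokens(token_ids, mask_rate=0.15, use_80_10_10=True, rng=random):
    """Return (corrupted_ids, target_positions) for MLM pre-training.

    mask_rate: fraction of tokens selected for prediction (the paper varies
        this from 15% up to 80%; 40% works better for BERT-large size models).
    use_80_10_10: apply BERT's corruption rule to the selected tokens
        (80% -> [MASK], 10% -> random token, 10% -> kept unchanged).
    """
    corrupted = list(token_ids)
    n_select = max(1, int(round(mask_rate * len(token_ids))))
    positions = rng.sample(range(len(token_ids)), n_select)
    for pos in positions:
        if not use_80_10_10:
            corrupted[pos] = MASK_ID
            continue
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = MASK_ID                     # replace with [MASK]
        elif r < 0.9:
            corrupted[pos] = rng.randrange(VOCAB_SIZE)   # replace with random token
        # else: keep the original token; it is still a prediction target
    return corrupted, sorted(positions)

# Example: mask 40% of a toy sequence, the rate the paper finds better for large models.
ids = [2023, 2003, 1037, 7099, 6251, 2005, 17809, 999]
corrupted, targets = mask_tokens(ids, mask_rate=0.40)
```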
