Title

Breaking BERT: Evaluating and Optimizing Sparsified Attention

Authors

Siddhartha Brahma, Polina Zablotskaia, David Mimno

Abstract

Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common finetuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighboring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighboring connections that gives a fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
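
To make the "neighboring tokens" sparsity pattern mentioned in the abstract concrete, here is a minimal PyTorch sketch of a banded attention mask applied to scaled dot-product attention. This is not the authors' implementation: the window width `w`, the function names `neighbor_mask` and `local_attention`, and the toy tensor sizes are assumptions chosen purely for illustration.

```python
import torch
import torch.nn.functional as F


def neighbor_mask(seq_len: int, w: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j only if |i - j| <= w."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= w


def local_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, w: int) -> torch.Tensor:
    """Scaled dot-product attention restricted to a local window of width w."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (..., L, L) attention logits
    mask = neighbor_mask(q.size(-2), w).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))    # remove non-neighbor connections
    return F.softmax(scores, dim=-1) @ v                 # each token sees at most 2*w + 1 positions


if __name__ == "__main__":
    L, d = 16, 8
    q, k, v = (torch.randn(1, L, d) for _ in range(3))
    out = local_attention(q, k, v, w=2)
    print(out.shape)  # torch.Size([1, 16, 8])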
