Paper Title

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Authors

Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu

Abstract

Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention suffers when handling inputs with extreme long-range dependencies, as its complexity grows quadratically with respect to the sequence length. Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer to perform attention across chunked sequences. The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer, which encode local sequence information and global context jointly and iteratively. This new design allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
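For intuition, the sketch below illustrates the general idea described in the abstract: token hidden states are grouped by clustering (k-means here) and full self-attention is computed only within each cluster, while a sliding window provides local context. This is a minimal illustration under assumed choices, not the authors' implementation; the function names, PyTorch/scikit-learn usage, and hyper-parameters (window size, stride, number of clusters) are hypothetical, and the query/key/value projections of a real Transformer layer are omitted.

```python
# Minimal sketch of clustering-based sparse attention (illustrative only;
# not the Cluster-Former authors' code). Hyper-parameters are assumptions.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def sliding_window_chunks(hidden, window=512, stride=256):
    """Split a long sequence of hidden states into overlapping local chunks,
    mimicking the role of a sliding-window layer."""
    return [hidden[start:start + window]
            for start in range(0, hidden.size(0), stride)]

def cluster_attention(hidden, num_clusters=8):
    """Group tokens by k-means over their hidden states, then run full
    self-attention only within each cluster (sparse over the whole sequence)."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(
        hidden.detach().cpu().numpy())
    output = torch.zeros_like(hidden)
    for c in range(num_clusters):
        idx = torch.from_numpy((labels == c).nonzero()[0])
        x = hidden[idx]                          # tokens assigned to cluster c
        scores = x @ x.t() / x.size(-1) ** 0.5   # scaled dot-product attention
        output[idx] = F.softmax(scores, dim=-1) @ x
    return output

# Toy usage: a 2048-token "sequence" of 64-dim hidden states.
hidden = torch.randn(2048, 64)
local_chunks = sliding_window_chunks(hidden)      # local context in chunks
globally_mixed = cluster_attention(hidden)        # cross-chunk information flow
print(globally_mixed.shape)                       # torch.Size([2048, 64])
```

Because attention is restricted to tokens that fall in the same cluster, the quadratic cost of full self-attention is avoided while still letting distant but semantically similar tokens exchange information across chunk boundaries.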
