Paper Title

Depth-Wise Attention (DWAtt): A Layer Fusion Method for Data-Efficient Classification

Paper Authors

Muhammad ElNokrashy, Badr AlKhamissi, Mona Diab

Paper Abstract

Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of needed samples or steps. To test this, we propose a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline -- all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows 3.68--9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.
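
To make the two fusion strategies contrasted in the abstract concrete, here is a minimal PyTorch sketch. The module names, the use of a final-layer query, the attention width `d_attn`, and the residual connection are illustrative assumptions for this sketch, not the authors' reference implementation; the abstract only states that DWAtt attends over intermediate layers while Concat concatenates them.

```python
# Minimal sketch of Concat vs. depth-wise attention layer fusion (assumptions noted above).
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Concat baseline: concatenate all layer outputs, project back to d_model."""

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(num_layers * d_model, d_model)

    def forward(self, layer_states):
        # layer_states: list of [batch, seq_len, d_model] tensors, one per layer.
        stacked = torch.cat(layer_states, dim=-1)       # [B, T, L * d]
        return self.proj(stacked)                       # [B, T, d]


class DepthWiseAttention(nn.Module):
    """DWAtt-style fusion (one plausible reading): for each token, a query
    derived from the final layer attends over that token's hidden states
    across depth, re-surfacing signals from non-final layers."""

    def __init__(self, d_model: int, d_attn: int = 64):
        super().__init__()
        self.query = nn.Linear(d_model, d_attn)
        self.key = nn.Linear(d_model, d_attn)
        self.value = nn.Linear(d_model, d_model)
        self.scale = d_attn ** -0.5

    def forward(self, layer_states):
        # layer_states: list of [B, T, d]; the last element is the final layer.
        depth = torch.stack(layer_states, dim=2)        # [B, T, L, d]
        q = self.query(layer_states[-1]).unsqueeze(2)   # [B, T, 1, d_attn]
        k = self.key(depth)                             # [B, T, L, d_attn]
        v = self.value(depth)                           # [B, T, L, d]
        scores = (q * k).sum(-1) * self.scale           # [B, T, L]
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # [B, T, L, 1]
        fused = (weights * v).sum(dim=2)                # [B, T, d]
        # Residual with the final layer keeps the standard path intact (assumed).
        return layer_states[-1] + fused
```

In use, `layer_states` would be the per-layer hidden states of a pretrained encoder (e.g. the tuple returned by a Hugging Face model with `output_hidden_states=True`), and the fused `[B, T, d]` output feeds the task classifier head in place of the final layer alone.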
