Paper Title
Multi-scale Transformer Language Models
Paper Authors
Paper Abstract
We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs. memory-footprint trade-offs; e.g., we show that it is possible to train a 30-layer hierarchical variant on the Toronto BookCorpus that has a 23% smaller memory footprint and better perplexity than a vanilla transformer with less than half the number of layers. We analyze the advantages of representations learned at multiple scales in terms of memory footprint, compute time, and perplexity, which are particularly appealing given the quadratic scaling of transformers' run time and memory usage with respect to sequence length.
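As a rough illustration of the quadratic-scaling motivation, the minimal PyTorch sketch below compares the size of the self-attention score matrix at full resolution with its size after average-pooling the sequence to a coarser scale. The pooling factor, tensor shapes, and the attention_scores helper are illustrative assumptions for this sketch; they are not the three hierarchical architectures proposed in the paper.

import torch
import torch.nn.functional as F

def attention_scores(x):
    # Raw self-attention scores: a (seq_len x seq_len) matrix,
    # so memory grows quadratically with sequence length.
    return x @ x.transpose(-2, -1) / x.size(-1) ** 0.5

seq_len, d_model, pool = 1024, 64, 4      # illustrative sizes, not taken from the paper
x = torch.randn(1, seq_len, d_model)

# Fine scale: attention over all 1024 positions.
fine = attention_scores(x)                               # shape (1, 1024, 1024)

# Coarse scale: average-pool the sequence by a factor of 4 before attending.
# This generic downsampling stands in for a learned multi-scale representation.
x_coarse = F.avg_pool1d(x.transpose(1, 2), kernel_size=pool).transpose(1, 2)
coarse = attention_scores(x_coarse)                      # shape (1, 256, 256)

print(fine.numel(), coarse.numel())   # 1048576 vs. 65536: ~16x fewer score entries

Pooling by a factor of 4 shrinks the score matrix by roughly 16x, which is the kind of memory/compute saving that makes coarser scales attractive when combined with fine-scale layers.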