Paper Title
Neural Machine Translation without Embeddings
Paper Authors
Paper Abstract
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
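To make the two core ideas concrete, here is a minimal PyTorch sketch (not the authors' implementation; the function names and the dropout rate are illustrative assumptions): text is encoded as UTF-8 bytes and mapped to fixed one-hot vectors instead of learned embeddings, and standard dropout applied to those one-hot inputs behaves as token dropout, since zeroing a row's single non-zero entry erases the entire token.

```python
# Minimal sketch (hypothetical helper names) of an embeddingless byte model input:
# (1) UTF-8 bytes with fixed one-hot vectors in place of a learned embedding layer;
# (2) decoder-input dropout over one-hot vectors, which amounts to token dropout.
import torch
import torch.nn.functional as F

VOCAB_SIZE = 256  # one "type" per possible byte value; no subword vocabulary needed


def bytes_to_one_hot(text: str) -> torch.Tensor:
    """Encode a string as a (seq_len, 256) matrix of one-hot byte vectors."""
    byte_ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    return F.one_hot(byte_ids, num_classes=VOCAB_SIZE).float()


def decoder_input_dropout(one_hot: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Elementwise dropout on one-hot inputs. Each row has exactly one non-zero
    entry, so dropping that entry erases the whole token: this is token dropout
    at rate p (up to dropout's usual 1/(1-p) rescaling of surviving entries)."""
    return F.dropout(one_hot, p=p, training=True)


x = bytes_to_one_hot("résumé")  # 8 bytes: each accented character takes 2 bytes
print(x.shape)                  # torch.Size([8, 256])
print(decoder_input_dropout(x).count_nonzero(dim=-1))  # some rows fully zeroed
```

Because the one-hot matrix is fixed rather than learned, the usual embedding lookup reduces to selecting a row of the identity matrix, which is why the paper can drop the embedding layer entirely when the number of token types (256) is smaller than the model dimension.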