Paper title
LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
Paper authors
Abstract
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, compression-based approaches have a drawback: because they are deterministic, generating multiple segmentations of the same input is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm called locally-consistent parsing (LCP), which has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and show that it outperforms various baselines, particularly when learning from small training data.
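To make the determinism issue concrete, the following is a minimal sketch of the BPE/BPE-dropout baseline mentioned in the abstract, not of LCP-dropout itself. The function name bpe_segment, the toy merge table MERGES, and the dropout values are illustrative assumptions and do not come from the paper. With dropout set to 0.0 the same word always yields the same segmentation; with a positive dropout, each candidate merge is skipped at random, so repeated calls can yield different segmentations of the same word.

```python
import random

# Hypothetical toy merge table: (left, right) -> rank, lower rank = higher priority.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}


def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` using ranked BPE merges.

    At every step, each applicable merge is skipped with probability
    `dropout` (the BPE-dropout idea); dropout=0.0 is plain deterministic BPE.
    """
    rng = rng or random.Random()
    tokens = list(word)  # start from single characters
    while True:
        # Applicable merges that survive dropout at this step.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in merges and rng.random() >= dropout
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


if __name__ == "__main__":
    print(bpe_segment("lower", MERGES))  # deterministic: ['lower']
    rng = random.Random(0)
    for _ in range(3):
        # Sampled segmentations; they may differ from call to call.
        print(bpe_segment("lower", MERGES, dropout=0.5, rng=rng))
```

Every sampled segmentation still concatenates back to the original word; only the subword boundaries vary, which is what exposes the translation model to multiple segmentations during training.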