Title
GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation
Authors
Abstract
We introduce GENIUS: a conditional text generation model using sketches as input, which can fill in the missing contexts for a given sketch (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction from sketch objective using an extreme and selective masking strategy, enabling it to generate diverse and high-quality texts given sketches. Comparison with other competitive conditional language models (CLMs) reveals the superiority of GENIUS's text generation quality. We further show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks. Most existing textual data augmentation methods are either too conservative, by making small changes to the original text, or too aggressive, by creating entirely new samples. With GENIUS, we propose GeniusAug, which first extracts the target-aware sketches from the original training set and then generates new samples based on the sketches. Empirical experiments on 6 text classification datasets show that GeniusAug significantly improves the models' performance in both in-distribution (ID) and out-of-distribution (OOD) settings. We also demonstrate the effectiveness of GeniusAug on named entity recognition (NER) and machine reading comprehension (MRC) tasks. (Code and models are publicly available at https://github.com/microsoft/SCGLab and https://github.com/beyondguo/genius)
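To make the notion of a "sketch" concrete: it is the key spans of a text kept in order and concatenated by mask tokens, which the model then expands into full text. The toy function below illustrates that input format only; the stopword filter is a hypothetical stand-in for the paper's actual target-aware sketch extraction, and `<mask>` is assumed as the mask token.

```python
# Illustrative (not the paper's method): build a GENIUS-style sketch by
# keeping runs of salient tokens and collapsing each removed span into a
# single <mask> token. Salience here is a toy stopword filter.

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to",
             "in", "on", "for", "and", "or", "with", "that", "this"}

def make_sketch(text: str, mask_token: str = "<mask>") -> str:
    """Keep contiguous runs of non-stopword tokens; replace each
    masked-out gap between them with one mask token."""
    pieces, gap = [], False
    for tok in text.split():
        if tok.lower().strip(".,") in STOPWORDS:
            gap = True                      # token is masked out
        else:
            if gap and pieces:
                pieces.append(mask_token)   # collapse gap into one mask
            pieces.append(tok)
            gap = False
    if gap:
        pieces.append(mask_token)           # trailing masked span
    return " ".join(pieces)

print(make_sketch("The model is pre-trained on a large corpus of text"))
# → model <mask> pre-trained <mask> large corpus <mask> text
```

A conditional generator pre-trained with the reconstruction-from-sketch objective would take such a sketch as input and produce a fluent full sentence containing the kept spans.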