Paper Title
BootAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework
Authors
Abstract
Text augmentation is an effective technique for addressing insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a performance drop on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose BootAug, a hybrid instance-filtering framework based on pre-trained language models that keeps the feature space of augmented instances similar to that of the natural dataset. BootAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves augmentation performance, by $\approx 2$-$3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BootAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.
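The abstract does not spell out the filtering criteria, so the following is only an illustrative sketch of what instance filtering on augmented data might look like, not the BootAug implementation. It assumes two hypothetical criteria: an augmented instance is kept only if a classifier still predicts its original label, and does so with at least a chosen confidence. The names `filter_augmented`, `toy_predict_proba`, and `min_confidence` are invented for this example, and the keyword-based scorer merely stands in for a fine-tuned pre-trained language model.

```python
from typing import Callable, List, Sequence, Tuple


def filter_augmented(
    augmented: Sequence[Tuple[str, int]],
    predict_proba: Callable[[str], List[float]],
    min_confidence: float = 0.7,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, int]]:
    """Keep augmented (text, label) pairs that the classifier still assigns
    to their original label with probability >= min_confidence."""
    kept = []
    for text, label in augmented:
        probs = predict_proba(text)
        # Index of the most probable class (ties resolve to the lowest index).
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[label] >= min_confidence:
            kept.append((text, label))
    return kept


def toy_predict_proba(text: str) -> List[float]:
    """Hypothetical stand-in for a fine-tuned PLM classifier.
    Returns [P(negative), P(positive)] from smoothed keyword counts."""
    positive_words = {"good", "great", "tasty"}
    negative_words = {"bad", "awful", "bland"}
    tokens = text.lower().split()
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    p_pos = (pos + 1) / (pos + neg + 2)
    return [1 - p_pos, p_pos]


# Augmented variants that all inherit the positive label (1) of their source.
augmented = [
    ("the food was great and tasty", 1),  # label preserved, confident
    ("the food was bad", 1),              # augmentation flipped the sentiment
    ("the food was fine", 1),             # classifier is uncertain
]
kept = filter_augmented(augmented, toy_predict_proba)
```

In this toy run only the first variant survives; the other two are discarded because the stand-in classifier either disagrees with the inherited label or is not confident enough. In practice `predict_proba` would wrap a real model, and the threshold would need tuning per dataset.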