BootAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework

Paper Title

BootAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework

Authors

Heng Yang, Ke Li

Abstract

Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BootAug) based on pre-trained language models that can maintain a similar feature space with natural datasets. BootAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves the augmentation performance by $\approx 2-3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BootAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.
