Paper Title

Empirical Study of Text Augmentation on Social Media Text in Vietnamese

Authors

Luu, Son T.; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy

Abstract

In text classification, label imbalance in a dataset degrades the performance of classification models. In practice, user comments on social networking sites are not all visible: administrators often allow only positive comments and hide negative ones. As a result, collected comment data is usually skewed toward one label, which leaves the dataset imbalanced and weakens the model. Data augmentation techniques are applied to reduce the imbalance between classes and thereby improve the prediction model's accuracy. In this paper, we apply augmentation techniques to the VLSP2019 Hate Speech Detection on Vietnamese social texts corpus and the UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. Augmentation increases the F1-macro score by about 1.5% on both corpora.
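
The idea in the abstract, augmenting under-represented classes and then scoring with macro-averaged F1, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the augmentation operations used, so random word swapping stands in as one simple possibility, and the helper names `random_swap`, `balance_by_augmentation`, and `macro_f1` as well as the toy sentences are hypothetical rather than taken from the paper or its corpora.

```python
import random
from collections import Counter


def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def balance_by_augmentation(texts, labels, seed=0):
    """Append augmented copies of minority-class texts until every class
    has as many examples as the largest class.

    Assumption: word swapping is only a stand-in for whatever augmentation
    the paper actually applies.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_texts, new_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            source = rng.choice(pool)
            new_texts.append(" ".join(random_swap(source.split(), rng)))
            new_labels.append(label)
    return new_texts, new_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three "clean" comments and one "hate" comment.
texts = ["nice post thanks", "great work team", "very helpful video", "you are awful"]
labels = ["clean", "clean", "clean", "hate"]
aug_texts, aug_labels = balance_by_augmentation(texts, labels)
balanced_counts = Counter(aug_labels)  # both classes now have 3 examples
```

Macro averaging is the relevant metric here because a classifier that always predicts the majority label can reach high accuracy while scoring zero F1 on the minority class, which pulls the macro average down.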
