Paper Title

Improving Indonesian Text Classification Using Multilingual Language Model

Authors

Putra, Ilham Firdausi; Purwarianti, Ayu

Abstract

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data when building Indonesian text classification models (e.g., sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe the model's performance across various Indonesian data sizes and amounts of added English data. The experiments show that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further show its effectiveness in utilizing English data to build Indonesian text classification models.
