Paper Title
MiLMo:Minority Multilingual Pre-trained Language Model
Paper Authors
Abstract
Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand multiple languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little work on low-resource languages such as minority languages, and public multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of datasets for minority languages and to verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream-task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models, and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.