Paper Title
Automatic Language Identification for Celtic Texts
Authors
Abstract
Language identification is an important Natural Language Processing task that has been thoroughly researched in the literature, yet some issues remain open. This work addresses the identification of related low-resource languages, using the Celtic language family as an example. Its main goals were: (1) to collect a dataset of three Celtic languages; (2) to prepare a method for identifying languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods and explore the applicability of unsupervised models as a feature extraction technique; and (4) to experiment with unsupervised feature extraction on a reduced annotated set. We collected a new dataset of Irish, Scottish Gaelic, Welsh and English records. We tested supervised models such as SVMs and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods. The analysis showed that the unsupervised features can serve as a valuable extension to the n-gram feature vectors, improving performance on the more entangled classes. The best model achieved a 98% F1 score and a 97% Matthews correlation coefficient (MCC), and the dense neural network consistently outperformed the SVM model. Low-resource languages are also challenging because annotated training data is scarce. To address this, we evaluated the classifiers with unsupervised feature extraction on a reduced labelled dataset. The results showed that the unsupervised feature vectors are more robust to the reduction of the labelled set, so they help achieve comparable classification performance with much less labelled data.
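To make the idea of n-gram feature vectors concrete, the following is a minimal sketch, not the authors' pipeline. It builds character n-gram count vectors and classifies a text by cosine similarity to per-language profiles, a simple stand-in for the SVM and dense network named in the abstract. The function names, the n-gram range of 1 to 3, and the toy sentences are all assumptions made for illustration; none of them come from the paper or its dataset.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in lower-cased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(samples):
    """Sum the n-gram counts of all training texts for each language label."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(text, profiles):
    """Return the label whose profile is most similar to the text."""
    feats = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(feats, profiles[label]))


# Hypothetical toy sentences, far too few for a real model.
TOY_SAMPLES = [
    ("Dia duit, conas atá tú?", "Irish"),
    ("Tá an aimsir go maith inniu", "Irish"),
    ("Ciamar a tha thu an-diugh?", "Scottish Gaelic"),
    ("Tha i brèagha an-diugh", "Scottish Gaelic"),
    ("Bore da, sut wyt ti heddiw?", "Welsh"),
    ("Mae hi'n bwrw glaw", "Welsh"),
    ("Good morning, how are you?", "English"),
    ("The weather is nice today", "English"),
]

PROFILES = train_profiles(TOY_SAMPLES)
print(predict("Sut wyt ti?", PROFILES))
```

In a fuller setup, vectors like these would be passed to a trained classifier and, following the abstract, extended with features produced by clustering, autoencoder, or topic models.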