Paper Title
Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets
Paper Authors
Paper Abstract
The availability of different pre-trained semantic models enables the rapid development of machine learning components for downstream applications. Despite the availability of abundant text data for low-resource languages, only a few semantic models are publicly available. The publicly available pre-trained models are usually built as multilingual versions of semantic models that cannot fit each language well due to contextual variation. In this work, we introduce different semantic models for Amharic. After experimenting with the existing pre-trained semantic models, we train and fine-tune nine new models using a monolingual text corpus. The models are built using word2vec embeddings, a distributional thesaurus (DT), contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and investigate their impact. We find that the newly trained models perform better than the pre-trained multilingual models. Furthermore, models based on contextual embeddings from RoBERTa perform better than the word2vec models.
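
To make the word2vec part of the setup concrete, here is a minimal sketch of training static embeddings on a monolingual Amharic corpus with gensim. The corpus file name and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: training word2vec embeddings on a monolingual corpus.
# "amharic_corpus.txt" is a placeholder: one whitespace-tokenized sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("amharic_corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimensionality (assumed)
    window=5,         # context window size
    min_count=5,      # ignore rare words
    sg=1,             # skip-gram; set sg=0 for CBOW
    workers=4,
)
model.save("amharic_word2vec.model")

# Query nearest neighbours of a word that appears in the vocabulary:
# print(model.wv.most_similar("ኢትዮጵያ", topn=10))
```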
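Similarly, a hedged sketch of extracting contextual embeddings from a RoBERTa-style model with Hugging Face transformers; "amharic-roberta" is a placeholder model name, not the authors' published checkpoint.

```python
# Minimal sketch: token- and sentence-level contextual embeddings
# from a RoBERTa-style encoder via Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("amharic-roberta")  # placeholder name
model = AutoModel.from_pretrained("amharic-roberta")
model.eval()

text = "..."  # an Amharic sentence
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: shape (batch, seq_len, hidden_size).
token_embeddings = outputs.last_hidden_state
# A simple sentence representation by mean pooling over tokens.
sentence_embedding = token_embeddings.mean(dim=1)
```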
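Finally, a sketch of how DT embeddings can be obtained via a network embedding algorithm, here a DeepWalk-style procedure: the distributional thesaurus is treated as a graph whose nodes are words and whose edges connect distributionally similar words, random walks over the graph serve as "sentences", and word2vec is trained on those walks. The edge list and parameters below are hypothetical, for illustration only.

```python
# Hedged sketch: DeepWalk-style DT embeddings (nodes = words,
# edges = distributional similarity) trained with gensim word2vec.
import random
import networkx as nx
from gensim.models import Word2Vec

# Hypothetical DT entries: (word, similar word, similarity score).
dt_edges = [("ቤት", "መኖሪያ", 0.8), ("ቤት", "ህንጻ", 0.6)]

graph = nx.Graph()
for w1, w2, score in dt_edges:
    graph.add_edge(w1, w2, weight=score)

def random_walk(g, start, length=10):
    """Uniform random walk of fixed length starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(g.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Several walks per node form the training "corpus" for word2vec.
walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(20)]
dt_embeddings = Word2Vec(walks, vector_size=100, window=5, min_count=1, sg=1)
```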