NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Paper Title

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Authors

Muhammad, Shamsuddeen Hassan; Adelani, David Ifeoluwa; Ruder, Sebastian; Ahmad, Ibrahim Said; Abdulmumin, Idris; Bello, Bello Shehu; Choudhury, Monojit; Emezue, Chris Chinenye; Abdullahi, Saheed Salahudeen; Aremu, Anuoluwapo; Jeorge, Alipio; Brazdil, Pavel

Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
