Paper Title
Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation
Paper Authors
Paper Abstract
Social media sites such as YouTube and Facebook have become an integral part of everyone's life, and in the last few years hate speech in social media comment sections has increased rapidly. Detecting hate speech on social media websites faces a variety of challenges, including small, imbalanced datasets, finding an appropriate model, and choosing a feature analysis method. Furthermore, this problem is more severe for the Bengali-speaking community due to the lack of gold-standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All the comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok & meme. A total of 50 annotators took part; each comment was annotated three times, and the majority vote was taken as the final annotation. We also ran baseline experiments and several deep learning models with pre-trained Bengali word embeddings such as Word2Vec, FastText, and BengFastText on this dataset to facilitate future research. The experiments showed that although all deep learning models performed well, SVM achieved the best result with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of Bengali hate speech detection.
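
The majority-vote step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label names (`hate`, `not_hate`), the comment identifiers, and the tie-handling rule are assumptions made purely for illustration, since the abstract does not specify them.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label chosen by most annotators, or None if there is no clear winner."""
    counts = Counter(annotations).most_common()
    if not counts:
        return None  # no annotations at all
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top two labels (assumed handling)
    return counts[0][0]


# Hypothetical data: three annotations per comment, as in the abstract.
votes = {
    "comment_1": ["hate", "hate", "not_hate"],
    "comment_2": ["not_hate", "not_hate", "not_hate"],
}

final = {comment_id: majority_label(labels) for comment_id, labels in votes.items()}
print(final)
```

With three annotations and two possible labels a tie cannot occur, so the `None` branch only matters if the label set is larger or an annotation is missing.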