Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language

Paper title

Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language

Authors

Ahmad, Istiak; AlQurashi, Fahad; Mehmood, Rashid

Abstract

Research in Natural Language Processing (NLP) has become increasingly important due to applications such as text classification, text mining, sentiment analysis, POS tagging, named entity recognition, textual entailment, and many others. This paper introduces several machine learning (ML) and deep learning (DL) methods with manual and automatic labelling for news classification in the Bangla language. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods. To investigate performance, we developed from scratch Potrika, the largest and most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles in eight distinct categories, curated from six popular online news portals in Bangladesh for the period 2014-2020. For manually labelled data, GRU with FastText achieves the highest accuracy, 91.83%. For automatically labelled data, KNN with Doc2Vec achieves the highest accuracy: 57.72% for single-label data and 75% for multi-label data. The methods developed in this paper are expected to advance research in Bangla and other languages.
