Paper Title
AraDIC: Arabic Document Classification using Image-Based Character Embeddings and Class-Balanced Loss
Authors
Abstract
Classical techniques and some deep learning techniques for Arabic text classification often depend on complex morphological analysis, word segmentation, and hand-crafted feature engineering. These steps can be eliminated by using character-level features. We propose a novel end-to-end Arabic document classification framework, the Arabic document image-based classifier (AraDIC), inspired by work on image-based character embeddings. AraDIC consists of an image-based character encoder and a classifier, which are trained end to end using the class-balanced loss to deal with the long-tailed data distribution problem. To evaluate the effectiveness of AraDIC, we created and published two datasets, the Arabic Wikipedia title (AWT) dataset and the Arabic poetry (AraP) dataset. To the best of our knowledge, this is the first image-based character embedding framework to address Arabic text classification. We also present the first deep learning-based text classifier evaluated extensively on Modern Standard Arabic, colloquial Arabic, and Classical Arabic. AraDIC improves over classical and deep learning baselines by 12.29% and 23.05% in micro and macro F-score, respectively.
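The abstract names a class-balanced loss but does not spell out its form. The sketch below assumes the common "effective number of samples" formulation, in which class c with n_c training examples is weighted by (1 - beta) / (1 - beta^n_c) and the weights are rescaled to sum to the number of classes. The function names, the value of beta, and the example class counts are illustrative assumptions, not details taken from the paper.

```python
import math


def class_balanced_weights(class_counts, beta=0.999):
    """Per-class weights (1 - beta) / (1 - beta**n), rescaled to sum to the class count.

    Assumption: every class has at least one example (n >= 1) and 0 <= beta < 1.
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]


def class_balanced_cross_entropy(logits, label, weights):
    """Softmax cross-entropy for one example, scaled by the weight of its true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[label] * (logits[label] - log_z)


# Hypothetical long-tailed class counts: one frequent class and two rarer ones.
counts = [5000, 500, 50]
weights = class_balanced_weights(counts)

# With identical logits, the loss is larger when the true class is rare.
loss_frequent = class_balanced_cross_entropy([0.0, 0.0, 0.0], 0, weights)
loss_rare = class_balanced_cross_entropy([0.0, 0.0, 0.0], 2, weights)
```

Under this weighting, rare classes contribute more to the loss per example than frequent ones, which is one way to counter a long-tailed label distribution. Setting beta to 0 gives uniform weights, and values of beta close to 1 approach inverse-frequency weighting.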