Paper title
Towards Boosting the Accuracy of Non-Latin Scene Text Recognition
Paper authors
Abstract
Scene-text recognition is remarkably better for Latin-script languages than for non-Latin languages due to several factors, such as the number of available fonts, simpler vocabulary statistics, up-to-date data generation tools, and writing systems. This paper examines the possible reasons for the lower accuracy by comparing English datasets with those of non-Latin languages. We compare various features, such as the size (width and height) of the word images and word-length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene-text recognition. We perform several controlled experiments on English, varying (i) the number of fonts used to create the synthetic data and (ii) the number of word images created. We find that both factors are critical for scene-text recognition systems. The English synthetic datasets use over 1400 fonts, while Arabic and other non-Latin datasets use fewer than 100 fonts for data generation. Since some of these languages are used across different regions, we gather additional fonts through a region-based search to improve the scene-text recognition models for Arabic and Devanagari. We improve the Word Recognition Rates (WRRs) on the Arabic MLT-17 and MLT-19 datasets by 24.54% and 2.32%, respectively, compared to previous works or baselines. We also achieve WRR gains of 7.88% and 3.72% on the IIIT-ILST and MLT-19 Devanagari datasets, respectively.
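The abstract reports results as Word Recognition Rate and compares datasets by word length and word-image size. The following is a minimal sketch of how such numbers could be computed. It assumes WRR is the percentage of word images whose predicted transcription exactly matches the ground truth. The abstract does not state the exact protocol, for example case folding or Unicode normalization. The record type `WordSample` and the helpers `word_recognition_rate` and `dataset_stats` are hypothetical names for illustration, not the authors' code.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Sequence


@dataclass
class WordSample:
    """Hypothetical record for one cropped word image: label and pixel size."""
    text: str
    width: int
    height: int


def word_recognition_rate(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Percentage of word images whose prediction exactly matches the label.

    Assumption: exact string match, with no case folding or normalization.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def dataset_stats(samples: Sequence[WordSample]) -> dict:
    """Summary of word length and word-image size for a dataset.

    Note: len() counts Unicode code points, not visible glyphs, so Devanagari
    or Arabic words with combining marks count as longer than they look.
    """
    lengths = [len(s.text) for s in samples]
    return {
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "mean_width": mean(s.width for s in samples),
        "mean_height": mean(s.height for s in samples),
    }


if __name__ == "__main__":
    # Toy example: 3 of 4 predictions match, so WRR is 75.0.
    print(word_recognition_rate(["hello", "world", "sign", "exit"],
                                ["hello", "word", "sign", "exit"]))
    print(dataset_stats([WordSample("cat", 60, 20), WordSample("house", 100, 30)]))
```

For scripts such as Arabic and Devanagari, the same visible word can be encoded as different code-point sequences. A real evaluation would therefore likely normalize both strings (e.g. with `unicodedata.normalize("NFC", ...)`) before comparing. This is an assumption, not something the abstract specifies.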