Paper Title
Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition
Paper Authors
Paper Abstract
Recently, inspired by the Transformer, self-attention-based scene text recognition approaches have achieved outstanding performance. However, we find that the model size expands rapidly as the lexicon grows. Specifically, the number of parameters in the softmax classification layer and the output embedding layer is proportional to the vocabulary size. This hinders the development of lightweight text recognition models, especially for Chinese and multilingual text. Thus, we propose a lightweight scene text recognition model named Hamming OCR. In this model, a novel Hamming classifier, which adopts a locality sensitive hashing (LSH) algorithm to encode each character, is proposed to replace the softmax regression, and the generated LSH codes are directly employed to replace the output embedding. We also present a simplified Transformer decoder that reduces the number of parameters by removing the feed-forward network and using a cross-layer parameter sharing technique. Compared with traditional methods, the number of parameters in both the classification and embedding layers is independent of the vocabulary size, which significantly reduces the storage requirement without loss of accuracy. Experimental results on several datasets, including four public benchmarks and a Chinese text dataset synthesized by SynthText with more than 20,000 characters, show that Hamming OCR achieves competitive results.
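The core idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's actual architecture): each character class is assigned a binary code via random-hyperplane LSH, and decoding picks the class whose code is nearest in Hamming distance to the binarized model output. All names and dimensions here are illustrative assumptions; in the paper the codes interact with learned embeddings and a Transformer decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 20000   # e.g. a large Chinese lexicon
embed_dim = 512      # character feature dimension (assumed)
code_bits = 128      # LSH code length; note: independent of vocab_size

# Stand-in character embeddings (the paper would use learned ones).
char_embeddings = rng.standard_normal((vocab_size, embed_dim))

# Random hyperplanes define the locality-sensitive hash: nearby
# embeddings tend to fall on the same side of each hyperplane.
hyperplanes = rng.standard_normal((embed_dim, code_bits))

# One binary code per character: sign of the projection onto each hyperplane.
char_codes = (char_embeddings @ hyperplanes > 0).astype(np.uint8)  # (V, code_bits)

def hamming_decode(output_vec: np.ndarray) -> int:
    """Binarize a decoder output vector and return the class index
    whose LSH code has the smallest Hamming distance to it."""
    bits = (output_vec > 0).astype(np.uint8)
    dists = np.count_nonzero(char_codes != bits, axis=1)
    return int(np.argmin(dists))

# Sanity check: a real-valued vector whose signs match class 123's code
# decodes back to class 123.
target = 123
output = np.where(char_codes[target] == 1, 1.0, -1.0)
assert hamming_decode(output) == target
```

The parameter-saving argument is visible here: a conventional softmax head needs a `vocab_size × embed_dim` weight matrix, while the Hamming scheme only needs the model to produce `code_bits` outputs, so the head's parameter count no longer scales with the lexicon.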