Paper Title
Aggregated Text Transformer for Scene Text Detection
Authors
Abstract
This paper explores a multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer (ATTR), which is designed to represent text in scene images with a multi-scale self-attention mechanism. Starting from an image pyramid with multiple resolutions, features are first extracted at different scales with shared weights and then fed into a Transformer encoder-decoder architecture. The multi-scale image representations are robust and contain rich information on text content of various sizes. The text Transformer aggregates these features to learn interactions across different scales and improve the text representation. The proposed method detects scene text by representing each text instance as an individual binary mask, which is tolerant of curved text and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.
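
The data flow described in the abstract (image pyramid, shared-weight feature extraction, attention over tokens from all scales) can be illustrated with a small pure-Python sketch. This is not the ATTR implementation. It assumes a grayscale image stored as a list of lists, uses a toy linear patch projection in place of the real backbone, uses single-head self-attention with identity query/key/value projections in place of the Transformer encoder, and omits the decoder and mask prediction entirely. All function names (`build_pyramid`, `extract_tokens`, `aggregate_multiscale`, and so on) are hypothetical.

```python
import math


def downsample2x(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_pyramid(img, levels):
    """Return [full resolution, 1/2, 1/4, ...] versions of the image."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid


def extract_tokens(img, weights, patch=2):
    """Toy stand-in for a backbone: project every patch with the SAME weights.

    `weights` has one row per output channel and patch*patch columns.
    """
    tokens = []
    for i in range(0, len(img) - patch + 1, patch):
        for j in range(0, len(img[0]) - patch + 1, patch):
            flat = [img[i + di][j + dj]
                    for di in range(patch) for dj in range(patch)]
            tokens.append([sum(w * x for w, x in zip(row, flat))
                           for row in weights])
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product self-attention (identity Q/K/V, an assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append([sum(a * v[c] for a, v in zip(attn, tokens))
                    for c in range(d)])
    return out


def aggregate_multiscale(img, weights, levels=3):
    """Extract tokens at every pyramid level with shared weights, then let
    tokens from all scales attend to each other in one sequence."""
    tokens = []
    for level in build_pyramid(img, levels):
        tokens.extend(extract_tokens(level, weights))
    return self_attention(tokens)
```

For an 8x8 input with three pyramid levels and 2x2 patches, the sketch produces 16 + 4 + 1 = 21 tokens, and every output token is a weighted mix of tokens from all three scales. Reusing one `weights` matrix at every level is what "shared weights" means here.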