Paper Title
Image Captioning through Image Transformer
Paper Authors
Paper Abstract
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the \textit{transformer} architecture for image captioning. However, the structure of the \textit{semantic units} in images (usually the detected regions from an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the \textbf{\textit{image transformer}}, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.
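To make the "widened" encoder layer concrete, below is a minimal PyTorch sketch assuming each encoder layer runs several parallel self-attention branches over detected-region features and fuses them before the feed-forward sub-layer. The three-branch split, the per-branch spatial masks, and the averaging fusion are illustrative assumptions for exposition, not the paper's exact design.

```python
# Minimal sketch of a "widened" transformer encoder layer over image region
# features. The parallel attention branches (e.g. for different spatial
# relations between regions) and the fusion by averaging are assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn


class WidenedEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_branches=3, d_ff=2048):
        super().__init__()
        # One self-attention sub-layer per (assumed) spatial-relation branch.
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_branches)]
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, regions, branch_masks=None):
        # regions: (batch, n_regions, d_model) detected-region features.
        # branch_masks: optional list of (n_regions, n_regions) attention
        # masks, one per branch, encoding spatial relations between regions.
        outputs = []
        for i, attn in enumerate(self.branches):
            mask = None if branch_masks is None else branch_masks[i]
            out, _ = attn(regions, regions, regions, attn_mask=mask)
            outputs.append(out)
        # Fuse the widened branches (here: a simple average), add the
        # residual connection, then apply the feed-forward sub-layer.
        fused = torch.stack(outputs).mean(dim=0)
        x = self.norm1(regions + fused)
        return self.norm2(x + self.ffn(x))


# Usage example: 36 detected regions per image, 512-d region features.
layer = WidenedEncoderLayer()
feats = torch.randn(2, 36, 512)
print(layer(feats).shape)  # torch.Size([2, 36, 512])
```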