MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

Paper Title

MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

Authors

Citamak, Begum; Caglayan, Ozan; Kuyu, Menekse; Erdem, Erkut; Erdem, Aykut; Madhyastha, Pranava; Specia, Lucia

Abstract

Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of a video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper, we target Turkish, a morphologically rich and agglutinative language whose properties differ greatly from those of English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research on video captioning in Turkish, the parallel English-Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich and agglutinative languages.
