Paper Title

I see what you hear: a vision-inspired method to localize words

Paper Authors

Mohammad Samragh, Arnav Kundu, Ting-Yao Hu, Minsik Cho, Aman Chadha, Ashish Shrivastava, Oncel Tuzel, Devang Naik

Paper Abstract

This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that audio can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94% and improves the F1 score by 6.5%.
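To make the core analogy concrete: when audio is treated as a 1-dimensional image, a word's "bounding box" collapses to an interval defined by an offset and a duration, and the standard detection overlap metric becomes a 1-D intersection-over-union. The sketch below is illustrative only (not the authors' code); the function name and the example timings are hypothetical.

```python
# Illustrative sketch of the 1-D bounding-box idea from the abstract
# (not the paper's implementation). A word's "box" is an interval
# given by (offset, duration) in seconds.

def interval_iou(pred, gt):
    """1-D intersection-over-union between two (offset, duration) intervals."""
    p_start, p_end = pred[0], pred[0] + pred[1]
    g_start, g_end = gt[0], gt[0] + gt[1]
    # Overlap length, clamped at zero for disjoint intervals.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: a keyword predicted at 1.2 s lasting 0.5 s,
# against a ground-truth occurrence at 1.0 s lasting 0.6 s.
print(interval_iou((1.2, 0.5), (1.0, 0.6)))
```

In 2-D object detection a predicted box is matched to ground truth when its IoU exceeds a threshold; the same matching rule applies here, just over time intervals instead of rectangles.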
