Paper Title
Textual Supervision for Visually Grounded Spoken Language Understanding
Paper Authors
Abstract
Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions, but more data is needed to obtain similar results.