Paper Title


Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

Paper Authors

Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff

Paper Abstract


In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced-alignment-based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the baseline BLSTM model on reference transcripts and ASR outputs, respectively. We further improve the model's robustness to ASR errors by performing data augmentation with N-best lists, which yields an additional improvement of up to ~2-6% on ASR outputs. We also demonstrate the effectiveness of the semi-supervised learning approach through an ablation study on various corpus sizes. When trained on 1 hour of speech and text data, the proposed model achieves ~9-18% absolute improvement over the baseline model.
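The abstract contrasts forced-alignment-based fusion with attention-based multimodal fusion of acoustic and lexical representations. The following is a minimal, hedged sketch of what such a cross-attention fusion module could look like, not the authors' implementation: the class name, feature dimensions, and label set are illustrative assumptions, and only the general idea (word-level lexical queries attending over frame-level acoustic features, followed by per-word punctuation classification) follows the abstract.

```python
# Illustrative sketch of attention-based multimodal fusion for punctuation prediction.
# Names and dimensions are assumptions for demonstration, not the paper's actual model.
import torch
import torch.nn as nn


class AttentionFusionPunctuator(nn.Module):
    def __init__(self, lex_dim=768, ac_dim=80, hidden=256, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space for attention.
        self.lex_proj = nn.Linear(lex_dim, hidden)
        self.ac_proj = nn.Linear(ac_dim, hidden)
        # Cross-attention: lexical tokens (queries) attend over acoustic frames (keys/values),
        # avoiding the explicit frame-to-word forced alignment used in conventional fusion.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Classify the concatenation of lexical and attended acoustic features per word.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # e.g. {none, comma, period, question mark}
        )

    def forward(self, lex_feats, ac_feats):
        # lex_feats: (batch, num_words, lex_dim)   word-level lexical embeddings
        # ac_feats:  (batch, num_frames, ac_dim)   frame-level acoustic features
        q = self.lex_proj(lex_feats)
        kv = self.ac_proj(ac_feats)
        attended, _ = self.cross_attn(q, kv, kv)   # (batch, num_words, hidden)
        fused = torch.cat([q, attended], dim=-1)   # fuse the two modalities by concatenation
        return self.classifier(fused)              # per-word punctuation logits


if __name__ == "__main__":
    model = AttentionFusionPunctuator()
    words = torch.randn(2, 20, 768)    # e.g. embeddings from a pretrained text encoder
    frames = torch.randn(2, 500, 80)   # e.g. log-mel filterbank features
    print(model(words, frames).shape)  # torch.Size([2, 20, 4])
```

In this sketch the attention weights implicitly learn a soft word-to-frame alignment, which is the property the abstract highlights as an alternative to forced alignment.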
