Paper Title

Re-synchronization using the Hand Preceding Model for Multi-modal Fusion in Automatic Continuous Cued Speech Recognition

Authors

Li Liu, Gang Feng, Denis Beautemps, Xiao-Ping Zhang

Abstract

Cued Speech (CS) is an augmented form of lip reading complemented by hand coding, and it is very helpful to deaf people. Automatic CS recognition can facilitate communication between deaf people and others. Due to the asynchronous nature of lip and hand movements, fusing them in automatic CS recognition is a challenging problem. In this work, we propose a novel re-synchronization procedure for multi-modal fusion, which aligns the hand features with the lip features. It is realized by delaying the hand position and hand shape streams by their optimal hand preceding times, which are derived by investigating the temporal organization of hand position and hand shape movements in CS. This re-synchronization procedure is incorporated into a practical continuous CS recognition system that combines a convolutional neural network (CNN) with a multi-stream hidden Markov model (MSHMM). A significant improvement of about 4.6 percentage points is achieved, reaching 76.6% CS phoneme recognition correctness compared with 72.04% for the state-of-the-art architecture, which does not take into account the asynchrony of multi-modal fusion in CS. To our knowledge, this is the first work to tackle asynchronous multi-modal fusion in automatic continuous CS recognition.
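As a rough illustration only (not the authors' implementation), the re-synchronization idea described above can be sketched as delaying a hand feature stream relative to the lip stream by its hand preceding time, expressed here in frames. The function name, the edge-padding strategy, and the frame values are assumptions for demonstration:

```python
def resynchronize(lip_feats, hand_feats, preceding_frames):
    """Align a hand feature stream with the lip stream by delaying it
    by its hand preceding time (in frames).

    Illustrative sketch: the start is padded by repeating the first
    hand frame, and the tail is truncated so both streams keep the
    same length for frame-synchronous fusion (e.g. in an MSHMM).
    """
    delayed = [hand_feats[0]] * preceding_frames + list(hand_feats)
    return delayed[:len(lip_feats)]

# Hypothetical 1-D feature values, one per video frame.
lips = [0, 1, 2, 3, 4]
hand = [10, 11, 12, 13, 14]
print(resynchronize(lips, hand, 2))
```

In the paper's setting, separate preceding times would be applied to the hand position and hand shape streams before the streams are fused.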
