Paper title
Speaker dependent acoustic-to-articulatory inversion using real-time MRI of the vocal tract
Paper authors
Paper abstract
Acoustic-to-articulatory inversion (AAI) methods estimate articulatory movements from the acoustic speech signal, which can be useful in several tasks such as speech recognition, synthesis, talking heads and language tutoring. Most earlier inversion studies are based on point-tracking articulatory techniques (e.g. EMA or XRMB). The advantage of rtMRI is that it provides dynamic information about the full midsagittal plane of the upper airway, with a high 'relative' spatial resolution. In this work, we estimated midsagittal rtMRI images of the vocal tract for speaker-dependent AAI, using MGC-LSP spectral features as input. We applied FC-DNNs, CNNs and recurrent neural networks, and show that LSTMs are the most suitable for this task. As objective evaluation, we measured normalized MSE, the Structural Similarity Index (SSIM) and its complex wavelet version (CW-SSIM). The results indicate that the combination of FC-DNNs and LSTMs can achieve smooth generated MR images of the vocal tract, which are similar to the original MRI recordings (average CW-SSIM: 0.94).
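The SSIM metric used in the evaluation can be sketched in a few lines of numpy. This is a minimal single-window (global-statistics) variant of SSIM, not the paper's exact windowed implementation or the complex wavelet CW-SSIM; the 68x68 frame size and the noise level are hypothetical stand-ins for a generated and an original rtMRI frame.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global SSIM computed from whole-image statistics.
    Standard SSIM averages this quantity over local windows instead."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Hypothetical frames: an "original" rtMRI image and a noisy "generated" one.
rng = np.random.default_rng(0)
original = rng.random((68, 68))
generated = np.clip(original + rng.normal(0.0, 0.05, (68, 68)), 0.0, 1.0)
print(global_ssim(original, generated))
```

Identical images score exactly 1.0, and the score drops as the generated frame diverges from the original, which is why SSIM-style metrics suit frame-by-frame comparison of generated and recorded MRI video better than raw MSE.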