DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Paper title

DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Authors

Chernyak, Bronya R., Simon, Talia Ben, Segal, Yael, Steffman, Jeremy, Chodroff, Eleanor, Cole, Jennifer S., Keshet, Joseph

Abstract

Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data.
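The encoder-plus-multi-head layout described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the single convolutional layer, the random (untrained) weights, and the names `conv1d`, `predict`, and `heads` are all assumptions made for illustration. The abstract also does not say how the voicing and pitch heads refine the creak prediction, so the rule below (keep a creak frame only if it is also predicted as voiced) is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not specify any of these.
SAMPLE_RATE = 16000   # samples per second
KERNEL = 400          # 25 ms analysis window
STRIDE = 80           # 5 ms hop between frames
CHANNELS = 32         # width of the learned representation


def conv1d(x, w, stride):
    """Valid 1-D convolution with ReLU. x: (samples,), w: (channels, kernel)."""
    kernel = w.shape[1]
    n_frames = (len(x) - kernel) // stride + 1
    frames = np.stack([x[i * stride: i * stride + kernel] for i in range(n_frames)])
    return np.maximum(frames @ w.T, 0.0)  # shape (n_frames, channels)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Encoder: one convolutional layer applied directly to the raw waveform.
encoder_w = rng.normal(scale=0.05, size=(CHANNELS, KERNEL))

# Classifier: one linear head per task on top of the shared representation.
heads = {name: rng.normal(scale=0.1, size=CHANNELS)
         for name in ("creak", "voicing", "pitch")}


def predict(waveform):
    features = conv1d(waveform, encoder_w, STRIDE)
    probs = {name: sigmoid(features @ w) for name, w in heads.items()}
    # Assumed refinement rule (not from the paper): a frame counts as creaky
    # only if the voicing head also marks it as voiced.
    refined_creak = (probs["creak"] > 0.5) & (probs["voicing"] > 0.5)
    return probs, refined_creak


waveform = rng.normal(size=SAMPLE_RATE)  # one second of noise as stand-in audio
probs, creak_frames = predict(waveform)
print(creak_frames.shape)
```

Because the weights are random, the per-frame outputs are meaningless; the sketch only shows how a single shared encoder can feed several task heads, which in the described model are trained together with the encoder.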
