Combination of Time-domain, Frequency-domain, and Cepstral-domain Acoustic Features for Speech Commands Classification

Paper title

Combination of Time-domain, Frequency-domain, and Cepstral-domain Acoustic Features for Speech Commands Classification

Authors

Wang, Yikang; Nishizaki, Hiromitsu

Abstract

In speech-related classification tasks, frequency-domain acoustic features such as logarithmic Mel-filter bank coefficients (FBANK) and cepstral-domain acoustic features such as Mel-frequency cepstral coefficients (MFCC) are often used. However, time-domain features are more effective in some sound classification tasks that contain non-vocal or weakly speech-related sounds. We previously proposed a feature called bit sequence representation (BSR), a time-domain binary acoustic feature based on the raw waveform. Compared with MFCC, BSR performed better in environmental sound detection and showed comparable accuracy in limited-vocabulary speech recognition tasks. In this paper, we propose a novel improved BSR feature, called BSR-float16, that represents floating-point values more precisely. We experimentally demonstrated the complementarity among time-domain, frequency-domain, and cepstral-domain features using the Speech Commands dataset proposed by Google. Building on this complementarity, we used a simple back-end score fusion method to improve the final classification accuracy. The fusion results also showed better noise robustness.
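The abstract does not spell out how BSR-float16 encodes samples or which fusion rule is used, so the following is only a minimal sketch under stated assumptions: each waveform sample is expanded into its 16 IEEE 754 half-precision bits, and back-end score fusion is taken to be a weighted average of per-class scores from separately trained classifiers. The function names `float16_bits`, `fuse_scores`, and `predict` are hypothetical and do not come from the paper.

```python
import struct


def float16_bits(sample):
    """Return the 16 IEEE 754 half-precision bits of one sample, MSB first.

    Assumption: this bit layout stands in for a BSR-float16-style feature;
    the paper's exact encoding is not given in the abstract.
    """
    packed = struct.pack(">e", sample)      # big-endian half-precision float
    value = int.from_bytes(packed, "big")   # 16-bit unsigned integer view
    return [(value >> shift) & 1 for shift in range(15, -1, -1)]


def fuse_scores(score_lists, weights=None):
    """Weighted average of per-class scores from several classifiers.

    Assumption: equal weights by default; the paper's fusion weights are unknown.
    """
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    n_classes = len(score_lists[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_lists))
        for c in range(n_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Illustrative (made-up) posteriors from time-, frequency-, and
# cepstral-domain classifiers for a two-class problem.
time_scores = [0.6, 0.4]
freq_scores = [0.2, 0.8]
ceps_scores = [0.3, 0.7]
fused = fuse_scores([time_scores, freq_scores, ceps_scores])
label = predict(fused)
```

In this toy case the time-domain classifier alone would pick class 0, while the fused scores favor class 1, which is the kind of complementarity that score fusion is meant to exploit.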
