Paper Title


Sound Event Detection with Depthwise Separable and Dilated Convolutions

Paper Authors

Drossos, Konstantinos, Mimilakis, Stylianos I., Gharib, Shayan, Li, Yanxiong, Virtanen, Tuomas

Paper Abstract


State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, followed by recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of channels of the CNNs and the size of the weight matrices of the RNNs have a direct effect on the total number of parameters of the SED method, which typically amounts to a couple of millions. Additionally, the usually long sequences that are used as input to an SED method, along with the employment of an RNN, introduce complications like increased training time, difficulty in gradient flow, and impeded parallelization of the SED method. To tackle all these problems, we propose replacing the CNNs with depthwise separable convolutions and the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on an SED task, achieving a reduction of the number of parameters by 85% and of the average training time per epoch by 78%, together with an increase of the average frame-wise F1 score by 4.6% and a reduction of the average error rate by 3.8%.
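The core idea of the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name, shapes, and kernel sizes below are chosen only for the example. A depthwise separable convolution first filters each input channel with its own kernel (depthwise step) and then mixes channels with a 1x1 pointwise convolution; a dilation rate spaces out the kernel taps to enlarge the temporal receptive field, which is what the paper uses in place of an RNN:

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_k, point_w, dilation=1):
    """Depthwise separable 1-D convolution with 'valid' padding.

    x:       (channels, time) input feature map
    depth_k: (channels, k) one kernel per input channel (depthwise step)
    point_w: (out_channels, channels) 1x1 mixing weights (pointwise step)
    """
    c, t = x.shape
    _, k = depth_k.shape
    span = (k - 1) * dilation          # temporal extent covered by the kernel
    out_t = t - span
    # Depthwise step: each channel is convolved only with its own kernel.
    dw = np.zeros((c, out_t))
    for ch in range(c):
        for i in range(out_t):
            taps = x[ch, i : i + span + 1 : dilation]  # dilated sampling
            dw[ch, i] = np.dot(taps, depth_k[ch])
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return point_w @ dw

# Toy example: 2 channels, 8 time steps, kernel size 3, dilation 2.
x = np.arange(16, dtype=float).reshape(2, 8)
y = depthwise_separable_conv1d(x, np.ones((2, 3)), np.eye(2), dilation=2)
# y has shape (2, 4): (k - 1) * dilation = 4 time steps are trimmed.
```

The parameter saving comes from the factorization: a standard convolution with 128 input channels, 128 output channels, and kernel size 5 needs 128 * 128 * 5 = 81,920 weights, while the separable version needs only 128 * 5 + 128 * 128 = 17,024, roughly a 79% reduction, consistent in spirit with the 85% overall reduction reported in the abstract.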
