Paper title
BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers
Paper authors
Paper abstract
We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but uses an incremental Transformer encoder that attends only to its left context and applies block-level recurrence in the hidden states to carry information from block to block, making the algorithm's complexity linear in time. We propose two variants. For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation relative to offline EDA-EEND for up to two speakers when using a context size of 10 seconds. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with an unlimited context size, and shows comparable accuracy with a context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block by block as audio arrives, we show accuracy comparable to the offline clustering-based system.
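The block-level recurrence described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, a single layer, and no learned projections, and it caches each block's output states as the carried-over context. The names `blockwise_attention`, `block_size`, and `context_size` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)


def blockwise_attention(frames, block_size, context_size):
    """Process frames block by block, attending only to left context.

    Simplifying assumptions (not taken from the paper): one head, one
    layer, no learned query/key/value projections.
    Returns the output states and the largest number of keys any block
    attended to.
    """
    num_frames, dim = frames.shape
    cache = np.zeros((0, dim))  # states carried over from earlier blocks
    outputs = []
    max_keys = 0
    for start in range(0, num_frames, block_size):
        block = frames[start:start + block_size]
        # Keys and values: cached left context followed by the current block.
        keys = np.concatenate([cache, block], axis=0)
        max_keys = max(max_keys, keys.shape[0])
        weights = softmax(block @ keys.T / np.sqrt(dim))
        hidden = weights @ keys
        outputs.append(hidden)
        # Keep only the most recent `context_size` states for the next block.
        merged = np.concatenate([cache, hidden], axis=0)
        cache = merged[-context_size:] if context_size > 0 else merged[:0]
    return np.concatenate(outputs, axis=0), max_keys
```

In this sketch each block attends to at most `context_size + block_size` frames, so the total cost grows linearly with the number of blocks rather than quadratically with the recording length. Because a block never sees later blocks, its output can be emitted as soon as the block is complete, which corresponds to the block-by-block output of the limited-latency variant.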