Paper: https://ieeexplore.ieee.org/abstract/document/9414265
Conference: ICASSP 2021

Existing generative adversarial networks for speech enhancement rely solely on convolution operations, which may obscure temporal dependencies in the input sequence. To address this, the paper proposes a self-attention layer adapted from non-local attention and couples it with the convolutional and deconvolutional layers of a time-domain speech-enhancement GAN. Experiments show that introducing self-attention into SEGAN yields consistent improvements in the objective evaluation metrics.
Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, we propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) using raw signal input. Further, we empirically study the effect of placing the self-attention layer at the (de)convolutional layers with varying layer indices as well as at all of them when memory allows. Our experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance. Furthermore, applying at different (de)convolutional layers does not significantly alter performance, suggesting that it can be conveniently applied at the highest-level (de)convolutional layer with the smallest memory overhead.
SEGAN is a network that performs speech enhancement in the time domain, but its backbone is still a convolutional neural network. Because of the convolution operator's local receptive field, this reliance on convolution limits SEGAN's ability to capture long-range dependencies across the input sequence. Temporal-dependency modeling is an integral part of speech modeling, yet in SEGAN it has remained largely unexplored.
This reliance on the convolution operator limits SEGAN's capability in capturing long-range dependencies across an input sequence due to the convolution operator's local receptive field.
Temporal dependency modeling is, in general, an integral part of a speech modeling system [17, 18], including speech enhancement when input is a long segment of signal with a rich underlying structure. However, it has mostly remained uncharted in SEGAN systems.
On the one hand, self-attention has been successfully used for sequential modeling in a variety of speech tasks. On the other hand, it is more flexible in modeling both long-range and local dependencies, and is more computationally efficient than RNNs, especially when applied to long sequences.
On the one hand, self-attention has been successfully used for sequential modeling in different speech modeling tasks. On the other hand, it is more flexible in modeling both long-range and local dependencies and is more efficient than RNN in terms of computational cost, especially when applied to long sequences.
The authors therefore propose a self-attention layer following the principle of non-local attention and couple it with the (de)convolutional layers of SEGAN to construct a self-attention SEGAN (SASEGAN for short).
We, therefore, propose a self-attention layer following the principle of non-local attention [21, 22] and couple it with the (de)convolutional layers of a SEGAN to construct a self-attention SEGAN (SASEGAN for short).
Let the noisy speech signal be $\tilde{x} = x + n \in \mathbb{R}^T$, where $x \in \mathbb{R}^T$ is the clean speech signal and $n \in \mathbb{R}^T$ is the background noise. The goal is to learn the mapping $f(\tilde{x}): \tilde{x} \to x$. In SEGAN, a generator $G$ learns this mapping, $\hat{x} = G(z, \tilde{x})$, while a discriminator $D$ learns to distinguish the real pair $(x, \tilde{x})$ from the fake pair $(\hat{x}, \tilde{x})$. The training procedure is illustrated in Fig. 1.
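The adversarial objectives are not spelled out here; as a minimal sketch, assuming the least-squares GAN losses with an L1 regularization term from the original SEGAN formulation (where the L1 weight `lam` is 100), the two losses can be written as:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Least-squares discriminator loss (as in the original SEGAN):
    # push D(x, x~) toward 1 and D(G(z, x~), x~) toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake, x_hat, x, lam=100.0):
    # Adversarial term plus an L1 term pulling the enhanced signal
    # x_hat toward the clean reference x (lam = 100 in the SEGAN paper).
    return 0.5 * np.mean((d_fake - 1.0) ** 2) + lam * np.mean(np.abs(x_hat - x))
```

The L1 term keeps the enhanced waveform close to the clean target, while the least-squares adversarial term avoids the vanishing-gradient issues of the standard cross-entropy GAN loss.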
Self-attention layer: given a feature map $F \in \mathbb{R}^{L \times C}$, where $L$ is the time dimension and $C$ is the number of channels, linear projections of $F$ produce the query matrix $Q$, the key matrix $K$, and the value matrix $V$. Each $a_{ij} \in A$ measures how much the model attends to the $j$-th column $v_j$ of $V$ when producing the $i$-th output $o_i$ of $O$. The final output is

$$A = \mathrm{softmax}\!\left(QK^{\top}\right), \qquad O = AV, \qquad Y = \beta O + F,$$

where $\beta$ is a learnable parameter. The process is illustrated in Fig. 2.
The experiments mainly investigate two points:
1. the effect of the self-attention layer on speech-enhancement performance;
2. the effect of placing the self-attention layer at different positions in the generator and discriminator.
Baseline: SEGAN. The self-attention layer is placed at different positions, with all experiments run on the Voice Bank corpus.
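The non-local attention step described above can be sketched in NumPy. For simplicity, square projection matrices `Wq`, `Wk`, `Wv` and the scalar `beta` stand in for learned parameters (the paper's layer may use channel-reduced projections; this is an illustrative simplification):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv, beta=0.0):
    """Non-local self-attention over a feature map F of shape (L, C).

    Wq, Wk, Wv are stand-ins for learned (C, C) projections; beta is the
    learnable residual scale, typically initialized to 0 so the layer
    starts out as an identity mapping.
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv   # queries, keys, values: (L, C)
    A = softmax(Q @ K.T)               # (L, L); A[i, j] = attention on v_j for output o_i
    O = A @ V                          # each output attends over all L time steps
    return beta * O + F                # residual connection scaled by beta
```

Because every row of $A$ spans all $L$ time steps, each output position can draw on the entire sequence, which is exactly the long-range modeling the convolutional layers lack.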
The models are implemented in TensorFlow and trained for 100 epochs with a minibatch size of 50. During training, raw speech segments (each 16,384 samples long) are sampled from the training utterances with 50% overlap, and a pre-emphasis coefficient of 0.95 is applied.
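The segment sampling and pre-emphasis just described can be sketched as follows (function names are illustrative, not from the paper's code):

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    # High-pass pre-emphasis filter: y[t] = x[t] - coeff * x[t-1];
    # the first sample passes through unchanged.
    return np.concatenate(([x[0]], x[1:] - coeff * x[:-1]))

def sample_segments(x, seg_len=16384, overlap=0.5):
    # Slice a training utterance into fixed-length raw-waveform segments
    # with the given fractional overlap (hop = seg_len * (1 - overlap)).
    hop = int(seg_len * (1.0 - overlap))
    return [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, hop)]
```

With 50% overlap, a 32,768-sample utterance yields three 16,384-sample segments starting at offsets 0, 8,192, and 16,384.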
SASEGAN-$l$ denotes the variant with a self-attention layer at the $l$-th convolutional or deconvolutional layer.
As shown in Fig. 4, placing self-attention at different layers reveals no clear differences among the SASEGAN-$l$ variants, indicating that self-attention works as well at high-level layers as at low-level ones.

In summary, the paper proposes a self-attention layer and couples it with SEGAN to improve its temporal-dependency modeling for speech enhancement. Given sufficient memory, the proposed self-attention layer can be applied at different (de)convolutional layers of the SEGAN generator and discriminator, or even at all of them. Experiments show that self-attention SEGAN outperforms the SEGAN baseline on all objective evaluation metrics, and the improvements are consistent across self-attention placement settings, with no significant differences in performance gain between them. The results indicate that self-attention can be fully exploited at the highest-level (de)convolutional layer, where the induced memory overhead is very small, and that it can easily be applied to existing SEGAN variants for potential improvement.