Paper: https://ieeexplore.ieee.org/abstract/document/9414265
Conference: ICASSP 2021

Existing generative adversarial networks for speech enhancement rely solely on convolution operations, which may obscure temporal dependencies in the input sequence. To address this, the paper proposes a self-attention layer adapted from non-local attention and couples it with the convolutional and deconvolutional layers of a time-domain speech-enhancement GAN. Experiments show that introducing self-attention into SEGAN yields consistent improvements in the objective evaluation metrics.
Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, we propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) using raw signal input. Further, we empirically study the effect of placing the self-attention layer at the (de)convolutional layers with varying layer indices as well as at all of them when memory allows. Our experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance. Furthermore, applying at different (de)convolutional layers does not significantly alter performance, suggesting that it can be conveniently applied at the highest-level (de)convolutional layer with the smallest memory overhead.
SEGAN is a network that performs speech enhancement in the time domain, but its backbone is still a convolutional neural network. Because of the convolution operator's local receptive field, this reliance on convolution limits SEGAN's ability to capture long-range dependencies across the input sequence. Temporal-dependency modeling is an integral part of speech modeling, yet in SEGAN it has remained largely unexplored.
This reliance on the convolution operator limits SEGAN's capability in capturing long-range dependencies across an input sequence due to the convolution operator's local receptive field.
Temporal dependency modeling is, in general, an integral part of a speech modeling system [17, 18], including speech enhancement when input is a long segment of signal with a rich underlying structure. However, it has mostly remained uncharted in SEGAN systems.
On the one hand, self-attention has been successfully used for sequential modeling in a variety of speech tasks. On the other hand, it is more flexible in modeling both long-range and local dependencies, and is more computationally efficient than RNNs, especially when applied to long sequences.
On the one hand, self-attention has been successfully used for sequential modeling in different speech modeling tasks. On the other hand, it is more flexible in modeling both long-range and local dependencies and is more efficient than RNN in terms of computational cost, especially when applied to long sequences.
The authors therefore propose a self-attention layer following the principle of non-local attention and couple it with the (de)convolutional layers of SEGAN to construct a self-attention SEGAN (SASEGAN for short).
We, therefore, propose a self-attention layer following the principle of non-local attention [21, 22] and couple it with the (de)convolutional layers of a SEGAN to construct a self-attention SEGAN (SASEGAN for short).
Let the noisy speech signal be $\tilde{x} = x + n \in \mathbb{R}^T$, where $x \in \mathbb{R}^T$ is the clean speech signal and $n \in \mathbb{R}^T$ is the background noise. The goal is to learn the mapping $f(\tilde{x}): \tilde{x} \to x$. In SEGAN, a generator $G$ learns this mapping, $\hat{x} = G(z, \tilde{x})$, while a discriminator $D$ learns to distinguish the real pair $(x, \tilde{x})$ from the fake pair $(\hat{x}, \tilde{x})$. The training procedure is illustrated in Fig. 1.
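The adversarial objectives are not spelled out here; as a minimal sketch, assuming the least-squares GAN losses with an L1 regularization term from the original SEGAN formulation (where the L1 weight `lam` is 100), the two losses can be written as:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Least-squares discriminator loss (as in the original SEGAN):
    # push D(x, x~) toward 1 and D(G(z, x~), x~) toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake, x_hat, x, lam=100.0):
    # Adversarial term plus an L1 term pulling the enhanced signal
    # x_hat toward the clean reference x (lam = 100 in the SEGAN paper).
    return 0.5 * np.mean((d_fake - 1.0) ** 2) + lam * np.mean(np.abs(x_hat - x))
```

The L1 term keeps the enhanced waveform close to the clean target, while the least-squares adversarial term avoids the vanishing-gradient issues of the standard cross-entropy GAN loss.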
Self-attention layer: given a feature map $F \in \mathbb{R}^{L \times C}$, where $L$ is the time dimension and $C$ is the number of channels, linear projections of $F$ produce the query matrix $Q$, the key matrix $K$, and the value matrix $V$. Each $a_{ij} \in A$ measures how much the model attends to the $j$-th column $v_j$ of $V$ when producing the $i$-th output $o_i$ of $O$. The final output is

$$A = \mathrm{softmax}\!\left(QK^{\top}\right), \qquad O = AV, \qquad Y = \beta O + F,$$

where $\beta$ is a learnable parameter. The process is illustrated in Fig. 2.
The experiments mainly investigate two points:
1. the effect of the self-attention layer on speech-enhancement performance;
2. the effect of placing the self-attention layer at different positions in the generator and discriminator.
Baseline: SEGAN. The self-attention layer is placed at different positions, with all experiments run on the Voice Bank corpus.
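The non-local attention step described above can be sketched in NumPy. For simplicity, square projection matrices `Wq`, `Wk`, `Wv` and the scalar `beta` stand in for learned parameters (the paper's layer may use channel-reduced projections; this is an illustrative simplification):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv, beta=0.0):
    """Non-local self-attention over a feature map F of shape (L, C).

    Wq, Wk, Wv are stand-ins for learned (C, C) projections; beta is the
    learnable residual scale, typically initialized to 0 so the layer
    starts out as an identity mapping.
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv   # queries, keys, values: (L, C)
    A = softmax(Q @ K.T)               # (L, L); A[i, j] = attention on v_j for output o_i
    O = A @ V                          # each output attends over all L time steps
    return beta * O + F                # residual connection scaled by beta
```

Because every row of $A$ spans all $L$ time steps, each output position can draw on the entire sequence, which is exactly the long-range modeling the convolutional layers lack.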
The models are implemented in TensorFlow and trained for 100 epochs with a minibatch size of 50. During training, raw speech segments (each 16,384 samples long) are sampled from the training utterances with 50% overlap, and a pre-emphasis coefficient of 0.95 is applied.
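The segment sampling and pre-emphasis just described can be sketched as follows (function names are illustrative, not from the paper's code):

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    # High-pass pre-emphasis filter: y[t] = x[t] - coeff * x[t-1];
    # the first sample passes through unchanged.
    return np.concatenate(([x[0]], x[1:] - coeff * x[:-1]))

def sample_segments(x, seg_len=16384, overlap=0.5):
    # Slice a training utterance into fixed-length raw-waveform segments
    # with the given fractional overlap (hop = seg_len * (1 - overlap)).
    hop = int(seg_len * (1.0 - overlap))
    return [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, hop)]
```

With 50% overlap, a 32,768-sample utterance yields three 16,384-sample segments starting at offsets 0, 8,192, and 16,384.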
SASEGAN-$l$ denotes the variant with a self-attention layer at the $l$-th convolutional or deconvolutional layer.
As shown in Fig. 4, placing self-attention at different layers reveals no clear differences among the SASEGAN-$l$ variants, indicating that self-attention works as well at high-level layers as at low-level ones.

In summary, the paper proposes a self-attention layer and couples it with SEGAN to improve its temporal-dependency modeling for speech enhancement. Given sufficient memory, the proposed self-attention layer can be applied at different (de)convolutional layers of the SEGAN generator and discriminator, or even at all of them. Experiments show that self-attention SEGAN outperforms the SEGAN baseline on all objective evaluation metrics, and the improvements are consistent across self-attention placement settings, with no significant differences in performance gain between them. The results indicate that self-attention can be fully exploited at the highest-level (de)convolutional layer, where the induced memory overhead is very small, and that it can easily be applied to existing SEGAN variants for potential improvement.