The key idea of this paper is to speed up waveform generation from a spectrogram. The authors use a GAN with a fully convolutional (CNN-only) architecture that is well suited to GPUs. Both the generator and the discriminator are very lightweight compared to previous SOTA approaches like WaveNet.
In the generator we reduce the channel dimensionality layer by layer (while upsampling in time), and to prevent vanishing gradients we use a residual stack. In the discriminator we instead increase dimensionality layer by layer. We also keep the features from every layer of the discriminator (they are used for feature matching below). Finally, we use 3 discriminators instead of 1: each next discriminator gets the input downsampled by 2 with average pooling. A sketch of this layout follows.
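A minimal PyTorch sketch of this architecture, assuming placeholder channel counts, kernel sizes and upsampling factors (they are my own choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """A few dilated convolutions with skip connections to ease gradient flow."""
    def __init__(self, channels):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=3 ** i, padding=3 ** i),
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=1),
            )
            for i in range(3)
        ])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)   # residual connection
        return x

class Generator(nn.Module):
    """Mel-spectrogram -> waveform: upsample in time, shrink channels layer by layer."""
    def __init__(self, mel_channels=80):
        super().__init__()
        layers = [nn.Conv1d(mel_channels, 512, kernel_size=7, padding=3)]
        in_ch = 512
        for factor in (8, 8, 2, 2):           # total temporal upsampling = product of factors
            out_ch = in_ch // 2               # channel dimensionality halves at each stage
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(in_ch, out_ch, kernel_size=factor * 2,
                                   stride=factor, padding=factor // 2),
                ResidualStack(out_ch),
            ]
            in_ch = out_ch
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(in_ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        return self.net(mel)

class Discriminator(nn.Module):
    """Strided convolutions: channels grow as the temporal resolution shrinks."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, kernel_size=15, padding=7), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=16), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(256, 512, kernel_size=5, padding=2), nn.LeakyReLU(0.2)),
            nn.Conv1d(512, 1, kernel_size=3, padding=1),
        ])

    def forward(self, x):
        features = []              # keep every layer's output for feature matching
        for layer in self.layers:
            x = layer(x)
            features.append(x)
        return features            # last element is the score map

class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on x, avg-pooled x/2 and x/4."""
    def __init__(self):
        super().__init__()
        self.discriminators = nn.ModuleList([Discriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        outputs = []
        for d in self.discriminators:
            outputs.append(d(x))
            x = self.pool(x)       # each next discriminator sees 2x downsampled audio
        return outputs
```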
In my first experiments I didn't add weight normalization; this led to unstable losses and the generated results weren't good. So we apply weight normalization to every convolutional layer (all layers are convolutional).
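As a sketch, this can be done by wrapping every convolution with `torch.nn.utils.weight_norm` (the helper names below are mine). Using these helpers in place of the plain `nn.Conv1d` / `nn.ConvTranspose1d` calls in the sketch above makes sure no un-normalized convolution slips in:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

def wn_conv1d(*args, **kwargs):
    """Conv1d with weight normalization applied to its weight."""
    return weight_norm(nn.Conv1d(*args, **kwargs))

def wn_conv_transpose1d(*args, **kwargs):
    """ConvTranspose1d with weight normalization applied to its weight."""
    return weight_norm(nn.ConvTranspose1d(*args, **kwargs))
```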
The basic adversarial loss follows the LSGAN idea: the main difference from the vanilla GAN loss is that we don't apply a sigmoid to the discriminator output. The paper writes it in hinge form:
$$
\begin{array}{l}
\min_{D_k} \mathbb{E}_{x}\left[\min\left(0,\, 1 - D_k(x)\right)\right] + \mathbb{E}_{s, z}\left[\min\left(0,\, 1 + D_k(G(s, z))\right)\right], \quad \forall k = 1, 2, 3 \\
\min_{G} \mathbb{E}_{s, z}\left[\sum_{k=1,2,3} -D_k(G(s, z))\right]
\end{array}
$$
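A minimal implementation sketch of these objectives (function and variable names are mine). Each discriminator's output is assumed to be the list of per-layer features from the sketch above, with the score map last; the clamped terms are realized with ReLU, as hinge losses are usually implemented:

```python
import torch

def discriminator_loss(d_outs_real, d_outs_fake):
    """Hinge-style loss summed over the 3 discriminators."""
    loss = 0.0
    for real, fake in zip(d_outs_real, d_outs_fake):
        score_real, score_fake = real[-1], fake[-1]
        loss = loss + torch.relu(1.0 - score_real).mean()  # push real scores above 1
        loss = loss + torch.relu(1.0 + score_fake).mean()  # push fake scores below -1
    return loss

def generator_adv_loss(d_outs_fake):
    """Generator tries to raise the fake scores of every discriminator."""
    return sum(-fake[-1].mean() for fake in d_outs_fake)
```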
Also, to improve generator convergence, we add a feature-matching loss: the L1 distance between discriminator features of real and generated audio.
$$
\mathcal{L}_{\mathrm{FM}}\left(G, D_k\right) = \mathbb{E}_{x, s \sim p_{\text{data}}}\left[\sum_{i=1}^{T} \frac{1}{N_i}\left\|D_k^{(i)}(x) - D_k^{(i)}(G(s))\right\|_1\right]
$$
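A sketch of this feature-matching term (again with my own names), reusing the per-layer features returned by the discriminators above; the $1/N_i$ factor is absorbed by the elementwise mean inside `l1_loss`:

```python
import torch.nn.functional as F

def feature_matching_loss(d_outs_real, d_outs_fake):
    """L1 distance between real and generated features, summed over layers and discriminators."""
    loss = 0.0
    for real, fake in zip(d_outs_real, d_outs_fake):            # over the 3 discriminators
        for feat_real, feat_fake in zip(real[:-1], fake[:-1]):  # over intermediate layers
            loss = loss + F.l1_loss(feat_fake, feat_real.detach())  # real features act as targets
    return loss
```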