As speech audio consists of sinusoidal signals with various periods, they demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality
shows a significant improvement in terms of synthesis speed.
MOS
characteristic of speech audio that consists of patterns with various periods and applied it to neural networks, and verified that the existence of the proposed discriminator greatly influences the quality of speech synthesis through the ablation study
generalize to the mel-Spectrogram inversion of unseen speakers and synthesize speech audio comparable to human quality from noisy inputs in an end-to-end setting
progress towards on-device natural speech synthesis, which requires low latency and memory footprint
generators of various configurations can be trained with the same discriminators and learning mechanism
possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators