learns speech representations from CPC, Wav2Vec2.0, and HuBERT for synthesizing speech
task of learning the acoustic and linguistic characteristics of a language from raw audio
et of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation
set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units)
generative language model (trained on pseudo-text)
speech decoder (generating a waveform from pseudo-text)
trained without supervision
number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems