Generative Spoken Language Modeling
- Generative Spoken Language Modeling from Raw Audio
- learns speech representations with CPC, Wav2Vec 2.0, and HuBERT and uses them for synthesizing speech
- task of learning the acoustic and linguistic characteristics of a language from raw audio
- set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels, for both encoding and generation
- sets up baseline systems consisting of a discrete speech encoder (returning pseudo-text units; see the encoder sketch after this list)
- a generative language model (trained on the pseudo-text; see the unit-LM sketch after this list)
- a speech decoder (generating a waveform from pseudo-text)
- all three components are trained without text supervision
- the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and some combinations approach the performance of text-based systems
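
A minimal sketch of the discrete-encoder stage: frame-level features from a pretrained HuBERT model are clustered with k-means into a small unit inventory (the 50/100/200 choice above), and repeated cluster ids are collapsed to form the pseudo-text. The torchaudio bundle, layer index, file paths, and cluster count here are illustrative assumptions; the paper's own pipeline uses fairseq checkpoints and its released quantizers.

```python
# Sketch of: waveform -> continuous features -> k-means units -> pseudo-text.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def frame_features(wav_path):
    # Load a waveform, downmix to mono, resample to the model's rate,
    # and return frame-level representations from an intermediate layer.
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = model.extract_features(wav)
    return feats[6].squeeze(0)  # layer choice is an assumption

# Fit the quantizer on features pooled from a training set; n_clusters is
# the unit-inventory size (50, 100, or 200) studied in the paper.
train_feats = torch.cat([frame_features(p) for p in ["a.wav", "b.wav"]])  # hypothetical paths
kmeans = KMeans(n_clusters=100, n_init=10).fit(train_feats.numpy())

def to_pseudo_text(wav_path):
    # Assign each frame to its nearest cluster and collapse repeats,
    # yielding the pseudo-text unit sequence fed to the language model.
    units = kmeans.predict(frame_features(wav_path).numpy())
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```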
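
And a toy causal unit language model over the resulting pseudo-text vocabulary, sketched with a small PyTorch Transformer. The baseline in the paper trains a much larger transformer LM, so the model sizes, names, and training snippet below are placeholders for illustration only.

```python
# Next-unit prediction over pseudo-text, the "generative language model" stage.
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, num_units=100, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, num_units)

    def forward(self, units):
        # Causal mask so each position only attends to earlier units.
        T = units.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(units), mask=mask)
        return self.head(h)

# One training step on pseudo-text: predict the next unit at every position.
lm = UnitLM()
optim = torch.optim.Adam(lm.parameters(), lr=3e-4)
batch = torch.randint(0, 100, (8, 64))  # placeholder unit sequences
logits = lm(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), batch[:, 1:].reshape(-1))
loss.backward()
optim.step()
```

A continuation sampled from such a unit LM would then be handed to the speech decoder (a unit-conditioned synthesizer in the paper) to produce a waveform.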