Generative Spoken Language Modeling

  • Generative Spoken Language Modeling from Raw Audio
  • learns speech representations from CPC, wav2vec 2.0, and HuBERT encoders for synthesizing speech
  • task of learning the acoustic and linguistic characteristics of a language from raw audio
  • set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation
  • set up baseline systems consisting of three components:
  • a discrete speech encoder (returning pseudo-text units)
  • a generative language model (trained on pseudo-text)
  • a speech decoder (generating a waveform from pseudo-text)
  • all components trained without supervision
  • the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way; some combinations approach text-based systems
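
The encoder, language model, and decoder stages listed above can be pictured with a toy sketch. Everything here is an illustrative assumption rather than the paper's implementation: hand-written centroids stand in for a learned quantizer over CPC, wav2vec 2.0, or HuBERT features, a bigram count model stands in for the generative language model, and the "decoder" only returns centroid frames instead of a waveform. The function names (quantize, collapse_repeats, train_bigram_lm, sample_units, decode_units) are hypothetical, and collapsing repeated units is an assumed preprocessing step.

```python
import random
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Encoder stand-in: map each feature frame to its nearest centroid index.

    The resulting indices play the role of pseudo-text units.
    """
    units = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units):
    """Merge consecutive duplicate units (assumed preprocessing step)."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def train_bigram_lm(sequences):
    """Language-model stand-in: count unit-to-unit transitions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_units(lm, start, length, seed=0):
    """Sample a unit sequence from the bigram counts, stopping at dead ends."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = lm.get(out[-1])
        if not followers:
            break
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights)[0])
    return out


def decode_units(units, centroids):
    """Decoder stand-in: return one centroid frame per unit (not a waveform)."""
    return [centroids[u] for u in units]


# Toy data: 3 "units" instead of the 50, 100, or 200 mentioned above.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.1), (4.8, 5.2), (5.1, 4.9)]

units = collapse_repeats(quantize(frames, centroids))
lm = train_bigram_lm([units, units])
generated = sample_units(lm, start=units[0], length=5)
resynthesized = decode_units(generated, centroids)
```

The size of the centroids list corresponds to the number of discrete units, so changing it is the toy analogue of comparing 50, 100, and 200 units.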