pGSLM
- Text-Free Prosody-Aware Generative Spoken Language Modeling
- goal: generate novel speech, similar to how GPT-2 can generate coherent paragraphs
- builds upon GSLM (Generative Spoken Language Modeling)
- GSLM addresses the generative aspects of speech pre-training
- GSLM replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences
- the units used in GSLM discard most of the prosodic information
- as a result, GSLM fails to leverage prosody for better comprehension and does not generate expressive speech
- prosody-aware generative spoken language model (pGSLM)
- composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model that converts MS-TLM outputs to waveforms
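
To make the "unit and prosodic feature streams" idea concrete, here is a minimal sketch of how frame-level inputs might be turned into aligned streams. It assumes the prosodic features are per-unit duration and mean F0, and that unvoiced frames are marked with F0 = 0; the names `Step` and `to_streams` are hypothetical and are not taken from the pGSLM code. It only illustrates the data layout, not the MS-TLM or the HiFi-GAN vocoder.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Sequence


@dataclass
class Step:
    """One time step of the multi-stream input (hypothetical layout)."""
    unit: int             # discovered unit id
    duration: int         # number of consecutive frames with this unit
    f0: Optional[float]   # mean F0 over voiced frames, None if all unvoiced


def to_streams(frame_units: Sequence[int], frame_f0: Sequence[float]) -> List[Step]:
    """Collapse repeated frame-level units and attach prosodic features.

    Assumption: a frame with F0 <= 0 is unvoiced and is excluded from the mean.
    """
    if len(frame_units) != len(frame_f0):
        raise ValueError("unit and F0 sequences must have the same length")

    steps: List[Step] = []
    pos = 0
    for unit, run in groupby(frame_units):
        n = len(list(run))  # run length = duration in frames
        voiced = [f for f in frame_f0[pos:pos + n] if f > 0]
        mean_f0 = sum(voiced) / len(voiced) if voiced else None
        steps.append(Step(unit=unit, duration=n, f0=mean_f0))
        pos += n
    return steps


if __name__ == "__main__":
    units = [5, 5, 5, 9, 9, 2]
    f0 = [100.0, 110.0, 0.0, 0.0, 0.0, 200.0]
    for step in to_streams(units, f0):
        print(step)
```

Each `Step` carries one entry per stream, so a language model over this sequence would see units and prosody side by side, and a vocoder would need all three values to reconstruct timing and pitch.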