The model maps the input audio into a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. Trained on large corpora of raw audio waveforms, it learns to generate natural and coherent continuations given short prompts. The approach extends beyond speech: it generates coherent piano music continuations despite being trained without any symbolic representation of music.
Audio synthesis involves multiple scales of abstraction, which makes it challenging to achieve high audio quality while maintaining long-term consistency.
The model addresses this by combining recent advances in neural audio compression, self-supervised representation learning, and language modeling.
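The core idea of treating audio generation as language modeling over discrete tokens can be illustrated with a toy sketch. This is not the actual model: a real system uses a learned neural codec for tokenization and a Transformer for sequence modeling, whereas here uniform quantization and a bigram count model stand in for both, purely to show the token-then-predict pipeline.

```python
import math
import random
from collections import defaultdict

N_TOKENS = 16  # size of the discrete "codebook" (an assumption for this sketch)

def tokenize(waveform):
    """Map samples in [-1, 1] to integer tokens via uniform quantization."""
    return [min(N_TOKENS - 1, int((x + 1) / 2 * N_TOKENS)) for x in waveform]

def train_bigram(tokens):
    """Count next-token frequencies: a minimal 'language model' over audio tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def continue_prompt(counts, prompt_tokens, n_steps, seed=0):
    """Generate a continuation by sampling from the learned bigram distribution."""
    rng = random.Random(seed)
    out = list(prompt_tokens)
    for _ in range(n_steps):
        nxt = counts.get(out[-1])
        if not nxt:  # unseen context: fall back to repeating the last token
            out.append(out[-1])
            continue
        choices, weights = zip(*nxt.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

# "Train" on a raw waveform: a sine wave standing in for real audio.
waveform = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
tokens = tokenize(waveform)
model = train_bigram(tokens)

# Prompt with a short token prefix and generate a continuation.
continuation = continue_prompt(model, tokens[:50], n_steps=100)
print(len(continuation))  # 150 tokens: 50 prompt + 100 generated
```

The sketch preserves the key structural point: once audio is discretized, generation reduces to next-token prediction conditioned on a prompt, and audio quality is bounded by how faithfully the tokenizer captures the signal.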