AudioLM

  • maps the input audio into a sequence of discrete tokens and casts audio generation as language modeling task in this representation space
  • training on large corpora of raw
  • audio waveforms
  • learns to generate natural and coherent continuations given short prompts
  • extended beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music
  • When it comes to audio synthesis, multiple scales make achieving high audio quality while displaying consistency very challenging
  • This gets achieved by this model by combining recent advances in neural audio compression, self-supervised representation learning and language modelling.