Jukebox

  • generates music with singing in the raw audio domain
  • earlier text-to-music models generated music symbolically, in the form of a piano roll that specifies the timing, pitch and velocity of each note.
  • The challenge lies in the non-symbolic approach, which tries to produce music directly as a piece of audio
  • the space of raw audio is extremely high-dimensional, which makes the problem very challenging
  • the key issue is that modelling raw audio involves very long-range dependencies, making it computationally challenging to learn the high-level semantics of music.
  • uses a hierarchical VQ-VAE architecture to compress audio into a discrete space [14], with a loss function designed to retain as much information as possible.
  • This model produces songs from very different genres such as rock, hip-hop and jazz.
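
To make the dimensionality and "discrete space" points above concrete, here is a minimal NumPy sketch. It is not the Jukebox implementation: it assumes a CD-quality sample rate of 44.1 kHz to count raw audio timesteps, and it shows only the nearest-neighbour codebook lookup at the core of vector quantization. The encoder, decoder, hierarchy and loss are omitted. The names num_samples and quantize, and the toy 2-D codebook, are hypothetical and chosen for illustration only.

```python
import numpy as np

SAMPLE_RATE_HZ = 44_100  # assumption: CD-quality audio


def num_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of raw audio timesteps in a clip of the given duration."""
    return int(duration_s * sample_rate_hz)


def quantize(latents, codebook):
    """Replace each latent vector by its nearest codebook vector.

    latents:  array of shape (T, D), e.g. encoder outputs over time
    codebook: array of shape (K, D), the K discrete code vectors
    Returns (codes, quantized): integer indices of shape (T,) and
    the selected codebook vectors of shape (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]


# A 4-minute song is already a sequence of over 10 million samples.
print(num_samples(4 * 60))  # 10584000

# Toy codebook with K=4 codes in D=2 dimensions (illustrative values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 0.8], [0.05, -0.02]])

codes, quantized = quantize(latents, codebook)
print(codes.tolist())  # [1, 2, 0]
```

After quantization, the audio is represented by the short integer sequence in codes rather than by millions of continuous samples, which is what makes the long-range structure more tractable to model.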