AudioSet

  • 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos, covering an ontology of 632 audio event classes
  • The event classes cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds
  • Self-supervised learning from the consistency between video and audio