AudioSet
- 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos, covering an ontology of 632 audio event classes
- The event classes cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds
- Self-supervised learning from the consistency between video and audio
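
As a rough illustration of what a human-labeled 10-second clip might look like as a record, the sketch below parses a small, made-up segment list. The column layout (clip ID, start and end time in seconds, a quoted list of class IDs), the clip IDs, the class IDs, and the `parse_segments` helper are all assumptions for illustration and are not taken from these notes.

```python
import csv
import io

# Hypothetical sample in an assumed segment-list layout:
# clip ID, start time (s), end time (s), quoted comma-separated class IDs.
SAMPLE = '''# YTID, start_seconds, end_seconds, positive_labels
clipA, 30.000, 40.000, "/m/09x0r,/m/04rlf"
clipB, 0.000, 10.000, "/m/0bt9lr"
'''

def parse_segments(text):
    """Return one dict per labeled clip, skipping '#' comment lines."""
    rows = (line for line in io.StringIO(text) if not line.startswith("#"))
    segments = []
    # skipinitialspace=True lets the quoted label field follow ", "
    for row in csv.reader(rows, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        clip_id, start, end, labels = row
        segments.append({
            "id": clip_id,
            "start": float(start),
            "end": float(end),
            "labels": labels.split(","),  # a clip may carry several classes
        })
    return segments

segments = parse_segments(SAMPLE)
for seg in segments:
    print(seg["id"], seg["end"] - seg["start"], seg["labels"])
```

Each record keeps the clip boundaries and a list of class IDs, since a single clip can belong to more than one event class in the ontology.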