VGGish
- CNN Architectures for Large-Scale Audio Classification
- applying various state-of-the-art image networks with CNN architectures to audio and show that they are capable of excellent results on audio classification
- examine fully connected deep neural networks such as AlexNet, VGG, InceptionNet, and ResNet
- The input audio is divided into non-overlapping 960 ms frames which are decomposed by applying the Fourier transform, resulting in a Spectrogram
- Spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed
- gives log-mel Spectrogram patches that are passed on as input to all classifiers
- Acoustic Event Detection
- train a classifier on embeddings learned from the video-level task on AudioSet
- model for AED with embeddings learned from these classifiers does much better than raw Features on the Audio Set AED classification task
- derivatives of image classification networks do well on the audio classification task
- increasing the number of labels they train on provides some improved performance over subsets of labels
- performance of models improves as they increase training set size,
- model using embeddings learned from the video-level task do much better than a baseline on the AudioSet classification task