Learning from Visual-Audio Correspondence

  • The correspondence between the visual and audio streams of a video is used to design the Visual-Audio Correspondence learning task [25], [26], [93], [154].
  • The framework consists of two subnetworks: a vision subnetwork and an audio subnetwork.
  • The input of the vision subnetwork is a single frame or a stack of image frames; the vision subnetwork learns to capture visual features of this input.
  • The audio subnetwork is a 2D ConvNet.
  • The input of the audio subnetwork is the Fast Fourier Transform (FFT) of the audio from the video.
  • Positive training data are sampled by extracting video frames and audio from the same time point of one video, while negative training data are generated by extracting video frames and audio from different videos or from different time points of one video.
  • The networks are trained to discover the correlation between the video data and the audio data in order to accomplish this task.
  • Because the inputs of the ConvNets are two kinds of data, the networks are able to learn both kinds of information jointly by solving the pretext task.
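
The positive/negative sampling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited paper: the `Clip` handle, the `sample_pair` function, the 50/50 split between positives and negatives, and the `min_gap_sec` separation used for same-video negatives are all assumptions introduced here. The sketch only chooses where frames and audio would be extracted from; decoding the video and computing the FFT are left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical handle: a (video, time) location to extract data from."""
    video_id: str
    time_sec: float


def sample_pair(durations, rng, min_gap_sec=2.0):
    """Return (frame_clip, audio_clip, label) for one training example.

    durations: dict mapping video id -> duration in seconds (assumed input).
    label 1: frames and audio come from the same time of one video.
    label 0: audio comes from a different video, or from a different
             time of the same video (at least min_gap_sec away; the
             gap is an assumption, not taken from the source).
    """
    video_ids = sorted(durations)
    vid = rng.choice(video_ids)
    t = rng.uniform(0.0, durations[vid])
    frames = Clip(vid, t)

    # Assumed 50/50 balance between positive and negative pairs.
    if rng.random() < 0.5:
        return frames, Clip(vid, t), 1

    # Lengths of the two admissible intervals inside the same video:
    # [0, t - gap] on the left and [t + gap, duration] on the right.
    left = max(0.0, t - min_gap_sec)
    right = max(0.0, durations[vid] - t - min_gap_sec)
    same_video_possible = left + right > 0.0
    others = [v for v in video_ids if v != vid]

    # Negative from a different video.
    if others and (not same_video_possible or rng.random() < 0.5):
        other = rng.choice(others)
        return frames, Clip(other, rng.uniform(0.0, durations[other])), 0

    if not same_video_possible:
        raise ValueError("no negative pair can be formed for this video")

    # Negative from a different time of the same video: draw a point
    # uniformly over the two admissible intervals.
    u = rng.uniform(0.0, left + right)
    if left > 0.0 and u <= left:
        t_neg = u
    else:
        t_neg = t + min_gap_sec + (u - left)
    return frames, Clip(vid, t_neg), 0


if __name__ == "__main__":
    rng = random.Random(0)
    durations = {"video_a": 10.0, "video_b": 20.0}  # made-up durations
    for _ in range(3):
        print(sample_pair(durations, rng))
```

In a full pipeline, the frame clip would feed the vision subnetwork, the FFT of the audio clip would feed the audio subnetwork, and the label would serve as the correspondence target for training.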