correspondence between visual and audio streams to design a Visual-Audio Correspondence learning task [25], [26], [93], [154].
The framework has two subnetworks: a vision subnetwork and an audio subnetwork.
The input of the vision subnetwork is a single frame or a stack of image frames, from which the vision subnetwork learns to capture visual features of the input data.
The audio subnetwork is a 2D ConvNet whose input is the Fast Fourier Transform (FFT) of the audio from the video.
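As a rough illustration of how a 1D audio track can become a 2D input suitable for a 2D ConvNet, the sketch below computes a magnitude spectrogram window by window. The function names, the window length of 64 samples, and the hop of 32 samples are assumptions made for this example and are not taken from the cited works. A naive discrete Fourier transform stands in for an FFT routine so that the block stays dependency-free; both produce the same values.

```python
import cmath
import math


def dft_magnitude(frame):
    """Magnitudes of the non-negative frequency bins of a naive DFT.

    An FFT computes the same values faster; the naive form keeps the
    sketch free of third-party dependencies.
    """
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, window=64, hop=32):
    """Stack per-window magnitude spectra into a time x frequency grid.

    `window` and `hop` are hypothetical values chosen for illustration.
    """
    return [
        dft_magnitude(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, hop)
    ]


# A sine wave with exactly 8 cycles per 64-sample window.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(signal)
```

Each row of `spec` is one time step and each column one frequency bin, so the result can be treated like a single-channel image. For the test signal above, the energy of every row concentrates in frequency bin 8.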
Positive data are sampled by extracting video frames and audio from the same time in one video, while negative training data are generated by extracting video frames and audio from different videos or from different times in one video.
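The sampling rule can be sketched as follows. This is a minimal illustration and not the procedure of any particular cited work: `sample_pair`, the `clip_counts` mapping from a video identifier to its number of clips, and the `min_gap` parameter are all hypothetical names introduced here.

```python
import random


def sample_pair(clip_counts, positive, rng, min_gap=1):
    """Return (frame_ref, audio_ref, label) for one training example.

    clip_counts: dict mapping a video id to its number of time-indexed clips.
    positive:    True for a corresponding pair, False for a mismatched pair.
    min_gap:     hypothetical minimum time offset for same-video negatives.
    """
    videos = list(clip_counts)
    vid = rng.choice(videos)
    t = rng.randrange(clip_counts[vid])
    if positive:
        # Frames and audio come from the same time of the same video.
        return (vid, t), (vid, t), 1
    # Audio comes from a different video, or from a different time
    # of the same video.
    candidates = [
        (v, s)
        for v in videos
        for s in range(clip_counts[v])
        if v != vid or abs(s - t) >= min_gap
    ]
    if not candidates:
        raise ValueError("no negative audio clip available")
    return (vid, t), rng.choice(candidates), 0


rng = random.Random(0)
clip_counts = {"video_a": 5, "video_b": 3}
positives = [sample_pair(clip_counts, True, rng) for _ in range(50)]
negatives = [sample_pair(clip_counts, False, rng) for _ in range(50)]
```

Raising `min_gap` would keep same-video negatives further apart in time, which may matter when neighboring clips share nearly identical audio.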
The networks are therefore trained to discover the correlation between the video data and the audio data to accomplish this task.
Because the inputs of the ConvNets are two kinds of data, the networks are able to learn the two kinds of information jointly by solving the pretext task.
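One way to picture the joint learning is as a binary classification over fused features. The sketch below is an assumption and not something this section specifies: it concatenates the two feature vectors, applies a single linear layer with a sigmoid, and scores the prediction with binary cross-entropy. The function names, the fusion by concatenation, and the choice of loss are hypothetical.

```python
import math


def correspondence_probability(visual_feat, audio_feat, weights, bias):
    """Probability that the frames and the audio correspond.

    Assumes fusion by concatenation followed by one linear layer and a
    sigmoid; real systems may fuse and classify differently.
    """
    fused = list(visual_feat) + list(audio_feat)
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, label, eps=1e-12):
    """Loss for one pair; label is 1 (corresponding) or 0 (mismatched)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# With all-zero weights the classifier is undecided between the two labels.
p = correspondence_probability([1.0, 0.0], [0.0, 1.0], [0.0] * 4, 0.0)
loss = binary_cross_entropy(p, 1)
```

Because the loss depends on both feature vectors at once, its gradient would flow into both subnetworks, which is the sense in which the two kinds of information are learned jointly.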