Learning with Labels Generated by Hard-code Programs

  • Applying hard-code programs is another way to automatically generate semantic labels such as saliency, foreground masks, contours, and depth for images and videos.
  • Very large-scale datasets with generated semantic labels can then be used for self-supervised feature learning.
  • Hard-code programs have been applied to generate labels for self-supervised methods including foreground object segmentation [81], edge detection [47], and relative depth prediction [92].
  • Pathak et al. proposed to learn features by training a ConvNet to segment foreground objects in each frame of a video, where the label is the mask of moving objects in that video [81].
  • Li et al. proposed to learn features by training a ConvNet for edge prediction, where the labels are motion edges obtained from flow fields [47].
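The label-generation idea above can be illustrated with a small sketch. It is not the procedure used in [81] or [47]: it assumes a dense flow field is already available as rows of (dx, dy) pairs, thresholds the motion magnitude to get a foreground mask, and marks mask boundaries as motion edges. The function names and the threshold value are hypothetical.

```python
def flow_magnitude(flow):
    """Per-pixel motion magnitude for a flow field stored as rows of (dx, dy)."""
    return [[(dx * dx + dy * dy) ** 0.5 for dx, dy in row] for row in flow]


def foreground_mask(flow, threshold=1.0):
    """Label a pixel as moving foreground (1) if its flow magnitude exceeds
    the threshold. The threshold is an illustrative assumption."""
    return [[1 if m > threshold else 0 for m in row]
            for row in flow_magnitude(flow)]


def motion_edges(mask):
    """Mark a pixel as an edge (1) if its right or bottom neighbor has a
    different mask value. A crude stand-in for a real motion-edge detector."""
    h, w = len(mask), len(mask[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and mask[y][x] != mask[y][x + 1]
            down = y + 1 < h and mask[y][x] != mask[y + 1][x]
            edges[y][x] = 1 if (right or down) else 0
    return edges


# Toy 4x4 flow field: a 2x2 block in the middle moves right, the rest is static.
still, moving = (0.0, 0.0), (2.0, 0.0)
flow = [
    [still, still, still, still],
    [still, moving, moving, still],
    [still, moving, moving, still],
    [still, still, still, still],
]
mask = foreground_mask(flow)   # pseudo-label for a segmentation pretext task
edges = motion_edges(mask)     # pseudo-label for an edge prediction pretext task
```

Both outputs are produced without any human annotation, which is the property that lets such labels scale to very large unlabeled video collections.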
  • After GAN-based methods obtained breakthrough results in image generation, researchers employed GANs to generate videos [85], [86], [144].
  • VideoGAN
    • To model the motion of objects in videos, a two-stream network is proposed for video generation: one stream models the static regions of a video as background, and the other models the moving objects as foreground
    • Videos are generated by the combination of the foreground and background streams
    • Each random variable in the latent space represents one video clip
    • Tulyakov et al. argue that this assumption increases the difficulty of generation
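The combination of the two streams described above can be sketched as follows. The notes do not state the combining operator, so this sketch assumes a per-pixel mask-weighted blend, with one static background image shared by all frames and a per-frame foreground and mask. The function name and the toy values are hypothetical, and in a real model all three inputs would be produced by the generator rather than written by hand.

```python
def combine_streams(foreground, background, mask):
    """Blend a per-frame foreground with a single static background.

    foreground: list of T frames, each a list of rows of pixel values
    background: one frame (list of rows), reused for every time step
    mask:       list of T frames with values in [0, 1]; 1 selects foreground

    Assumed blend per pixel: mask * foreground + (1 - mask) * background.
    """
    video = []
    for fg_frame, mask_frame in zip(foreground, mask):
        frame = [
            [m * f + (1.0 - m) * b for f, b, m in zip(fg_row, bg_row, m_row)]
            for fg_row, bg_row, m_row in zip(fg_frame, background, mask_frame)
        ]
        video.append(frame)
    return video


# Toy clip: 2 frames of size 1x2. A bright object moves from the left pixel
# to the right pixel over a dim, static background.
background = [[0.2, 0.2]]
foreground = [[[1.0, 1.0]], [[1.0, 1.0]]]
mask = [[[1.0, 0.0]], [[0.0, 1.0]]]
video = combine_streams(foreground, background, mask)
```

Because the background is indexed once and reused for every frame, only the foreground and mask carry motion, which mirrors the static/moving split between the two streams.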
  • MoCoGAN
    • Uses a combination of two subspaces to represent a video by disentangling the context and motion in videos [86]
    • Context space: each variable from this space represents one identity
    • Motion space: a trajectory in this space represents the motion of the identity
    • With the two subspaces, the network is able to generate videos with a higher Inception Score.
    • The generator learns to map latent vectors from the latent space into videos, while the discriminator learns to distinguish real-world videos from generated videos.
    • After video generation training on a large-scale unlabeled dataset has finished, the parameters of the discriminator can be transferred to other downstream tasks [85].
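The split into a context space and a motion space described above can be sketched as a sampling routine: one context vector is drawn per video and held fixed, while the motion vector follows a trajectory over frames. This is a minimal sketch under stated assumptions, not the method of [86]: the trajectory here is a simple Gaussian random walk standing in for whatever learned model produces it, and the function name, dimensions, and step size are hypothetical.

```python
import random


def sample_video_latents(num_frames, context_dim, motion_dim, step=0.1, seed=0):
    """Return one latent vector per frame, each the concatenation of a
    fixed context part and a time-varying motion part."""
    rng = random.Random(seed)
    # Context: sampled once, so every frame shares the same identity.
    context = [rng.gauss(0.0, 1.0) for _ in range(context_dim)]
    # Motion: starting point of the trajectory in the motion space.
    motion = [rng.gauss(0.0, 1.0) for _ in range(motion_dim)]
    latents = []
    for _ in range(num_frames):
        latents.append(context + motion)  # concatenation builds a new list
        # Assumed random-walk update; a placeholder for a learned trajectory.
        motion = [m + step * rng.gauss(0.0, 1.0) for m in motion]
    return latents


latents = sample_video_latents(num_frames=4, context_dim=3, motion_dim=2)
context_parts = [z[:3] for z in latents]  # identical across frames
motion_parts = [z[3:] for z in latents]   # changes from frame to frame
```

A generator would map each per-frame latent to an image, so changing only the motion part alters the movement while the identity stays the same, and resampling the context changes the identity.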