Cross Modal-based Methods
Downstream Task
Ego-motion
Feature Map Visualization
Free Semantic Label-based Method
Human Action Recognition
Image Classification
Image Generation with Colorization
Image Generation with Inpainting
Image Generation with Super Resolution
Image Generation
Kernel Visualization
Learning from RGB-Flow Correspondence
Learning from Video Colorization
Learning from Video Prediction
Learning from Visual-Audio Correspondence
Learning with Context Similarity
Learning with Labels Generated by Game Engines
Learning with Labels Generated by Hard-code Programs
Learning with Spatial Context Structure
Nearest Neighbor Retrieval
Object Detection
Pretext Task
Pretext Tasks
Pseudo Label
Self Supervised Survey
Self-supervised Learning
Semantic Segmentation
Semi-Supervised Learning Formulation
Spatial Context Structure
Spatiotemporal Convolutional Neural Network
Supervised Learning Formulation
Temporal Context Structure
Temporal order recognition
Temporal order verification
Video Generation
Weakly Supervised Learning Formulation
Weakly-supervised Learning