MCnet
- Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for video prediction
- two encoders, one is Content Encoder to capture the spatial layout of an image, and the other is Motion Encoder to model temporal dynamics within video clips.
- The spatial features and temporal features are concatenated to feed to the decoder to generate the next frame
- separately modeling temporal and spatial features, this model can effectively generate future frames recursively.
- Videos consist of various lengths of frames which have rich spatial and temporal information
- inherent temporal information within videos can be used as supervision signal for self-supervised feature learning
- pretext tasks have been proposed by utilizing temporal context relations including temporal order verification [29], [40], [90] and temporal order recognition [27], [39]