The 3D convolution operation was first proposed in 3DNet [62] for human action recognition.
Compared to 2DConvNets, which extract the spatial information of each frame individually and then fuse the per-frame features into a video representation, 3DConvNets extract spatial and temporal features from multiple frames simultaneously.
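To make the joint spatio-temporal extraction concrete, the following is a minimal, hypothetical sketch (not code from [62]) of a single-channel 3D convolution: one 3 x 3 x 3 kernel slides over time, height, and width at once, so motion cues across frames and appearance cues within frames contribute to the same output value.

```python
import numpy as np

def conv3d(video, kernel, stride=1):
    """Valid (no-padding) 3D convolution of a (T, H, W) volume
    with a (t, h, w) kernel; channels are omitted for clarity."""
    t, h, w = kernel.shape
    T, H, W = video.shape
    out = np.zeros(((T - t) // stride + 1,
                    (H - h) // stride + 1,
                    (W - w) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value mixes 3 frames x 3x3 pixels,
                # i.e. temporal and spatial context together.
                patch = video[i*stride:i*stride + t,
                              j*stride:j*stride + h,
                              k*stride:k*stride + w]
                out[i, j, k] = np.sum(patch * kernel)
    return out

video = np.random.rand(16, 8, 8)   # a 16-frame clip of 8x8 pixels
kernel = np.random.rand(3, 3, 3)   # one 3x3x3 spatio-temporal kernel
features = conv3d(video, kernel)
print(features.shape)              # (14, 6, 6)
```

A 2DConvNet applied per frame would instead produce 16 independent spatial maps, leaving the temporal fusion to a later stage.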
C3D is a VGG-like 11-layer 3DConvNet designed for human action recognition.
The network contains 8 convolutional layers and 3 fully connected layers. All kernels are of size 3 x 3 x 3, and the convolution stride is fixed to 1 pixel.
The input to C3D is a clip of 16 consecutive RGB frames, from which both appearance and temporal cues are extracted.
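The layer layout described above can be sketched by propagating feature-map shapes through the network. This is a simplified, hypothetical reconstruction: the 8 conv layers, 3 x 3 x 3 kernels, stride 1, and 16-frame input come from the text, while the 112 x 112 crop size, the pooling schedule, and the use of unpadded floor-division pooling are assumptions for illustration.

```python
# Shape-propagation sketch of a C3D-style network (channels omitted).
def conv3x3x3(shape):
    # 3x3x3 convolution, stride 1, padding 1: preserves (T, H, W).
    return shape

def pool(shape, kt, ks):
    # Unpadded max pooling with temporal factor kt, spatial factor ks.
    t, h, w = shape
    return (t // kt, h // ks, w // ks)

shape = (16, 112, 112)  # 16-frame clip; 112x112 crop is an assumption
plan = [
    ("conv1", 1), ("pool1", (1, 2)),  # assumed: pool only spatially first,
    ("conv2", 1), ("pool2", (2, 2)),  # so temporal detail survives longer
    ("conv3", 2), ("pool3", (2, 2)),
    ("conv4", 2), ("pool4", (2, 2)),
    ("conv5", 2), ("pool5", (2, 2)),
]
n_convs = 0
for name, spec in plan:
    if name.startswith("conv"):
        for _ in range(spec):
            shape = conv3x3x3(shape)
            n_convs += 1
    else:
        kt, ks = spec
        shape = pool(shape, kt, ks)
    print(name, shape)
print("conv layers:", n_convs)  # 8, as stated above
```

After the last pooling stage the temporal dimension has collapsed to 1, and the flattened volume would feed the 3 fully connected layers.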
However, the work on long-term temporal convolutions (LTC) [67] argues that, for long-lasting actions, 16 frames are insufficient to cover the whole action.