Large Kernel in Convolution
- Global Convolutional Network (GCN) (Peng et al., 2017) enlarges the kernel size to 15 by combining 1×M + M×1 and M×1 + 1×M convolutions (see the GCN-style sketch after this list).
- However, the proposed design leads to performance degradation on ImageNet.
- Another line of work varies the convolutional kernel size to learn spatial patterns at different scales. With the popularity of VGG (Simonyan & Zisserman, 2014), it has been common over the past decade to use a stack of small kernels (1×1 or 3×3) to obtain a large receptive field (see the receptive-field arithmetic after this list).
- However, the performance improvement plateaus as the kernel size is enlarged further.
- Han et al. (2021b) find that 7×7 dynamic depth-wise convolution performs on par with local attention when it replaces the local attention layers in Swin Transformer.
- Liu et al. (2022b) imitate the design elements of Swin Transformer (Liu et al., 2021e) and propose ConvNeXt, which employs 7×7 depth-wise kernels and surpasses Swin Transformer's performance (a block sketch appears after this list).
- More recently, Chen et al. (2022) show that large kernels are also feasible and beneficial for 3D networks.
- Prior works have explored combining two complementary M×1 and 1×M kernels, either in parallel (Peng et al., 2017; Guo et al., 2022a) or stacked sequentially (Szegedy et al., 2017); a parameter-count comparison appears after this list.
- However, they limit the shorter edge to 1 and do not scale the kernel size beyond 51×51.
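The GCN-style decomposition referenced above approximates a dense k×k kernel with two parallel branches, (1×k then k×1) and (k×1 then 1×k), whose outputs are summed. The PyTorch sketch below is a minimal illustration of that idea; the layer names, channel counts, and the choice of k = 15 are assumptions for the example, not the authors' released code.

```python
# Minimal sketch of a GCN-style block: a k x k kernel approximated by two
# parallel branches of complementary 1 x k and k x 1 convolutions.
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        pad = k // 2
        # Branch A: 1 x k followed by k x 1
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, pad)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1), padding=(pad, 0)),
        )
        # Branch B: k x 1 followed by 1 x k
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(k, 1), padding=(pad, 0)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(1, k), padding=(0, pad)),
        )

    def forward(self, x):
        # Summing the two branches approximates a dense k x k convolution
        # while using far fewer weights than a full k x k kernel.
        return self.branch_a(x) + self.branch_b(x)

x = torch.randn(1, 64, 56, 56)
print(GCNBlock(64, 64, k=15)(x).shape)  # torch.Size([1, 64, 56, 56])
```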
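The receptive-field argument behind stacking small kernels can be made concrete with the standard formula below (stride-1 convolutions assumed); the numbers are a worked example, not figures from the cited papers.

```latex
% Receptive field after n stacked stride-1 convolutions with kernel size k:
\[
  \mathrm{RF}(n) \;=\; 1 + \sum_{i=1}^{n} (k_i - 1)\prod_{j=1}^{i-1} s_j
                 \;=\; 1 + n\,(k - 1)
  \qquad (k_i = k,\; s_j = 1).
\]
% With k = 3: RF(2) = 5, and RF(7) = 15, so seven stacked 3x3 layers are
% needed to match the receptive field of a single 15x15 kernel.
```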
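The 7×7 depth-wise design discussed in the Swin Transformer and ConvNeXt bullets can be sketched as a single residual block. The module below follows the published ConvNeXt block only at a high level (depth-wise 7×7, LayerNorm, 1×1 expansion, GELU, 1×1 projection, residual) and omits details such as layer scale and stochastic depth; names and hyperparameters are illustrative.

```python
# Sketch of a ConvNeXt-style block built around a 7x7 depth-wise convolution.
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, expansion=4):
        super().__init__()
        # Depth-wise 7x7: one kernel per channel (groups=dim), so spatial-mixing
        # cost grows with kernel area but not with the square of channel width.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # 1x1 conv as Linear over channels
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):             # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)     # (N, H, W, C) for channel-last norm/linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)     # back to (N, C, H, W)
        return x + residual

x = torch.randn(1, 96, 56, 56)
print(LargeKernelBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```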
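One reason the complementary M×1 + 1×M decomposition is attractive is its parameter count: per depth-wise channel, a dense M×M kernel costs M² weights, while the two strips cost 2M whether they are applied in parallel or stacked. The short calculation below is illustrative arithmetic, not a measurement from the cited papers.

```python
# Per-channel weight counts: dense M x M kernel vs. an M x 1 + 1 x M pair.
def dense_params(m: int) -> int:
    return m * m

def decomposed_params(m: int) -> int:
    return 2 * m  # one M x 1 strip plus one 1 x M strip

for m in (7, 15, 31, 51):
    print(f"M={m:>2}: dense={dense_params(m):>4}  decomposed={decomposed_params(m):>3}  "
          f"ratio={dense_params(m) / decomposed_params(m):.1f}x")
# M= 7: dense=  49  decomposed= 14  ratio=3.5x
# M=15: dense= 225  decomposed= 30  ratio=7.5x
# M=31: dense= 961  decomposed= 62  ratio=15.5x
# M=51: dense=2601  decomposed=102  ratio=25.5x
```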