The Global Convolutional Network (GCN) (Peng et al., 2017) enlarges the kernel size to 15 by employing a combination of 1×M + M×1 and M×1 + 1×M convolutions. However, the proposed method leads to performance degradation on ImageNet.
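To make the decomposition concrete, the following is a minimal PyTorch sketch of a GCN-style block that approximates a large M×M kernel with two parallel branches of separable convolutions; the module and parameter names are ours for illustration and do not reproduce the reference implementation.

```python
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    """Sketch of a GCN-style block (Peng et al., 2017).

    A large M x M kernel is approximated by summing two parallel branches:
        branch A: 1 x M followed by M x 1
        branch B: M x 1 followed by 1 x M
    Names and defaults here are illustrative, not the authors' code.
    """

    def __init__(self, in_ch: int, out_ch: int, m: int = 15):
        super().__init__()
        pad = m // 2  # preserve spatial resolution for odd M
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, m), padding=(0, pad)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(m, 1), padding=(pad, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(m, 1), padding=(pad, 0)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(1, m), padding=(0, pad)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch_a(x) + self.branch_b(x)


x = torch.randn(1, 64, 56, 56)
print(GCNBlock(64, 21, m=15)(x).shape)  # torch.Size([1, 21, 56, 56])
```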
Other designs rely on the utilization of varying convolutional kernel sizes to learn spatial patterns at different scales. With the popularity of VGG (Simonyan & Zisserman, 2014), it has been common over the past decade to use a stack of small kernels (1×1 or 3×3) to obtain a large receptive field.
However, the performance improvement plateaus when further expanding the kernel size.
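To make the receptive-field argument concrete, the sketch below computes the effective receptive field of a chain of stride-1 convolutions: each layer of kernel size k adds k-1 pixels, so n stacked 3×3 layers cover a (2n+1)×(2n+1) window. The helper function is ours, for illustration only.

```python
def receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a chain of convolutions.

    RF = 1 + sum_i (k_i - 1) * prod_{j < i} s_j  (stride 1 by default).
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stacked 3x3, stride-1 convolutions cover the same window as one 7x7 kernel.
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7]))        # 7
```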
Han et al. (2021b) find that dynamic depth-wise convolution (7×7) performs on par with the local attention mechanism when the latter is replaced by the former in Swin Transformer.
Liu et al. (2022b) follow the design elements of Swin Transformer (Liu et al., 2021e) to design ConvNeXt, which employs 7×7 kernels and surpasses the performance of the former.
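As a rough illustration of the spatial-mixing layer these works rely on, the snippet below builds a 7×7 depth-wise convolution (one filter per channel via groups=channels); the channel width is an assumption of ours and the snippet does not reproduce either architecture.

```python
import torch
import torch.nn as nn

# 7x7 depth-wise convolution: groups equals the channel count, so each
# output channel is produced from its own input channel with a 7x7 filter.
dim = 96  # illustrative channel width, not taken from the cited models
dw_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)

x = torch.randn(1, dim, 56, 56)
print(dw_conv(x).shape)  # torch.Size([1, 96, 56, 56])
```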
Lately, Chen et al. (2022) reveal large kernels to be feasible and beneficial for 3D networks too.
Prior works have explored the idea of paralleling (Peng et al., 2017; Guo et al., 2022a) or stacking (Szegedy et al., 2017) two complementary M×1 and 1×M kernels. However, they limit the shorter edge to 1 and do not scale the kernel size beyond 51×51.
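For completeness, the stacked variant can be sketched as follows: a 1×M convolution followed by an M×1 convolution approximates an M×M kernel in sequence, rather than through the parallel branches shown earlier. The function name and sizes are ours for illustration.

```python
import torch
import torch.nn as nn

def stacked_1d_pair(in_ch: int, out_ch: int, m: int = 7) -> nn.Sequential:
    """Illustrative stacked decomposition: 1 x M followed by M x 1."""
    pad = m // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, m), padding=(0, pad)),
        nn.Conv2d(out_ch, out_ch, kernel_size=(m, 1), padding=(pad, 0)),
    )

x = torch.randn(1, 64, 56, 56)
print(stacked_1d_pair(64, 64, m=7)(x).shape)  # torch.Size([1, 64, 56, 56])
```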