ConvNeXt

  • @liuConvNet2020s2022
  • modifies a standard ResNet, following design choices closely inspired by the Vision Transformer (ViT)
  • A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation
  • hierarchical Transformers (e.g., the Swin Transformer) reintroduced several convolutional priors, making Transformers practically viable as a generic vision backbone
  • effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions
  • modernized training recipe: more training epochs, the AdamW optimizer, Stochastic Depth, and Label Smoothing (see the training-recipe fragment after this list)
  • number of blocks in each stage (stage compute ratio), adjusted from (3, 4, 6, 3) to (3, 3, 9, 3)
  • The second is the stem cell configuration, which in the original ResNet consisted of a 7×7 convolution with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer that uses a 4×4 non-overlapping convolution with stride 4 (see the macro-design sketch after this list)
  • depthwise convolutions (the spatial component of depthwise separable convolutions), which are interestingly similar to self-attention in that they operate on a per-channel basis, mixing information only in the spatial dimension
  • higher number of channels (from 64 to 96)
  • Inverted Bottleneck: an essential design in Transformer blocks is the expansion ratio of the MLP (the hidden dimension is 4 times wider than the input and output dimensions)
  • the input is expanded with a 1×1 convolution, processed by a depthwise convolution, and then projected back down with another 1×1 convolution
  • move the depthwise convolution before the 1×1 convolutions, mirroring the Transformer block, where self-attention is placed before the MLP
  • increase the depthwise convolution kernel to a 7×7 window (larger kernel sizes brought no further improvement in the results)
  • GELU instead of ReLU, a single activation per block (the Transformer block likewise has just one activation, inside the MLP), fewer normalization layers, Batch Normalization substituted by Layer Normalization, and separate downsampling layers between stages (see the block sketch after this list)
  • evaluated on ImageNet (image classification)
  • COCO (object detection and instance segmentation)
  • ADE20K (semantic segmentation)
  • some settings may still favor Transformers: a case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across modalities
  • Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs
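
A minimal PyTorch sketch of the block described above: depthwise 7×7 convolution first, a single Layer Normalization, and an inverted-bottleneck MLP with one GELU. Class and parameter names are my own, and details such as layer scale and stochastic depth are omitted; this is a sketch of the idea, not the official implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt-style block: depthwise 7x7 convolution first,
    a single LayerNorm, and an inverted-bottleneck MLP with one GELU."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # Depthwise convolution moved to the front of the block:
        # large 7x7 kernel, one filter per channel (groups=dim).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)  # the only normalization in the block
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # 1x1 conv as Linear: expand x4
        self.act = nn.GELU()                            # the single activation per block
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # 1x1 conv as Linear: project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)          # (N, C, H, W)
        x = x.permute(0, 2, 3, 1)   # channels-last so LayerNorm/Linear act on the channel dim
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        x = x.permute(0, 3, 1, 2)   # back to channels-first
        return shortcut + x         # residual connection
```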
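A sketch of the macro design under the same assumptions: the 4×4 stride-4 “patchify” stem, separate 2×2 stride-2 downsampling layers between stages, stage depths (3, 3, 9, 3), and a base width of 96. It reuses the ConvNeXtBlock class from the sketch above; the actual model also applies Layer Normalization around the stem and downsampling layers, which is left out here for brevity.

```python
class ConvNeXtTiny(nn.Module):
    """Sketch of the macro design: patchify stem, separate downsampling layers,
    stage depths (3, 3, 9, 3), base width 96. Reuses ConvNeXtBlock from above."""

    def __init__(self, in_chans: int = 3, depths=(3, 3, 9, 3), dims=(96, 192, 384, 768)):
        super().__init__()
        # "Patchify" stem: 4x4 non-overlapping convolution with stride 4.
        self.stem = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)
        self.stages = nn.ModuleList(
            nn.Sequential(*[ConvNeXtBlock(dim) for _ in range(depth)])
            for depth, dim in zip(depths, dims)
        )
        # Separate downsampling layers between stages: 2x2 convolution with stride 2.
        self.downsample = nn.ModuleList(
            nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2)
            for i in range(len(dims) - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.downsample):
                x = self.downsample[i](x)
        return x  # final feature map; pooling plus a linear head would follow for classification

# Quick shape check (assuming a 224x224 input):
# ConvNeXtTiny()(torch.randn(1, 3, 224, 224)).shape -> torch.Size([1, 768, 7, 7])
```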
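An illustrative fragment of the modernized training recipe (AdamW plus label-smoothed cross-entropy). The hyperparameter values are placeholders of my own choosing, not the paper's exact settings, and Stochastic Depth would be applied inside the blocks themselves.

```python
# Illustrative only: values below are assumptions, not the paper's exact hyperparameters.
model = ConvNeXtTiny()  # from the macro-design sketch above (plus a classification head)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # Label Smoothing
# Stochastic Depth (drop path) would live inside each ConvNeXtBlock, and the
# schedule would run for an extended number of epochs (e.g., 300 instead of 90).
```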