Masked Autoencoders

  • Paper: Masked Autoencoders Are Scalable Vision Learners
  • Shows on ImageNet and in transfer learning that masked autoencoders, a simple self-supervised method similar to techniques in NLP, provide scalable benefits
  • mask random patches of the input image and reconstruct the missing pixels
  • asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens (a minimal code sketch follows at the end of these notes)
  • images and language are signals of a different nature
  • Images are merely recorded light without a semantic decomposition into the visual analogue of words
  • The word (or subword) analog for images is the pixel
  • But decomposing the image into patches (as in the Vision Transformer) reduces the quadratic computation cost of transformers compared to operating at the pixel level; e.g., a 224×224 image becomes 196 tokens of 16×16 patches instead of 50,176 pixel tokens
  • remove random patches that most likely do not form a semantic segment
  • Likewise, MAE reconstructs pixels, which are not semantic entities
  • They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task
  • After pre-training, the decoder is thrown away and the encoder is fine-tuned for downstream tasks
  • A vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data
  • Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior
  • Semantics emerge by way of a rich hidden representation inside the MAE
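
To make the masking and the asymmetric encoder-decoder concrete, below is a minimal PyTorch sketch of MAE-style pre-training. It is not the authors' implementation: the class name TinyMAE, the layer counts and widths, and the use of plain nn.TransformerEncoder blocks are illustrative assumptions, and details such as per-patch normalized pixel targets are omitted. It does show the key ideas from the notes above: drop a random 75% of the patches, run the encoder only on the visible subset (no mask tokens), and let a lightweight decoder reconstruct raw pixels, with the loss computed only on the masked patches.

```python
# Minimal sketch of MAE-style pre-training (illustrative sizes and names, not the official code).
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    def __init__(self, img_size=224, patch=16, enc_dim=256, dec_dim=128,
                 enc_depth=4, dec_depth=2, mask_ratio=0.75):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2      # 196 tokens for 224 / 16
        self.patch_dim = 3 * patch * patch               # flattened RGB patch
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(self.patch_dim, enc_dim)
        self.pos_enc = nn.Parameter(torch.zeros(1, self.num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), enc_depth)
        # Lightweight decoder: narrower and shallower than the encoder.
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pos_dec = nn.Parameter(torch.zeros(1, self.num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), dec_depth)
        self.to_pixels = nn.Linear(dec_dim, self.patch_dim)

    def forward(self, patches):
        """patches: (B, N, patch_dim) flattened image patches; returns the reconstruction loss."""
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))          # e.g. keep 25% of the patches
        keep = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :n_keep]
        batch = torch.arange(B, device=patches.device).unsqueeze(1)
        visible = patches[batch, keep]                   # (B, n_keep, patch_dim)
        # Encoder operates only on the visible subset, without mask tokens.
        tokens = self.patch_embed(visible) + self.pos_enc.squeeze(0)[keep]
        latent = self.encoder(tokens)
        # Decoder input: mask tokens everywhere, encoded tokens scattered back to their positions.
        dec_tokens = self.mask_token.expand(B, N, -1).clone()
        dec_tokens[batch, keep] = self.enc_to_dec(latent)
        pred = self.to_pixels(self.decoder(dec_tokens + self.pos_dec))
        # Mean-squared error computed only on the masked (removed) patches.
        mask = torch.ones(B, N, device=patches.device)
        mask[batch, keep] = 0.0
        loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
        return loss


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    # Cut images into 16x16 patches and flatten each to a vector: (B, 196, 768).
    patches = (imgs.unfold(2, 16, 16).unfold(3, 16, 16)
                   .permute(0, 2, 3, 1, 4, 5).reshape(2, -1, 3 * 16 * 16))
    loss = TinyMAE()(patches)
    loss.backward()
    print(float(loss))
```

After pre-training, only the patch embedding, positional embeddings, and encoder would be kept and fine-tuned for the downstream task; the decoder and reconstruction head are discarded, as noted above.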