Masked Autoencoders
- Masked Autoencoders are Scalable Vision Learners
- On ImageNet and in transfer learning, masked autoencoders, a simple self-supervised method similar to masked-prediction techniques in NLP, provide scalable benefits
- mask random patches of the input image and reconstruct the missing pixels
- asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens
- Images and language are signals of a different nature
- Images are merely recorded light without a semantic decomposition into the visual analogue of words
- The word (or subword) analog for images is the pixel
- But decomposing the image into patches (as in the Vision Transformer) reduces the quadratic computation cost of transformers compared to operating at the pixel level; a worked cost comparison follows these notes
- MAE removes random patches that most likely do not form semantic segments
- Likewise, MAE reconstructs pixels, which are not semantic entities
- They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task (see the sketch after these notes)
- After pre-training, the decoder is thrown away and the encoder is fine-tuned for downstream tasks (see the fine-tuning sketch after these notes)
- A vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data
- Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior
- Semantics are hypothesized to emerge by way of a rich hidden representation inside the MAE
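
On the patch-vs-pixel cost point: for a 224×224 image, pixel-level self-attention would run over 224 × 224 = 50,176 tokens, whereas 16×16 patches give only (224/16)² = 196 tokens; since self-attention cost grows quadratically with token count, that is roughly 50,176² ≈ 2.5×10⁹ versus 196² ≈ 3.8×10⁴ pairwise interactions per layer.

Below is a minimal, hedged sketch of the MAE idea (not the authors' official implementation): 75% random patch masking, an encoder that sees only visible patches, and a lightweight decoder with mask tokens that reconstructs pixels of the masked patches. Patch size, widths, depths, and the tiny Transformer blocks are illustrative assumptions, and positional embeddings are omitted for brevity.

```python
# Minimal MAE-style sketch (illustrative, not the official implementation).
import torch
import torch.nn as nn


def patchify(imgs, patch_size=16):
    """Split (B, 3, H, W) images into (B, N, patch_size*patch_size*3) patch vectors."""
    b, c, h, w = imgs.shape
    p = patch_size
    x = imgs.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)
    return x


def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patches; return kept patches, binary mask, restore indices."""
    b, n, d = x.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=x.device)      # one random score per patch
    ids_shuffle = noise.argsort(dim=1)             # random permutation of patches
    ids_restore = ids_shuffle.argsort(dim=1)       # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=x.device)       # 1 = masked (removed), 0 = kept
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)      # back to original patch order
    return x_kept, mask, ids_restore


class TinyMAE(nn.Module):
    """Asymmetric encoder-decoder: the encoder sees only visible patches; a lightweight
    decoder reconstructs all patches from the latents plus mask tokens.
    NOTE: positional embeddings are omitted here for brevity; a real model needs them."""

    def __init__(self, patch_dim=16 * 16 * 3, enc_dim=256, dec_dim=128, depth=4, dec_depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=dec_depth)
        self.to_pixels = nn.Linear(dec_dim, patch_dim)

    def forward(self, imgs, mask_ratio=0.75):
        patches = patchify(imgs)                                   # (B, N, P)
        visible, mask, ids_restore = random_masking(patches, mask_ratio)
        latent = self.encoder(self.embed(visible))                 # encoder: visible patches only
        dec_in = self.enc_to_dec(latent)
        b, n = patches.shape[:2]
        mask_tokens = self.mask_token.expand(b, n - dec_in.shape[1], -1)
        full = torch.cat([dec_in, mask_tokens], dim=1)             # kept latents + mask tokens
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        pred = self.to_pixels(self.decoder(full))                  # reconstruct all patches
        loss = ((pred - patches) ** 2).mean(dim=-1)                # per-patch MSE
        loss = (loss * mask).sum() / mask.sum()                    # loss on masked patches only
        return loss, pred, mask


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    loss, pred, mask = TinyMAE()(imgs)
    print(loss.item(), pred.shape, mask.mean().item())  # mask mean ~0.75
```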
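And a hypothetical fine-tuning sketch building on the `TinyMAE` sketch above (not the paper's exact recipe): the decoder is simply never reused, the pre-trained patch embedding and encoder are kept, and a new classification head is trained. The 1000-class head and mean pooling over patch tokens are illustrative assumptions.

```python
# Hypothetical fine-tuning sketch: reuse the pre-trained encoder, drop the decoder.
import torch
import torch.nn as nn


class FineTuneClassifier(nn.Module):
    def __init__(self, pretrained: TinyMAE, num_classes=1000, enc_dim=256):
        super().__init__()
        self.embed = pretrained.embed      # pre-trained patch embedding, reused
        self.encoder = pretrained.encoder  # pre-trained encoder, reused
        # pretrained.decoder / mask_token / to_pixels are not referenced here,
        # i.e., the decoder is thrown away after pre-training.
        self.head = nn.Linear(enc_dim, num_classes)  # new task-specific head

    def forward(self, imgs):
        tokens = self.embed(patchify(imgs))    # all patches, no masking at fine-tune time
        latent = self.encoder(tokens)
        return self.head(latent.mean(dim=1))   # average-pool patch tokens -> logits


# Example: two random 224x224 images -> (2, 1000) logits.
logits = FineTuneClassifier(TinyMAE())(torch.randn(2, 3, 224, 224))
```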
