toc: true
title: BEiT

tags: ['temp']


BEiT

  • BEiT: BERT Pre-Training of Image Transformers
    • Self-supervised pre-trained representation model for vision
    • BEiT stands for Bidirectional Encoder representation from Image Transformers
    • Uses a masked image modeling (MIM) task to pre-train vision Transformers
    • Each image has two views during pre-training: image patches and visual tokens
    • Image-patch embeddings are computed as linear projections of flattened patches
    • Visual tokens are produced by a discrete VAE (dVAE) that acts as an "image tokenizer", learned via autoencoding-style reconstruction
    • The input image is tokenized into discrete visual tokens given by the latent codes of the dVAE
    • The visual tokens are critical to making BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers
    • The model automatically acquires knowledge about semantic regions without using any human-annotated data
    • Pre-training randomly masks some image patches and feeds the corrupted image into the backbone Transformer
    • The pre-training objective is to recover the original visual tokens of the masked image patches
    • For downstream tasks, the model parameters are directly fine-tuned by appending task layers on top of the pretrained encoder
    • Evaluated on ImageNet classification
    • Outperforms DeiT trained from scratch
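The pipeline above (linear patch projection, random masking with a learnable mask token, and a head that predicts the dVAE visual tokens at masked positions) can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the class/function names (`MIMBackbone`, `mim_loss`) and all dimensions are assumptions, and the dVAE tokenizer is stood in for by pre-computed token ids.

```python
import torch
import torch.nn as nn

# Illustrative config, not the paper's values except VOCAB_SIZE/NUM_PATCHES.
VOCAB_SIZE = 8192    # dVAE visual-token vocabulary size
NUM_PATCHES = 196    # 14x14 patches for a 224x224 image, 16x16 patch size
PATCH_DIM = 768      # 16*16*3 flattened patch pixels
EMBED_DIM = 192      # small embedding dim to keep the sketch light

class MIMBackbone(nn.Module):
    """Sketch of a BEiT-style masked image modeling backbone."""

    def __init__(self):
        super().__init__()
        # Linear projection of flattened patches (the "image patch" view).
        self.patch_embed = nn.Linear(PATCH_DIM, EMBED_DIM)
        # Learnable embedding that replaces masked patches.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, EMBED_DIM))
        self.pos_embed = nn.Parameter(torch.zeros(1, NUM_PATCHES, EMBED_DIM))
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Head predicts the dVAE visual token at each patch position.
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, patches, mask):
        # patches: (B, N, PATCH_DIM); mask: (B, N) bool, True = masked.
        x = self.patch_embed(patches)
        m = self.mask_token.expand(x.shape[0], x.shape[1], -1)
        x = torch.where(mask.unsqueeze(-1), m, x) + self.pos_embed
        return self.head(self.encoder(x))  # (B, N, VOCAB_SIZE)

def mim_loss(logits, visual_tokens, mask):
    # Cross-entropy only on masked positions: recover the original
    # visual tokens of the corrupted patches.
    return nn.functional.cross_entropy(logits[mask], visual_tokens[mask])

if __name__ == "__main__":
    model = MIMBackbone()
    patches = torch.randn(2, NUM_PATCHES, PATCH_DIM)
    # Stand-in for tokens from a frozen dVAE image tokenizer.
    tokens = torch.randint(0, VOCAB_SIZE, (2, NUM_PATCHES))
    mask = torch.rand(2, NUM_PATCHES) < 0.4
    loss = mim_loss(model(patches, mask), tokens, mask)
```

For fine-tuning, the `head` over the visual-token vocabulary would be dropped and replaced by a task layer (e.g. a classifier) on top of the pretrained encoder, as the notes above describe.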