toc: true
title: BEiT
tags: ['temp']
BEiT
- BEiT: BERT Pre-Training of Image Transformers
- self-supervised, pre-trained vision representation model
- the name stands for Bidirectional Encoder representation from Image Transformers
- proposes a masked image modeling (MIM) task to pre-train vision Transformers
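- each image has two views during pre-training: image patches and visual tokens
- image patches (e.g., 16×16 pixels): their embeddings are computed as linear projections of the flattened patches
- visual tokens: discrete tokens produced by an image tokenizer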
- the image tokenizer is a discrete VAE (dVAE), learned via autoencoding-style reconstruction
- the input image is tokenized into discrete visual tokens taken from the latent codes of the dVAE
- the proposed method is critical to making BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers
- the pre-trained model automatically acquires knowledge about semantic regions, without using any human-annotated data
- randomly masks some image patches and feeds the corrupted patch sequence into the backbone Transformer
- pre-training objective is to recover the original visual tokens based on the corrupted image patches
- after pre-training, the model parameters are fine-tuned directly on downstream tasks by appending task layers on top of the pre-trained encoder
- results on ImageNet: outperforms DeiT trained from scratch
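
The image-patch view from the notes above (flattened patches followed by a linear projection) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's code: the 224×224 input size, the embedding width of 768, and the helper names `patchify` and `embed_patches` are assumptions made for the example, and the projection weights are random rather than learned.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row over the patch grid.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, bias):
    """Linear projection of flattened patches to the embedding dimension."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input resolution
patches = patchify(image, 16)       # shape (196, 768): 14 x 14 patches

embed_dim = 768                     # hypothetical embedding width
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
embeddings = embed_patches(patches, weight, bias)

print(patches.shape, embeddings.shape)
```

In a real model the projection would be a trained layer, and position embeddings would typically be added before the Transformer blocks.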
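
The pre-training objective in the notes (mask some patches, then recover the original visual tokens at those positions) can be sketched like this. It is a simplified illustration under several assumptions: masking is uniform at random with an illustrative ratio of 0.4, masked positions are replaced by one shared mask embedding, the visual tokens are random integers standing in for real tokenizer output, and a single random linear layer stands in for the backbone Transformer plus its prediction head. All function names are hypothetical.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_patches) positions set to True."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def corrupt(embeddings, mask, mask_embedding):
    """Replace the embeddings at masked positions with a shared mask embedding."""
    return np.where(mask[:, None], mask_embedding[None, :], embeddings)


def mim_loss(logits, visual_tokens, mask):
    """Mean cross-entropy between predictions and visual tokens, masked positions only."""
    logits = logits[mask]
    targets = visual_tokens[mask]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(targets.size), targets].mean()


rng = np.random.default_rng(0)
num_patches, embed_dim, vocab_size = 196, 32, 64   # small illustrative sizes

embeddings = rng.normal(size=(num_patches, embed_dim))
# Stand-in for the tokenizer output: one token id per patch position.
visual_tokens = rng.integers(0, vocab_size, size=num_patches)

mask = random_mask(num_patches, 0.4, rng)          # 78 of 196 positions
mask_embedding = np.zeros(embed_dim)               # would be learned in practice
corrupted = corrupt(embeddings, mask, mask_embedding)

# Stand-in for backbone + prediction head. It has no attention, so it cannot
# use context from unmasked patches; it only shows the data flow.
head = rng.normal(scale=0.02, size=(embed_dim, vocab_size))
logits = corrupted @ head
loss = mim_loss(logits, visual_tokens, mask)

print(int(mask.sum()), float(loss))
```

Because the stand-in head sees only the zero mask embedding at masked positions, its predictions there are uniform and the loss equals log(vocab_size). A real backbone would attend to the unmasked patches to do better than that baseline.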