VL-BEiT
- VL-BEiT: Generative Vision-Language Pretraining
- vision-language foundation model
- simple and effective approach: a bidirectional multimodal Transformer encoder pretrained generatively for both vision-language and vision-only tasks
- conducts masked prediction on both monomodal and multimodal data with a shared Transformer
- solely employs generative pretraining tasks, including [masked language modeling](masked language modeling.md) on texts, masked image modeling on images, and masked vision-language modeling on image-text pairs
- learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, making it conceptually simple and empirically effective
- learns transferable visual features
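The three masked-prediction tasks above share one masking routine and one backbone. A minimal sketch of how the inputs could be formed, with illustrative masking ratios (15% for text, 40% blockwise-style for image patches are assumptions, not values confirmed by this note) and toy token lists standing in for real tokenizer/patch outputs:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio, rng):
    """Replace a random fraction of tokens with [MASK]; return masked sequence
    and a dict mapping masked positions to their original tokens (targets)."""
    n = max(1, int(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n))
    masked = [MASK if i in idx else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in idx}
    return masked, targets

def build_pretraining_inputs(text_tokens, image_patches, rng):
    """Build the three generative pretraining views from the note:
    - mlm:  masked language modeling on text alone
    - mim:  masked image modeling on image patch tokens alone
    - mvlm: masked vision-language modeling on the concatenated image-text pair
    All three would be fed to the same shared Transformer encoder."""
    mlm = mask_tokens(text_tokens, 0.15, rng)      # ratio is illustrative
    mim = mask_tokens(image_patches, 0.40, rng)    # ratio is illustrative
    pair = image_patches + text_tokens             # multimodal sequence
    mvlm = mask_tokens(pair, 0.15, rng)
    return {"mlm": mlm, "mim": mim, "mvlm": mvlm}

if __name__ == "__main__":
    rng = random.Random(0)
    text = [f"w{i}" for i in range(10)]
    patches = [f"p{i}" for i in range(10)]
    views = build_pretraining_inputs(text, patches, rng)
    for name, (masked, targets) in views.items():
        print(name, masked.count(MASK), len(targets))
```

Each view returns (masked sequence, targets), so a single cross-entropy loss over the masked positions serves as the one unified objective.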