simple and effective approach that pretrains a bidirectional multimodal Transformer encoder for both vision-language and vision tasks via generative pretraining
conducts masked prediction on both monomodal and multimodal data with a shared Transformer
solely employs generative pretraining tasks, including [masked language modeling](masked language modeling.md) on texts, masked image modeling on images, and masked vision-language modeling on image-text pairs
learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, which makes it conceptually simple and empirically effective
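
a minimal sketch of the shared-backbone, unified masked-prediction idea — one bidirectional encoder reused for masked language modeling (text only), masked image modeling (image only), and masked vision-language modeling (image-text pairs); all module names, vocabulary sizes, and the use of discrete visual token ids are illustrative assumptions, not the paper's implementation

```python
# Sketch only: one shared bidirectional Transformer trained with masked prediction
# on monomodal and multimodal inputs. Names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class SharedMaskedPredictor(nn.Module):
    def __init__(self, text_vocab=30522, visual_vocab=8192, dim=768, depth=12, heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, dim)     # text tokens, incl. a [MASK] id
        self.image_embed = nn.Embedding(visual_vocab, dim)  # discrete visual tokens (assumed tokenizer)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # one shared bidirectional encoder
        self.text_head = nn.Linear(dim, text_vocab)          # predicts masked text tokens
        self.image_head = nn.Linear(dim, visual_vocab)       # predicts masked visual tokens

    def forward(self, text_ids=None, image_ids=None):
        # Build the input from whichever modalities are present:
        # text only -> MLM, image only -> MIM, both -> masked vision-language modeling.
        parts = []
        if text_ids is not None:
            parts.append(self.text_embed(text_ids))
        if image_ids is not None:
            parts.append(self.image_embed(image_ids))
        x = self.backbone(torch.cat(parts, dim=1))
        n_text = text_ids.size(1) if text_ids is not None else 0
        text_logits = self.text_head(x[:, :n_text]) if text_ids is not None else None
        image_logits = self.image_head(x[:, n_text:]) if image_ids is not None else None
        return text_logits, image_logits

# usage: an image-text pair gives masked vision-language modeling;
# pass only one modality to get MLM or MIM with the same backbone
model = SharedMaskedPredictor()
text_logits, image_logits = model(
    text_ids=torch.randint(0, 30522, (2, 16)),   # masked positions already replaced by [MASK]
    image_ids=torch.randint(0, 8192, (2, 196)),  # visual token ids from an image tokenizer
)
```

the same encoder and the same masked-prediction objective are reused across all three data types; only the input sequence and the prediction head change with the modality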