Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs
Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM)
Alignment-based objectives require paired image-text data and heuristic objective functions
Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models
Pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: Masked Language Modeling (MLM) loss (as in BERT) for text, and image reconstruction (RECON) loss for images, coupled with Modality-Aware Masking (MAM)
MAM determines the masking probability and applies masking to both word and image embeddings
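A minimal sketch of the masking step, assuming independent Bernoulli masks per modality (the function name and default probabilities are illustrative assumptions, not from the paper):

```python
import numpy as np

def modality_aware_mask(n_words, n_regions, p_word=0.15, p_img=0.15, rng=None):
    # Hypothetical sketch: sample a boolean mask for word positions and one
    # for image-region positions. A real MAM policy may adapt p_word/p_img.
    if rng is None:
        rng = np.random.default_rng(0)
    word_mask = rng.random(n_words) < p_word
    img_mask = rng.random(n_regions) < p_img
    return word_mask, img_mask
```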
Based on BERT: predict the masked words from the available words and image regions
Follows BERT for this task: a two-layer MLP MLM head outputting logits over the vocabulary
MLM loss is the negative log-likelihood for the masked words
RECON loss is an average of the pixel-wise sum of squared errors (SSE)
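The two losses can be sketched as follows; the array shapes and function names are assumptions, and both losses average only over the masked positions, as the notes describe:

```python
import numpy as np

def mlm_nll(logits, targets, mask):
    # Negative log-likelihood of the correct word at masked positions only.
    # logits: (T, V) over the vocabulary, targets: (T,), mask: (T,) boolean.
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax
    nll = -logp[np.arange(len(targets)), targets]
    return nll[mask].mean()

def recon_sse(pred_pixels, true_pixels, mask):
    # Average, over masked areas, of the pixel-wise sum of squared errors.
    # pred/true: (N, D) pixel vectors per region, mask: (N,) boolean.
    sse = ((pred_pixels - true_pixels) ** 2).sum(-1)
    return sse[mask].mean()
```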
Both image and word masking are realized by replacing an embedding with the embedding of [MASK]
[MASK]'s embedding is treated as a special embedding that needs to be "filled in", independent of the modality, by attending to other vectors in the layer inputs
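Replacing masked positions with the shared [MASK] embedding might look like the following (a hypothetical sketch; shapes and names are assumptions):

```python
import numpy as np

def apply_mask_embedding(embeddings, mask, mask_embedding):
    # Overwrite every masked row (word or image position alike) with the
    # single learned [MASK] embedding; unmasked rows pass through unchanged.
    out = embeddings.copy()
    out[mask] = mask_embedding
    return out
```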
Unlike other architectures (LXMERT, UNITER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, a shallow CNN serves as the image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking-friendly
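One way such a masking-friendly shallow embedder can be sketched is as a single strided convolution, implemented below as patch extraction plus a linear projection, so each output vector depends on exactly one local patch (all sizes and names here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def shallow_cnn_embed(image, patch=16, dim=64, rng=None):
    # image: (H, W, C) with H and W divisible by `patch`.
    # Because receptive fields do not overlap, masking one output grid cell
    # hides exactly one image patch, which is what makes this embedder
    # "masking-friendly" compared to a deep CNN with global receptive fields.
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = image.shape
    w = rng.standard_normal((patch * patch * C, dim)) * 0.02  # conv kernel as a matrix
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches @ w  # (num_patches, dim) grid of region embeddings
```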
MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
no specific alignment loss
Modality-Aware Masking (MAM) is used to boost cross-modality interaction and take advantage of the MLM and RECON losses that separately capture text and image reconstruction quality
Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-modality by employing modality dropout
text-only, image-only, and image-text modes
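Modality dropout can be sketched as routing each training example to one of the three modes; the uniform sampling and function names below are assumptions:

```python
import numpy as np

def sample_modality_mode(rng=None):
    # Hypothetical sketch: pick one of the three input modes per example.
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(["text-only", "image-only", "image-text"])

def modality_dropout(text_emb, img_emb, mode):
    # Drop the unused modality's inputs according to the sampled mode.
    if mode == "text-only":
        return text_emb, None
    if mode == "image-only":
        return None, img_emb
    return text_emb, img_emb
```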
However, using RECON instead of an ITM loss offers better PR AUC
Similarly, using the ITM loss together with MLM and RECON does not change performance