MLIM

  • MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling
  • Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs
  • Typically, in addition to the [Masked Language Modeling] (MLM) loss, alignment-based objectives are used for cross-[modality] interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM)
  • Alignment-based objectives require pairings of image and text and heuristic objective functions
  • Masking policies either do not take advantage of multi-Modality or are strictly coupled with alignments generated by other models
  • pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: [Masked Language Modeling] (MLM) loss (as in BERT) for text, and image reconstruction (RECON) loss for image, coupled with [Modality] Aware Masking (MAM)
  • determines the masking probability and applies masking to both word and image Embeddings
  • based on BERT predict the masked words from available words and image regions
  • follow BERT for this task: two-layer MLP MLM head outputting Logits over the vocabulary
  • MLM loss is the negative log-likelihood of each masked word
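A minimal numpy sketch of this loss — negative log-likelihood over masked word positions only, given the MLM head's logits over the vocabulary. The function name and array shapes are illustrative assumptions, not from the paper:

```python
import numpy as np

def mlm_nll_loss(logits, target_ids, mask_positions):
    """Negative log-likelihood over masked word positions only (sketch).

    logits: (seq_len, vocab_size) unnormalized scores from the MLM head
    target_ids: (seq_len,) ground-truth token ids
    mask_positions: boolean (seq_len,) array, True where a word was masked
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # log-probability of the ground-truth token at every position
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    # average only over the masked positions
    return -picked[mask_positions].mean()
```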
  • RECON loss is an average of pixel-wise sum of squared errors (SSE)
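The RECON loss can be sketched in the same way: per-pixel sum of squared errors across channels, averaged over the masked image area (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def recon_loss(reconstructed, original, masked_region):
    """Average pixel-wise sum of squared errors over masked pixels (sketch).

    reconstructed, original: (H, W, C) float arrays
    masked_region: boolean (H, W) mask, True where the image was masked
    """
    diff = (reconstructed - original) ** 2   # squared error per pixel, per channel
    sse = diff.sum(axis=-1)                  # sum over channels -> SSE per pixel
    return sse[masked_region].mean()         # average over the masked pixels only
```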
  • Both image and word masking are realized by replacing an Embedding with the Embedding of [MASK]
  • transformer Layers recognize [MASK]’s Embedding as a special Embedding that needs to be “filled in”, independent of the Modality, by attending to other vectors in the layer inputs
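Because masking happens at the embedding level, one routine can serve both modalities: replace the selected rows with the shared [MASK] embedding. A minimal sketch (names are illustrative):

```python
import numpy as np

def apply_mask(embeddings, mask_embedding, mask_positions):
    """Replace masked rows (word or image-region embeddings alike)
    with the shared [MASK] embedding; the input is left untouched."""
    out = embeddings.copy()
    out[mask_positions] = mask_embedding
    return out
```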
  • unlike other architectures (LXMERT, UNITER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, a shallow CNN is used as the image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking-friendly
  • MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
  • no specific alignment loss
  • [Modality] Aware Masking (MAM) to boost cross-[modality] interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality
  • Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-Modality by employing Modality Dropout
  • text-only, image-only, and image-text mode
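A hypothetical sketch of Modality Dropout sampling one of these three input modes per example; the mode probabilities below are placeholders, not the paper's values:

```python
import numpy as np

def modality_dropout(rng, p_text_only=0.1, p_image_only=0.1):
    """Sample one of three input modes for an example (illustrative
    probabilities; the remaining mass keeps both modalities)."""
    u = rng.random()
    if u < p_text_only:
        return "text-only"    # drop the image input entirely
    if u < p_text_only + p_image_only:
        return "image-only"   # drop the text input entirely
    return "image-text"       # keep both modalities
```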
  • However, using RECON instead of the ITM loss offers better PR AUC
  • Similarly, using the ITM loss together with MLM and RECON does not change the performance