ViLT

  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
  • A vision-and-language pre-training (VLP) model that seeks to improve performance on various joint vision-and-language downstream tasks
  • Current approaches to VLP rely heavily on image feature extraction pipelines built on convolutional visual embedding networks, which involve region supervision (e.g., object detection with Faster R-CNN) on top of a convolutional backbone (e.g., ResNet)
  • This is problematic in terms of both efficiency/speed, since extracting input features requires much more computation than the multimodal interaction steps, and expressive power, since the model is upper bounded by the expressive power of the visual embedder and its predefined visual vocabulary.
  • ViLT is a minimal VLP model, monolithic in that the processing of visual inputs is drastically simplified to the same convolution-free manner used for textual inputs
  • removing the need for object detectors
  • avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer linear projection, achieving similar results with reduced complexity
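The single-layer visual embedding can be sketched as a plain linear projection of flattened image patches. This is a minimal NumPy sketch: the 32×32 patch size and 768-d hidden size follow ViLT's ViT backbone, but the function name, random weights, and image size are illustrative assumptions.

```python
import numpy as np

PATCH = 32    # patch side length (ViLT uses 32x32 patches)
HIDDEN = 768  # transformer hidden size

def embed_image(img, W, b):
    """Embed an image with a single linear projection of flattened
    patches -- no CNN backbone, no region proposals."""
    H, Wd, C = img.shape
    # Cut the image into non-overlapping PATCH x PATCH tiles and flatten each.
    tiles = (img.reshape(H // PATCH, PATCH, Wd // PATCH, PATCH, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, PATCH * PATCH * C))
    # One matrix multiply replaces the heavyweight visual embedder.
    return tiles @ W + b

rng = np.random.default_rng(0)
img = rng.random((384, 384, 3)).astype(np.float32)   # toy input image
W = rng.standard_normal((PATCH * PATCH * 3, HIDDEN)).astype(np.float32) * 0.02
b = np.zeros(HIDDEN, dtype=np.float32)

tokens = embed_image(img, W, b)
print(tokens.shape)  # (144, 768): 12x12 patch tokens for the transformer
```

Because this embedding is a single matrix multiply, its cost is negligible next to the transformer's multimodal interaction layers, which is the point of the design.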
  • Self-supervision is accomplished using (i) an Image Text Matching (ITM) loss and (ii) a Masked Language Modeling (MLM) loss
  • ITM loss: with probability 0.5, the aligned image is replaced with a different, randomly chosen image, and a head on the pooled output predicts whether the image-text pair matches
  • For text, ViLT simply reuses Masked Language Modeling (MLM), as used in BERT.
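The ITM sampling step can be sketched as follows. This is a toy sketch: `make_itm_batch` and the `(image_id, caption)` tuples are illustrative stand-ins, not ViLT's actual data pipeline.

```python
import random

def make_itm_batch(pairs, p_replace=0.5, seed=0):
    """Build Image-Text Matching examples: with probability p_replace
    the aligned image is swapped for a random other image and the pair
    is labeled 0 (mismatch); otherwise it is labeled 1 (match).

    `pairs` is a list of (image_id, caption) tuples -- a toy stand-in
    for real image features and tokenized captions."""
    rng = random.Random(seed)
    images = [img for img, _ in pairs]
    batch = []
    for img, cap in pairs:
        if rng.random() < p_replace:
            other = rng.choice([i for i in images if i != img])
            batch.append((other, cap, 0))   # negative (mismatched) pair
        else:
            batch.append((img, cap, 1))     # positive (aligned) pair
    return batch

pairs = [("img0", "a dog"), ("img1", "a cat"), ("img2", "a red car")]
for img, cap, label in make_itm_batch(pairs):
    print(img, cap, label)
```

The binary match/mismatch labels produced here are what the ITM head is trained against with a cross-entropy loss.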
  • Pre-training datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions
  • Downstream evaluation: VQAv2, NLVR2, and Flickr30K (image-text retrieval)
  • ViLT is over 10× faster than previous VLP models, yet achieves competitive or better downstream task performance
  • Takeaway: VLP should focus more on the multimodal interactions inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders