Vision-and-Language Pre-training (VLP) seeks to improve performance on various joint vision-and-language downstream tasks
Current approaches to VLP rely heavily on image feature extraction using convolutional visual embedding networks, which involve region supervision (e.g., object detection with Faster R-CNN) and convolutional architectures (e.g., ResNet)
This is problematic in terms of both efficiency/speed, in that extracting input features requires much more computation than the multimodal interaction steps, and expressive power, as the model is upper-bounded by the expressive power of the visual embedder and its predefined visual vocabulary.
ViLT is a minimal VLP model, monolithic in that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed
removing the need for object detectors
avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer projection, achieving similar results with reduced complexity
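The single-layer projection above can be sketched as follows: the image is split into non-overlapping patches, each patch is flattened, and one linear layer maps it to the transformer's hidden dimension. This is a minimal NumPy sketch under assumed values (32x32 patches, 768-dim hidden size); `patch_embed` and the weight shapes are illustrative, not the paper's actual code.

```python
import numpy as np

def patch_embed(image, weight, bias, patch=32):
    """Flatten non-overlapping patches, then apply one linear projection.

    image:  (H, W, C) pixel array
    weight: (patch*patch*C, hidden_dim) projection matrix
    bias:   (hidden_dim,) projection bias
    returns (num_patches, hidden_dim) token embeddings
    """
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> flatten
    patches = (image.reshape(gh, patch, gw, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, patch * patch * C))
    return patches @ weight + bias

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))          # a 224x224 RGB image
W_proj = rng.standard_normal((32 * 32 * 3, 768)) * 0.02
b_proj = np.zeros(768)
tokens = patch_embed(img, W_proj, b_proj)
print(tokens.shape)  # (49, 768): a 7x7 grid of patch tokens
```

These patch tokens are then concatenated with the word embeddings and fed to a single transformer, which is what makes the visual path as cheap as the textual one.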
Self-supervision is accomplished using (i) an Image-Text Matching (ITM) loss and (ii) a Masked Language Modeling (MLM) loss
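The two objectives can be sketched in NumPy: ITM is binary cross-entropy over matched/mismatched image-text pairs, and MLM is cross-entropy computed only at masked token positions. Function names, shapes, and the 15% masking rate shown are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def itm_loss(logits, labels):
    """Image-Text Matching: binary classification per pair.
    logits: (B, 2); labels: (B,) with 1 = matched pair, 0 = mismatched."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def mlm_loss(logits, targets, mask):
    """Masked Language Modeling: cross-entropy over masked positions only.
    logits: (B, T, V); targets: (B, T); mask: (B, T) boolean, True = masked."""
    B, T, _ = logits.shape
    p = softmax(logits)
    tok_ll = np.log(p[np.arange(B)[:, None], np.arange(T)[None, :], targets])
    return -(tok_ll * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
itm = itm_loss(rng.standard_normal((4, 2)), np.array([1, 0, 1, 1]))
mask = np.array([[True, False, False, True, False],   # ~15-30% of tokens masked
                 [False, True, False, False, False]])
mlm = mlm_loss(rng.standard_normal((2, 5, 100)),
               rng.integers(0, 100, (2, 5)), mask)
print(float(itm), float(mlm))
```

In training, the total self-supervised objective is simply the sum of the two losses over each batch of image-text pairs.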
ViLT is over 10x faster than previous VLP models, yet with competitive or better downstream task performance
VLP needs to focus more on the multimodal interactions inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders