ViLT

  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
  • A vision-and-language pre-training (VLP) model that seeks to improve performance on various joint vision-and-language downstream tasks
  • Current approaches to VLP rely heavily on image feature extraction pipelines built on convolutional visual embedding networks, which involve region supervision (e.g., object detection with Faster R-CNN) on top of a convolutional backbone (e.g., ResNet)
  • This is problematic in terms of both efficiency/speed, since extracting input features requires much more computation than the multimodal interaction steps, and expressive power, since the model is upper bounded by the expressive power of the visual embedder and its predefined visual vocabulary.
  • ViLT is a minimal VLP model, monolithic in that the processing of visual inputs is drastically simplified to the same convolution-free manner used for textual inputs
  • removing the need for object detectors
  • avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer linear projection, achieving similar results with reduced complexity
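The single-layer visual embedding can be sketched as a plain linear projection of flattened image patches. This is a minimal NumPy sketch: the 32×32 patch size and 768-d hidden size follow ViLT's ViT backbone, but the function name, random weights, and image size are illustrative assumptions.

```python
import numpy as np

PATCH = 32    # patch side length (ViLT uses 32x32 patches)
HIDDEN = 768  # transformer hidden size

def embed_image(img, W, b):
    """Embed an image with a single linear projection of flattened
    patches -- no CNN backbone, no region proposals."""
    H, Wd, C = img.shape
    # Cut the image into non-overlapping PATCH x PATCH tiles and flatten each.
    tiles = (img.reshape(H // PATCH, PATCH, Wd // PATCH, PATCH, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, PATCH * PATCH * C))
    # One matrix multiply replaces the heavyweight visual embedder.
    return tiles @ W + b

rng = np.random.default_rng(0)
img = rng.random((384, 384, 3)).astype(np.float32)   # toy input image
W = rng.standard_normal((PATCH * PATCH * 3, HIDDEN)).astype(np.float32) * 0.02
b = np.zeros(HIDDEN, dtype=np.float32)

tokens = embed_image(img, W, b)
print(tokens.shape)  # (144, 768): 12x12 patch tokens for the transformer
```

Because this embedding is a single matrix multiply, its cost is negligible next to the transformer's multimodal interaction layers, which is the point of the design.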
  • Self-supervision is accomplished using (i) an Image Text Matching (ITM) loss and (ii) a Masked Language Modeling (MLM) loss
  • ITM loss: with probability 0.5, the aligned image is replaced with a different, randomly chosen image, and a head on the pooled output predicts whether the image-text pair matches
  • For text, ViLT simply reuses Masked Language Modeling (MLM), as used in BERT.
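The ITM sampling step can be sketched as follows. This is a toy sketch: `make_itm_batch` and the `(image_id, caption)` tuples are illustrative stand-ins, not ViLT's actual data pipeline.

```python
import random

def make_itm_batch(pairs, p_replace=0.5, seed=0):
    """Build Image-Text Matching examples: with probability p_replace
    the aligned image is swapped for a random other image and the pair
    is labeled 0 (mismatch); otherwise it is labeled 1 (match).

    `pairs` is a list of (image_id, caption) tuples -- a toy stand-in
    for real image features and tokenized captions."""
    rng = random.Random(seed)
    images = [img for img, _ in pairs]
    batch = []
    for img, cap in pairs:
        if rng.random() < p_replace:
            other = rng.choice([i for i in images if i != img])
            batch.append((other, cap, 0))   # negative (mismatched) pair
        else:
            batch.append((img, cap, 1))     # positive (aligned) pair
    return batch

pairs = [("img0", "a dog"), ("img1", "a cat"), ("img2", "a red car")]
for img, cap, label in make_itm_batch(pairs):
    print(img, cap, label)
```

The binary match/mismatch labels produced here are what the ITM head is trained against with a cross-entropy loss.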
  • Pre-training datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions
  • Downstream evaluation: VQAv2, NLVR2, and Flickr30K (image-text retrieval)
  • ViLT is over 10× faster than previous VLP models, yet achieves competitive or better downstream task performance
  • Takeaway: VLP should focus more on the multimodal interactions inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders