Human intelligence is multimodal; people integrate visual, linguistic, and acoustic signals to maintain a holistic worldview
Most current pretraining methods, however, are limited to one or two modalities.
jointly learns representations for vision, language, and speech in a unified, shared, general-purpose vector space
data from each modality are first given to a pretrained single-modality encoder
The encoder outputs are then integrated by a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities
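The two-stage pipeline (per-modality encoders, then a shared fusion network) can be sketched as follows. This is a minimal numpy illustration: the linear "encoders", the token counts, and the feature dimensions are all hypothetical stand-ins, not the actual pretrained components.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    # Stand-in for a pretrained single-modality encoder: a linear map
    # into a shared output dimension.
    return x @ w

d_out = 8  # shared embedding dimension (illustrative)
inputs = {
    "vision": rng.normal(size=(5, 16)),    # 5 tokens, 16-dim raw features
    "language": rng.normal(size=(7, 12)),  # 7 tokens, 12-dim raw features
    "speech": rng.normal(size=(9, 20)),    # 9 tokens, 20-dim raw features
}
weights = {m: rng.normal(size=(x.shape[1], d_out)) for m, x in inputs.items()}

# Stage 1: each modality passes through its own pretrained encoder.
encoded = {m: encoder(x, weights[m]) for m, x in inputs.items()}

# Stage 2: the fusion network consumes the concatenated token sequences.
fused_input = np.concatenate(list(encoded.values()), axis=0)
print(fused_input.shape)  # (21, 8): 5 + 7 + 9 tokens, all in the shared space
```

The key point the sketch shows is that heterogeneous input dimensions are reconciled before fusion, so the fusion network only ever sees tokens in one shared space.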
new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning
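The cross-modality contrastive objective can be sketched with a symmetric InfoNCE-style loss, where pooled embeddings from two modalities of the same sample are pulled together and mismatched pairs are pushed apart. This is an assumed, generic formulation in numpy, not the paper's exact loss; the temperature value is illustrative.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss between paired embeddings of two
    modalities. a, b: (batch, dim); row i of a is paired with row i of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) cosine similarities
    # log-softmax over each direction; matched pairs sit on the diagonal
    log_sm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(np.diag(log_sm_ab).mean() + np.diag(log_sm_ba).mean()) / 2

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))
aligned = info_nce(emb, emb)          # perfectly paired embeddings
shuffled = info_nce(emb, emb[::-1])   # deliberately mismatched pairing
print(aligned < shuffled)  # aligned pairs yield the lower loss
```

Masked modality modeling, the other objective, follows the familiar masked-prediction recipe (hide a span of one modality's tokens and reconstruct it from the fused context), so it is not repeated here.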
pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to
dynamically process single-, dual-, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space
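The "any combination of modalities" property can be sketched as a fusion step that accepts whatever subset of encoded modalities is present and always emits a vector in the same shared space. Mean-pooling over concatenated tokens is a hypothetical stand-in for the real fusion network; the point is only that the interface is subset-agnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared embedding dimension (illustrative)

def fuse(encoded):
    """Project whichever modalities are present into one d-dim vector by
    mean-pooling their concatenated tokens. Any non-empty subset works."""
    tokens = np.concatenate(list(encoded.values()), axis=0)
    return tokens.mean(axis=0)

vision = rng.normal(size=(5, d))
language = rng.normal(size=(7, d))
speech = rng.normal(size=(9, d))

# single-, dual-, and triple-modality inputs all land in the same space
combos = [
    {"vision": vision},
    {"vision": vision, "language": language},
    {"vision": vision, "language": language, "speech": speech},
]
for combo in combos:
    print(fuse(combo).shape)  # (8,) every time, regardless of the subset
```

Because every combination maps into the same space, the same downstream head can be trained and evaluated on any mixture of modality availabilities.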
mechanisms that merge and cross the attention scores of different modalities, namely merge-attention (based on self-attention) and co-attention (based on self- and cross-attention), respectively
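The two fusion variants can be contrasted in a minimal single-head, numpy-only sketch. Projections, multi-head structure, and residual/normalization layers are omitted, and the sequence lengths are made up, so this shows only the routing difference: merge-attention runs one self-attention pass over the concatenated sequence, while co-attention lets one modality attend to itself and then query the others.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[1])
    return softmax(scores) @ v

def merge_attention(seqs):
    """Merge-attention: concatenate all modality sequences and run a single
    self-attention pass over the merged sequence."""
    x = np.concatenate(seqs, axis=0)
    return attention(x, x, x)

def co_attention(x, others):
    """Co-attention: self-attention within one modality, then cross-attention
    where that modality's tokens query the other modalities' tokens."""
    x = attention(x, x, x)                 # self-attention
    ctx = np.concatenate(others, axis=0)
    return attention(x, ctx, ctx)          # cross-attention

rng = np.random.default_rng(0)
v, l, s = (rng.normal(size=(n, 8)) for n in (5, 7, 9))
merged = merge_attention([v, l, s])   # every token attends to all 21 tokens
crossed = co_attention(v, [l, s])     # vision tokens query language + speech
print(merged.shape, crossed.shape)    # (21, 8) (5, 8)
```

The design trade-off the sketch exposes: merge-attention gives all tokens a shared attention pool in one pass, while co-attention keeps per-modality streams and makes cross-modal interaction an explicit, directed step.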