i-Code

  • i-Code: an Integrative and Composable Multimodal Learning Framework
  • Human intelligence is multimodal: humans integrate visual, linguistic, and acoustic signals to maintain a holistic worldview
  • Most current pretraining methods, however, are limited to one or two modalities.
  • jointly learns vision, language, and speech representations, projecting them into a unified, shared, general-purpose vector space
  • data from each modality are first given to a pretrained single-modality encoder
  • The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities
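A minimal sketch of this two-stage design — per-modality encoders projecting into one shared space, then a fusion step over whatever modalities are present. All names, dimensions, and the linear "encoders" are illustrative stand-ins, not i-Code's actual pretrained encoders or fusion transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared hidden size (illustrative)

# Stand-ins for pretrained single-modality encoders: each maps raw
# features of its own width into the shared D-dimensional space.
def make_encoder(in_dim, out_dim):
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

encoders = {
    "vision":   make_encoder(16, D),
    "language": make_encoder(12, D),
    "speech":   make_encoder(10, D),
}

def fuse(inputs):
    """Project whichever modalities are present into the shared space and
    concatenate along the sequence axis (a placeholder for the real
    multimodal fusion network)."""
    seqs = [encoders[m](x) for m, x in inputs.items()]
    return np.concatenate(seqs, axis=0)

# Accepts single-, dual-, or triple-modality input alike.
fused = fuse({"vision": rng.standard_normal((4, 16)),
              "language": rng.standard_normal((6, 12))})
print(fused.shape)  # (10, 8)
```

Because `fuse` just iterates over the modalities it receives, the same code path handles any combination — the property the paper describes as flexibly projecting different modality combinations into a single representation space.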
  • new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning
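The cross-modality contrastive objective can be sketched as a symmetric InfoNCE-style loss: embeddings of an aligned pair from two modalities are pulled together while mismatched pairs in the batch are pushed apart. This is a generic sketch under that assumption (function names and the temperature value are illustrative), not i-Code's exact loss:

```python
import numpy as np

def log_softmax(x):
    # numerically stable row-wise log-softmax
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def cross_modal_contrastive_loss(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss over two modalities' embeddings.
    a[i] and b[i] are an aligned pair; all other rows act as negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature            # (N, N) similarity matrix
    loss_a2b = -np.mean(np.diag(log_softmax(logits)))
    loss_b2a = -np.mean(np.diag(log_softmax(logits.T)))
    return 0.5 * (loss_a2b + loss_b2a)
```

Perfectly aligned embeddings drive the loss toward zero; shuffling one side's rows (breaking the pairing) makes it large — the signal that teaches the fusion network which cross-modal inputs belong together.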
  • pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to
  • dynamically process single-, dual-, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space
  • GLUE
  • (a) merge-attention layers and (b) co-attention layers
  • fusion architecture
  • mechanisms that merge and cross the attention scores of different modalities: merge-attention (based on self-attention) and co-attention (based on self- and cross-attention), respectively
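The two fusion mechanisms above can be sketched with plain scaled dot-product attention. This is a bare-bones illustration of the merge vs. co-attention distinction only — real layers would add learned Q/K/V projections, multiple heads, residuals, and layer norm, which are omitted here:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention (no learned projections)."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def merge_attention(seqs):
    """Merge-attention flavour: concatenate all modality sequences and
    self-attend, so every position sees every modality."""
    x = np.concatenate(seqs, axis=0)
    return attention(x, x, x)

def co_attention(query_seq, other_seqs):
    """Co-attention flavour: one modality self-attends, then
    cross-attends to the other modalities' sequences."""
    h = attention(query_seq, query_seq, query_seq)   # self-attention
    ctx = np.concatenate(other_seqs, axis=0)
    return attention(h, ctx, ctx)                    # cross-attention
```

Merge-attention mixes all modalities in one attention pass over the merged sequence, while co-attention keeps each modality's stream separate and injects the others through a cross-attention step.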