VisualGPT

  • image captioning model
  • leverages knowledge from the pretrained language model GPT-2
  • bridges the semantic gap between different modalities with a novel encoder-decoder attention mechanism [33] built around an unsaturated rectified gating function
  • its biggest advantage is that it needs less training data than other image-to-text models
  • improving data efficiency in image captioning networks would enable quick data curation, description of rare objects, and applications in specialized domains
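
The gating idea in the notes above can be illustrated with a small sketch. It is only one plausible reading of an "unsaturated rectified gating function": two complementary sigmoid gates, each zeroed when it falls below a threshold, mix a visual feature with a language feature. The names `rectified_gates` and `fuse` and the threshold `tau = 0.2` are assumptions made for illustration, not taken from the notes, and the exact formulation in VisualGPT may differ.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def rectified_gates(h, tau=0.2):
    """Return (visual_gate, language_gate) for one pre-activation h.

    Each gate keeps its sigmoid value only when it exceeds tau (an assumed
    threshold); otherwise it is set to exactly zero. The two gates are
    complementary before rectification (s and 1 - s).
    """
    s = sigmoid(h)
    visual_gate = s if s > tau else 0.0
    language_gate = (1.0 - s) if (1.0 - s) > tau else 0.0
    return visual_gate, language_gate


def fuse(visual, language, pre_activations, tau=0.2):
    """Element-wise gated mix of visual and language feature vectors."""
    fused = []
    for v, l, h in zip(visual, language, pre_activations):
        g_vis, g_lan = rectified_gates(h, tau)
        fused.append(g_vis * v + g_lan * l)
    return fused


if __name__ == "__main__":
    # Balanced pre-activation: both gates pass half of their input.
    print(fuse([1.0], [2.0], [0.0]))
    # Strongly positive pre-activation: the language gate is cut to zero.
    print(rectified_gates(10.0))
```

With a balanced pre-activation both modalities contribute. With a strongly positive or negative one, the weaker gate is switched off completely instead of leaking a small value, which is the behaviour this sketch is meant to show.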