leverages knowledge from the pretrained language model GPT-2
bridges the semantic gap between different modalities - a novel encoder-decoder attention mechanism [33] is designed with an unsaturated
rectified gating function
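The gated cross-attention idea above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the gate form (a sigmoid hard-thresholded at `tau` so that small activations are zeroed rather than saturating), the dot-product gate scores, and all function names are assumptions for demonstration.

```python
import numpy as np

def rectified_gate(scores, tau=0.2):
    # Assumed "unsaturated rectified" gate: a sigmoid whose small
    # outputs are hard-zeroed at threshold tau instead of saturating.
    g = 1.0 / (1.0 + np.exp(-scores))
    return np.where(g > tau, g, 0.0)

def gated_cross_attention(h_text, h_visual, tau=0.2):
    # h_text: (T, d) decoder states; h_visual: (V, d) visual features.
    d = h_text.shape[-1]
    logits = h_text @ h_visual.T / np.sqrt(d)          # (T, V)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # softmax over visual tokens
    visual_ctx = w @ h_visual                           # (T, d) attended context

    # Complementary gates balance visual vs. linguistic information
    # at each decoding position (hypothetical scoring scheme).
    gate_scores = (h_text * visual_ctx).sum(-1, keepdims=True)
    b_vis = rectified_gate(gate_scores, tau)
    b_lang = rectified_gate(-gate_scores, tau)
    return b_vis * visual_ctx + b_lang * h_text
```

Because the gate is rectified rather than purely sigmoidal, a modality whose score falls below the threshold contributes exactly zero, which keeps the two information streams from blurring together.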
the biggest advantage of this model is that it does not need as much training data as other image-to-text models
improving data efficiency in image captioning networks would enable quick data curation, description of rare objects, and applications in specialized domains