Phenaki

  • capable of realistic video synthesis given a sequence of textual prompts
  • Phenaki is the first model that can generate videos from open-domain, time-variable prompts
  • to address data scarcity, it is trained jointly on a large image-text dataset and a smaller number of video-text examples; this yields generalization beyond what the video datasets alone contain
  • image-text datasets contain billions of pairs
  • the main limitation is the computational cost of generating videos of variable length
  • the model has three components: the C-ViViT encoder, the training transformer, and the video generator
  • the C-ViViT encoder produces a compressed representation of videos as discrete tokens
  • on the decoder side, tokens are first mapped back to embeddings
  • these embeddings pass through the temporal transformer, then the spatial transformer
  • after the spatial transformer, a single linear projection without activation maps the tokens back to pixel space
  • consequently, the model generates temporally coherent and diverse videos conditioned on open-domain prompts, even when a prompt is a novel composition of concepts
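The token-to-pixel path described in the bullets (token embeddings → temporal transformer → spatial transformer → single linear projection, no activation) can be sketched in PyTorch. This is a minimal illustration, not Phenaki's implementation: the class name, layer sizes, and single-layer transformers are assumptions, and the real C-ViViT uses causal temporal attention and much larger models.

```python
import torch
import torch.nn as nn

class CViViTDecoderSketch(nn.Module):
    """Toy sketch of the decode path: tokens -> embeddings ->
    temporal transformer -> spatial transformer -> linear to pixels.
    All dimensions here are illustrative, not the paper's."""

    def __init__(self, vocab_size=8192, dim=64, patch_dim=48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=1)
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=1)
        # single linear projection without activation, back to pixel space
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, tokens):
        # tokens: (B, T, P) integer ids, T frames x P patches per frame
        b, t, p = tokens.shape
        x = self.embed(tokens)                            # (B, T, P, D)
        # temporal attention: each patch position attends across frames
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, -1)
        x = self.temporal(x)
        # spatial attention: each frame attends across its patches
        x = x.reshape(b, p, t, -1).permute(0, 2, 1, 3).reshape(b * t, p, -1)
        x = self.spatial(x)
        x = self.to_pixels(x)                             # no activation
        return x.reshape(b, t, p, -1)                     # (B, T, P, patch_dim)
```

Running a batch of random token ids through the sketch shows the shape contract: each of the `T x P` tokens is decoded into a flat pixel patch.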
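The joint image-text/video-text training idea above can be sketched by treating a still image as a one-frame video, so both data types flow through the same tokenizer and loss. `to_video` is a hypothetical helper for illustration, not Phenaki's actual code.

```python
import torch

def to_video(batch):
    """Promote an image batch (B, C, H, W) to a one-frame video
    batch (B, C, 1, H, W), so image-text and video-text examples
    can share one training pipeline. Illustrative helper only."""
    if batch.dim() == 4:             # image batch
        batch = batch.unsqueeze(2)   # insert a time axis of length 1
    return batch                     # video batches pass through unchanged

image_batch = torch.zeros(2, 3, 64, 64)       # B, C, H, W
video_batch = torch.zeros(2, 3, 8, 64, 64)    # B, C, T, H, W
```

With this normalization, a training step can draw from the large image-text dataset or the smaller video-text dataset interchangeably, which is the mechanism behind the generalization claim in the notes.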