Scene-Based Text-to-Image Generation

  • Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
  • Task: text-to-image generation
  • enabling a simple control mechanism, complementary to text, in the form of a scene (a segmentation map)
  • introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions
  • adapting classifier-free guidance for the transformer use case
  • The authors aim to move text-to-image generation toward a more interactive experience in which people have more control over the generated outputs, enabling real-world applications such as storytelling
  • The method focuses on improving image aspects that matter most to human perception, such as faces and salient objects, which leads to higher preference for it in human evaluations and better scores on objective metrics
  • Through scene controllability, they introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation
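
The classifier-free guidance point above can be illustrated with a small sketch. It assumes the commonly described formulation for autoregressive transformers: at each decoding step the model is run twice, once with the text prompt and once with an empty prompt, and the two sets of next-token logits are blended before sampling. The function names, the toy logits, and the guidance scale are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional next-token logits.

    Assumed formulation: uncond + scale * (cond - uncond).
    scale = 1 reproduces the conditional logits; scale > 1 pushes the
    distribution further toward tokens favored by the text prompt.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]    # model conditioned on the text prompt
uncond = [1.0, 1.0, 1.0]  # model conditioned on an empty prompt

mixed = guided_logits(cond, uncond, scale=3.0)
probs = softmax(mixed)  # distribution the next image token would be sampled from
```

In this toy case the token that the text-conditioned stream prefers receives a larger share of the probability mass after guidance than it does under the conditional logits alone, which is the intended effect of trading diversity for prompt fidelity.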