Imagen

  • better top-1 accuracy on ImageNet than EfficientNet at similar latency
  • Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  • text-to-image diffusion model
  • builds on the power of large Transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation
  • Imagen produces samples with unprecedented photorealism and alignment with text
  • generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis
  • increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model
  • state-of-the-art FID score of 7.27 on COCO, without training on COCO
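
The FID score (Fréchet Inception Distance, lower is better) mentioned above compares the statistics of generated and real image features. A minimal sketch of the underlying Fréchet distance between two Gaussians fitted to feature sets follows. It assumes the features have already been extracted (in practice they usually come from an Inception-v3 network, which is not shown here); the function name `frechet_distance` and the random arrays standing in for features are illustrative assumptions, not part of Imagen or of any particular library.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Hypothetical helper for illustration only.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (up to numerical noise, hence real part + clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Stand-in "features": random vectors instead of real Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = real + 0.5  # same spread, shifted mean
print(frechet_distance(real, real))  # close to 0 for identical sets
print(frechet_distance(real, fake))  # grows with the shift in the mean
```

Identical feature sets give a distance near zero, and the value grows as the means or covariances drift apart, which is why a lower FID is read as generated images being statistically closer to real ones.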