Imagen builds on the power of large Transformer language models in understanding text, and hinges on the strength of diffusion models in high-fidelity image generation.
Imagen produces 1024×1024 samples with unprecedented photorealism and image-text alignment.
Generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis.
Increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment far more than increasing the size of the image diffusion model.
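The pipeline these findings imply, a frozen text-only encoder whose embeddings condition a diffusion denoiser, can be caricatured in a toy sketch. Everything below is illustrative: the encoder stub, the denoiser arithmetic, and all dimensions are hypothetical stand-ins, not Imagen's actual implementation.

```python
import hashlib

EMBED_DIM = 8  # toy size; assumption only, real encoders are far larger

def frozen_text_encoder(prompt: str) -> list[list[float]]:
    # Stub standing in for a frozen pretrained LM (e.g. T5):
    # maps each token to a deterministic pseudo-embedding.
    # Its weights are never updated during diffusion training.
    embs = []
    for tok in prompt.split():
        digest = hashlib.sha256(tok.encode()).digest()
        embs.append([b / 255.0 for b in digest[:EMBED_DIM]])
    return embs

def denoise_step(noisy_pixels: list[float],
                 text_embs: list[list[float]], t: int) -> list[float]:
    # Toy "conditioned denoiser": nudges each pixel toward a scalar
    # summary of the text embeddings, more strongly at later steps.
    cond = sum(sum(e) / len(e) for e in text_embs) / len(text_embs)
    return [p + (cond - p) / (t + 1) for p in noisy_pixels]

def sample(prompt: str, steps: int = 4, n_pixels: int = 16) -> list[float]:
    embs = frozen_text_encoder(prompt)  # text encoded once, up front
    x = [0.5] * n_pixels                # stand-in for Gaussian noise
    for t in reversed(range(steps)):    # iterative denoising loop
        x = denoise_step(x, embs, t)
    return x
```

The point of the sketch is the division of labor: language understanding lives entirely in the frozen encoder, so scaling that encoder (the finding above) improves conditioning without touching the image model.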