StackGAN

:::section{.abstract}

Overview

The GAN architecture is one of the most successful approaches to image generation. The StackGAN architecture addresses some of the flaws of basic GANs by decomposing the task of generating images into multiple stages. This article focuses on the training paradigm proposed by StackGAN and takes an in-depth look at its architecture.

:::

:::section{.scope}

Scope

  • The scope of this article is to provide an introduction to StackGANs.
  • The article focuses on understanding how a StackGAN works, how it differs from other GANs, and where we can use it.
  • Once these concepts are understood, the architecture is explained in detail with all the different stages and how they are linked.

:::

:::section{.main}

Introduction

Generating novel photorealistic images is an important computer vision task in many fields, such as photo editing, design, and other graphics-related work. Many attempts to generate high-resolution images have been made in the past, and StackGAN is one of the major ones. Instead of performing the generation task in one go, as most existing architectures do, a StackGAN uses two separate GANs. The authors of StackGAN made this architectural choice to replicate how a human artist paints a picture: the artist first starts with a rough sketch and a colour blockout, then refines the details of the sketch, adding more information based on the description of what they want to paint. This article looks at StackGAN and how and why it works. The multi-stage architecture and the loss functions are also explained, along with the need for such a modification to the GAN paradigm.

Some images generated by StackGAN from their text descriptions are shown below.

[Image 1: Example images generated by StackGAN from text descriptions]

What is a StackGAN?

The StackGAN is a multi-modal network that produces higher-quality images than many other networks by first generating a low-resolution image and then refining it to increase its resolution. A StackGAN is a modification of the general GAN training paradigm where the task of generating new objects is split into sub-tasks. This split makes training the network much easier. The StackGAN research paper also introduces a technique called Conditioning Augmentation that produces better results.

Prerequisites

Before understanding the StackGAN architecture, we need to understand the concept of Conditional GANs (CGANs). In a CGAN, the Generator and Discriminator are given conditioning variables alongside their usual inputs, which enables the Generator to create images influenced by these variables. This conditioning is formulated as $G(z, c)$ for the Generator and $D(x, c)$ for the Discriminator, respectively, where $z$ is a noise vector, $x$ is an image, and $c$ is the conditioning variable.
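To make the idea concrete, here is a minimal CGAN sketch in PyTorch. All layer sizes and module names are illustrative assumptions, not values from the StackGAN paper; the only point is that the condition $c$ is concatenated with the Generator's noise input and with the Discriminator's image input.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy CGAN Generator: G(z, c) depends on both noise and condition."""
    def __init__(self, z_dim=100, c_dim=10, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, c):
        # Concatenate the noise vector with the conditioning variable
        return self.net(torch.cat([z, c], dim=1))

class ConditionalDiscriminator(nn.Module):
    """Toy CGAN Discriminator: D(x, c) scores an image given the condition."""
    def __init__(self, c_dim=10, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + c_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, c):
        # Concatenate the (flattened) image with the conditioning variable
        return self.net(torch.cat([x, c], dim=1))
```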

Architecture

The StackGAN comprises two parts: a Stage I and a Stage II GAN. The first stage generates a low-resolution image by “sketching” a primitive shape and colouring it with a simple colour blockout based on the text description provided. The background is generated from random noise. The second stage corrects defects in the output of the first stage by re-reading the provided description and completing the details the first stage missed. The output of the second stage is thus a high-resolution image.

[Image 2: The StackGAN two-stage architecture]
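At a high level, the two stages chain together as in the sketch below. The stand-in lambdas only illustrate the data flow and tensor shapes; the real Stage I and Stage II Generators are described in the following sections.

```python
import torch

# Stand-in modules purely to illustrate the data flow between the stages;
# the actual Generators are sketched in the sections below.
stage1_G = lambda z, c: torch.zeros(z.size(0), 3, 64, 64)      # rough 64x64 sketch
stage2_G = lambda s0, c: torch.zeros(s0.size(0), 3, 256, 256)  # refined 256x256 image

z = torch.randn(1, 100)       # noise supplies the background and randomness
c_hat = torch.randn(1, 128)   # conditioning variables from the text embedding

low_res = stage1_G(z, c_hat)          # Stage I: primitive shape and colours
high_res = stage2_G(low_res, c_hat)   # Stage II: re-reads the description, adds detail
```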

Stage I GAN

The Stage I GAN focuses on generating a rough sketch with simple colours from the text description.

Architecture

The text embedding is first fed into a fully connected (FC) layer to produce the conditioning variables. These are concatenated with a noise vector and passed to the Generator, which up-samples them into a low-resolution image. The Discriminator compresses the text embedding to a smaller representation using another FC layer. The input image is passed through a series of down-sampling blocks until it reaches a size the network can use. The down-sampled image features are combined with the compressed text embedding and passed through a 1x1 convolutional layer. The final FC layer returns the probability of the image being real or fake.
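The Discriminator's data flow might look roughly like the following sketch. The channel counts, image sizes, and module names are illustrative guesses rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Stage1Discriminator(nn.Module):
    """Sketch of the Stage I Discriminator described above (sizes assumed)."""
    def __init__(self, embed_dim=1024, compressed_dim=128):
        super().__init__()
        # FC layer that compresses the text embedding to a smaller representation
        self.embed_fc = nn.Linear(embed_dim, compressed_dim)
        # Down-sampling blocks: 64x64 image -> 4x4 feature map
        self.downsample = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),                          # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 32 -> 16
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # 16 -> 8
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),  # 8 -> 4
        )
        # 1x1 convolution that fuses image features with the text features
        self.fuse = nn.Conv2d(512 + compressed_dim, 512, 1)
        # Final FC layer returning the real/fake probability
        self.classify = nn.Linear(512 * 4 * 4, 1)

    def forward(self, img, text_embedding):
        feat = self.downsample(img)                        # (B, 512, 4, 4)
        txt = self.embed_fc(text_embedding)                # (B, compressed_dim)
        txt = txt[:, :, None, None].expand(-1, -1, 4, 4)   # replicate spatially
        fused = torch.relu(self.fuse(torch.cat([feat, txt], dim=1)))
        return torch.sigmoid(self.classify(fused.flatten(1)))
```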

Loss functions

Let the text embedding of the given description be $\varphi_t$, and let the Gaussian conditioning variables sampled from it be $\hat{c}_0 \sim \mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))$. Stage I is trained by alternately maximizing the Discriminator loss $\mathcal{L}_{D_0}$ and minimizing the Generator loss $\mathcal{L}_{G_0}$. The equations for these loss functions are as follows.

$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{\text{data}}}[\log D_0(I_0, \varphi_t)] + \mathbb{E}_{z \sim p_z,\, t \sim p_{\text{data}}}[\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{\text{data}}}[\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))] + \lambda D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big)$$

Here, $I_0$ is a real image, $t$ is its text description, and $z$ is a noise vector randomly sampled from the Gaussian distribution $p_z = \mathcal{N}(0, I)$. The regularization parameter $\lambda$ balances the two terms of the Generator loss; the StackGAN paper uses $\lambda = 1$.
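These objectives translate into code roughly as follows. This is a minimal sketch: the signatures of `D0` and `G0` are assumptions, and `mu`/`logvar` are assumed to come from the FC layer that parameterizes the conditioning Gaussian (see Conditioning Augmentation below).

```python
import torch

def stage1_losses(D0, G0, real_imgs, text_embed, mu, logvar, lambda_kl=1.0):
    """Sketch of the Stage I losses; network signatures are assumptions."""
    batch = real_imgs.size(0)
    z = torch.randn(batch, 100)  # z ~ N(0, I)
    # Reparameterized sample of the conditioning variable c_hat0
    c_hat0 = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    fake_imgs = G0(z, c_hat0)

    # Discriminator loss: maximizing L_D0 == minimizing its negative
    d_real = D0(real_imgs, text_embed)
    d_fake = D0(fake_imgs.detach(), text_embed)
    loss_d = -(torch.log(d_real + 1e-8).mean()
               + torch.log(1 - d_fake + 1e-8).mean())

    # Generator loss in the paper's form: log(1 - D(G(...))) plus the KL term.
    # (In practice, the non-saturating -log D(G(...)) variant is also common.)
    d_fake_for_g = D0(fake_imgs, text_embed)
    # KL(N(mu, Sigma) || N(0, I)) for a diagonal Gaussian, logvar = log(sigma^2)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1).mean()
    loss_g = torch.log(1 - d_fake_for_g + 1e-8).mean() + lambda_kl * kl

    return loss_d, loss_g
```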

Stage II GAN

The Stage II GAN receives the output of the Stage I GAN and refines it by re-reading the text description.

Architecture

The StackGAN's Stage II Generator follows an encoder-decoder architecture with residual blocks. The text embedding is first used to create the conditioning variables. The result of Stage I is passed through down-sampling layers and then concatenated with the features obtained from the text embedding. The output of these layers is then up-sampled to generate a high-resolution image. The Discriminator architecture is almost identical to Stage I, except for a few extra down-sampling layers. These extra layers are needed because the images this Discriminator receives are of a higher resolution than in Stage I.
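The encoder-decoder structure might be sketched as follows. The channel counts and spatial sizes are illustrative assumptions; only the overall shape (down-sample, fuse with text features, residual blocks, up-sample) follows the description above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block built from 3x3 stride-1 convolutions."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))

class Stage2Generator(nn.Module):
    """Sketch of the Stage II encoder-decoder Generator (sizes assumed)."""
    def __init__(self, c_dim=128):
        super().__init__()
        # Encoder: down-sample the 64x64 Stage I result to 16x16 features
        self.encode = nn.Sequential(
            nn.Conv2d(3, 128, 4, 2, 1), nn.ReLU(),                         # 64 -> 32
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),  # 32 -> 16
        )
        # Fuse image features with spatially replicated conditioning variables
        self.fuse = nn.Conv2d(256 + c_dim, 256, 3, 1, 1)
        self.residuals = nn.Sequential(*[ResidualBlock(256) for _ in range(4)])
        # Decoder: up-sample the 16x16 features to a 256x256 image
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 128, 3, 1, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 16 -> 32
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 64, 3, 1, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 32 -> 64
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, 3, 1, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 64 -> 128
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh(),                          # 128 -> 256
        )

    def forward(self, low_res, c_hat):
        feat = self.encode(low_res)                         # (B, 256, 16, 16)
        c = c_hat[:, :, None, None].expand(-1, -1, 16, 16)  # replicate spatially
        h = torch.relu(self.fuse(torch.cat([feat, c], dim=1)))
        return self.decode(self.residuals(h))               # (B, 3, 256, 256)
```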

Loss functions

If the low-resolution image from Stage I is given by $s_0 = G_0(z, \hat{c}_0)$ and the Gaussian-sampled conditioning variables by $\hat{c}$, the Discriminator and Generator are trained by alternately maximizing the Discriminator loss $\mathcal{L}_D$ and minimizing the Generator loss $\mathcal{L}_G$:

$$\mathcal{L}_D = \mathbb{E}_{(I, t) \sim p_{\text{data}}}[\log D(I, \varphi_t)] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{\text{data}}}[\log(1 - D(G(s_0, \hat{c}), \varphi_t))]$$

$$\mathcal{L}_G = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{\text{data}}}[\log(1 - D(G(s_0, \hat{c}), \varphi_t))] + \lambda D_{KL}\big(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big)$$

These have the same form as the Stage I losses, except that the low-resolution result $s_0$ is used in place of the noise vector $z$. No new noise is sampled in Stage II, as the required randomness is assumed to be preserved from the previous stage. Stage II also uses a different FC layer from Stage I to generate the means and standard deviations of the conditioning distribution, allowing it to capture information in the text that Stage I omitted.

More architectural details

The StackGAN paper also specifies several architectural details that apply to both the Generator and the Discriminator.

  • The up-sampling blocks consist of nearest-neighbour upsampling followed by a 3x3 stride-1 convolution. Batch Normalization and the ReLU activation are applied after every convolution except the final one (see the sketch after this list).
  • The residual blocks use 3x3 stride-1 convolutions.
  • The StackGAN model that generates 256x256 images has four residual blocks, while the one that generates 128x128 images has only two.
  • The down-sampling blocks use 4x4 stride-2 convolutions and LeakyReLU instead of ReLU.
  • The first down-sampling block does not have a Batch Normalization layer.
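These descriptions map naturally onto small helper modules. The following is a hedged reading of the paper's description, not its reference code; `in_ch` and `out_ch` are placeholder parameter names.

```python
import torch.nn as nn

def up_block(in_ch, out_ch, final=False):
    """Nearest-neighbour upsampling followed by a 3x3 stride-1 convolution."""
    layers = [
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
    ]
    if not final:  # BatchNorm and ReLU after every convolution except the last
        layers += [nn.BatchNorm2d(out_ch), nn.ReLU()]
    return nn.Sequential(*layers)

def down_block(in_ch, out_ch, first=False):
    """4x4 stride-2 convolution with LeakyReLU; the first block skips BatchNorm."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if not first:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)
```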

Embedding

Unlike other networks, where the text embeddings are transformed into conditioning variables using a fixed non-linear transformation, the StackGAN uses additional latent variables produced by a process called Conditioning Augmentation. The resulting conditioning variables are more robust to small perturbations on the data manifold and work with less image-text data.

Conditioning Augmentation

Conditioning Augmentation is one of the major contributions of the StackGAN research. Given a text description $t$, an encoder converts it into a text embedding $\varphi_t$, which is used as input for the Generator. The latent space of this embedding is usually high-dimensional, and when data is limited it is not fully exploited, leading to discontinuities in the latent data manifold. These discontinuities are undesirable and hurt performance.

Conditioning Augmentation addresses this by creating additional training pairs from a small amount of data. Instead of using a fixed conditioning variable, StackGAN samples latent variables $\hat{c}$ from the Gaussian distribution $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$, where the mean $\mu(\varphi_t)$ and the diagonal covariance matrix $\Sigma(\varphi_t)$ are generated from the text embedding $\varphi_t$.

The secondary objective of Conditioning Augmentation is to encourage smoothness, so that small changes on the conditioning manifold produce only small changes in the output. To achieve this, StackGAN adds the Kullback-Leibler (KL) divergence between the conditioning Gaussian and the standard Gaussian as a regularization term in the Generator's objective. This is given by

$$D_{KL}\big(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big)$$
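Putting the sampling and the KL term together, Conditioning Augmentation can be sketched as a small module. The embedding and conditioning dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sketch of Conditioning Augmentation: an FC layer predicts the mean and
    log-variance of a Gaussian from the text embedding, and a conditioning
    variable is drawn with the reparameterization trick (sizes assumed)."""
    def __init__(self, embed_dim=1024, c_dim=128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, c_dim * 2)  # predicts mu and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        # Reparameterization: c_hat = mu + sigma * epsilon, epsilon ~ N(0, I)
        c_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL(N(mu, Sigma) || N(0, I)) for a diagonal covariance
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1).mean()
        return c_hat, kl
```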

Need for StackGAN

Even though generating novel photorealistic images is straightforward with a GAN such as DCGAN, generating higher-resolution images is a hard problem. Previous approaches that simply stacked more up-sampling layers onto a GAN failed, typically resulting in unstable training. By decomposing generation into sketching and refinement sub-tasks, StackGAN can generate 256x256 images. The StackGAN training paradigm can also be combined with existing GANs to improve performance, as it produces higher-resolution outputs.
:::

:::section{.summary}

Conclusion

In this article, we looked at StackGAN and all its components.

  • We understood how to decompose the task of generating novel images using a StackGAN.
  • We looked at the architectural details of the StackGAN, its embeddings, and the respective training stages.
  • We also explored Conditioning Augmentation and understood why it was proposed.

:::