Autoencoders with Convolutions
:::section{.abstract}
Overview
In the absence of labels in a dataset, only a few models can perform well. The Convolutional Autoencoder is a model that can be used to re-create images from a dataset, creating an unsupervised classifier and an image generator in the process. This model uses an Encoder-Bottleneck-Decoder architecture to understand the latent space of the dataset and re-create the images. ::: :::section{.scope}
Scope
- This article explains the principle behind the Convolutional Autoencoder.
- It explores the architecture of the Autoencoder.
- It presents a template we can use to implement the Autoencoder in Tensorflow.
- The article also covers the loss functions and optimizers required to train a Convolutional Autoencoder.
::: :::section{.main}
Introduction
To be able to learn how to re-create a dataset, a model must have an understanding of the underlying latent space. The Convolutional Autoencoder compresses the information in an image dataset by applying successive convolutions. This output is passed to a Bottleneck layer, the smallest possible representation of the dataset. Using this compressed representation, the Decoder attempts to recreate the original dataset. In the process of re-creation, the compressed output resembles a sort of Dimensionality Reduction procedure, while the Reconstruction Loss can be used as a classification metric. This article explores the architecture and methods behind creating a Convolutional Autoencoder.
What is an Autoencoder?
The convolutional Autoencoder is part of a family of models that can reduce the noise in data. In the noise reduction task, these models learn the latent space by reconstructing the data. Autoencoders are split into three main parts - the Encoder, the Bottleneck, and the Decoder. The Encoder compresses the input data while keeping the useful features. The Bottleneck is responsible for choosing the important features that can flow through it to the Decoder. Finally, the Decoder uses the features passed to it by the Bottleneck layer to reconstruct the input.
The architecture diagram of a convolutional autoencoder is shown below.
[IMAGE {1} { arch } START SAMPLE]
[IMAGE {1} FINISH SAMPLE]
Encoder Structure
The Encoder part of the network is to compress the input data and passes it to the Bottleneck layer. The compression creates a knowledge representation much smaller than the original input but has most of its features. This part of the network comprises blocks of convolutions followed by pooling layers that, in turn, further help to create a compressed data representation. The output of an ideal Encoder should be the same as the input but with a smaller size. The Encoder should be sensitive to the inputs to recreate it and not over-sensitive. Being over-sensitive would make the model memorize the inputs perfectly and then overfit the data.
Bottleneck layer
The Bottleneck is the most important layer of an Autoencoder. This module stores the compressed knowledge that is passed to the Decoder. The Bottleneck restricts information flow by only allowing important parts of the compressed representation to pass through to the Decoder. Doing so ensures that the input data has the maximum possible information extracted from it and the most useful correlations found. This part of the architecture is also a measure against overfitting as it prevents the network from memorizing the input data directly. Note that smaller bottlenecks lead to lesser overfitting (to an extent).
Decoder Structure
This part of the network is a “Decompressor” that attempts to recreate an image given its latent attributes. The Decoder gets the compressed information from the Bottleneck layer and then uses upsampling and convolutions to reconstruct it. The output generated by the Decoder is compared with the ground truth to quantify the network’s performance.
Latent Space Structure
The latent space of a network is the compressed representation it creates from an input dataset. This latent space usually has hundreds of dimensions and is hard to visualize directly. More complex neural networks have latent spaces so hard to visualize that they are generally referred to as black boxes. In a convolutional autoencoder, the better the representation of the data, the richer the latent space. The space structure here is a large matrix of tensors that encode the weights of layers of the network.
Uses of Autoencoder
The Convolutional Autoencoder architecture is good for a lot of use cases. Some of these are explained below.
Reducing Complexity
The Encoder of the model works very well as a Dimensionality Reduction technique. For example, if we consider an image dataset, we can compress every image before feeding it to another model. This compression reduces the number of input values and thus makes the model less likely to be biased toward smaller details. The Autoencoder thus helps in improving the performance of a second model.
Anomaly Detection
An Autoencoder is generally used for reconstructing the base data using an Encoder-Bottleneck-Decoder architecture. Thus if the output reconstruction has a much larger error for a given sample, this sample could be an outlier. We can thus use the reconstruction error to find unusual data points in a dataset.
Unsupervised Learning
A convolutional Autoencoder’s information compression ability is a good start for an unsupervised learning problem. Using the Autoencoder as a Dimensionality Reduction technique allows the data to be clustered without any labels much more easily. This clustering may not be useful but can be a starting point for many other solutions.
What is a CNN?
A CNN is one of the foundation stones of Deep Learning that can take an input data sample and learn to recognize it given other such samples. A CNN generally comprises an input, multiple hidden layers, and an output. CNN’s are composed of individual neurons that respond to specific sets of stimuli and combine their responses to perform many tasks such as classification, clustering, segmentation, and others. CNNs are useful because they can understand spatial and temporal dependencies given large amounts of data by learning feature “filters”. Complex CNNs can model large amounts of data previously impossible for a computer to understand.
A simple CNN is shown below.
[IMAGE {2} { CNN } START SAMPLE]
[IMAGE {2} FINISH SAMPLE]
This article elaborates on a CNN-based architecture called the convolutional Autoencoder.
Implementation of an Autoencoder
The Convolutional Autoencoder has a few hyper-parameters that must be tweaked before training. The section below discusses these hyper-parameters, the Reconstruction loss used, and the general implementation of the architecture of a simple convolutional Autoencoder.
An example use case would be to re-create the MNIST dataset. The following image shows the input and output of an Autoencoder that was trained on such a task. The first row is the original input and the second row is the output of the Autoencoder. In this case, a perfect reconstruction is obtained.
[IMAGE {3} { Example } START SAMPLE]
[IMAGE {3} FINISH SAMPLE]
A complete demonstration of the code to re-create the MNIST dataset can be found on this link.
Code size
The code size is defined as how large the bottleneck layer is, consequently deciding to what extent the input data is compressed before being Decoded. The code size is the most important hyper-parameter to tune.
Number of Layers
The convolutional Autoencoder is similar to other types of networks in that the number of layers is a hyper-parameter. Note that increasing the number of layers increases the time to train a network. A higher model depth also increases inference time.
Number of Nodes per Layer
This hyper-parameter controls the weights per layer. As a general rule, the number of nodes decreases in the Encoder and increases in the Decoder.
Reconstruction loss
The loss function the convolutional Autoencoder is trained to minimize is called the reconstruction error. This loss function depends on the type of data we use. In the case of image-based data, the reconstruction loss can be either an MSE, an L1, or Binary Cross Entropy loss.
Encoder
The Encoder can be created with a stack of 2D Convolutions starting with a large size and slowly reducing in dimensions. Each of the convolutions is followed by a 2D Max-Pooling layer. The ReLU activation is used throughout. In Keras, the Encoder can look something like this.
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D((2, 2), padding='same')(x)
Decoder
The Decoder comprises blocks of 2D convolutions followed by Up Sampling layers. This part of the network looks something like the following in Keras.
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(16, (3, 3), activation='relu')(x)
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
::: :::section{.summary}
Conclusion
- Convolutional Autoencoders are a powerful unsupervised learning technique.
- The article explained the Encoder-Bottleneck-Decoder architecture in detail.
- The Reconstruction Loss obtained is a valuable classification and image generation resource.
- The article also explained multiple use cases such as Anomaly Detection, Complexity Reduction, and some others.
- This article also provided a template for implementing a Convolutional Autoencoder in Tensorflow. :::