Classifying a Specific Image Region Using Convolutional Nets with an ROI Mask as Input
Eppel, Sagi. “Classifying a Specific Image Region Using Convolutional Nets with an ROI Mask as Input,” n.d., 8.
Intro
In some cases, it is desirable to classify only a specific region of the image that corresponds to a certain object.
Hence, assuming that the region of the object in the image is known in advance and is given as a binary region of interest (ROI) mask, the goal is to classify the object in this region using a convolutional neural net.
This goal is achieved using a standard image classification net with the addition of a side branch, which converts the ROI mask into an attention map. This map is then combined with the image classification net, which allows the net to focus the attention on the object region while still extracting contextual cues from the background.
Combining the attention map at the first layer of the net gave better results than combining it at higher layers of the net.
An alternative approach is to give the ROI mask to the net as an additional input and use it to generate an attention map, which the net can then use to extract features from both the object and the background.
An attention map can easily be generated from the input ROI mask using a convolution layer.
This attention map is then combined with one or more layers of the main branch, either by element-wise addition or by element-wise multiplication.
The combined layer is then used as the input for the next layer of the main branch.
In order to allow element-wise addition or multiplication, the attention map must be the same size as the layer with which it is combined. To achieve this, the ROI mask was first resized to match the size of the layer with which it was merged, and a convolution layer was then applied with the same number of filters as the depth of the target layer.
For cases where attention maps were combined with more than one layer, a separate attention map was generated for each layer, using different convolution filters.
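The resize, convolve, and merge steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the nearest-neighbour resize, the 3x3 kernel, the zero padding, and the channel-first layout are all choices made for this example.

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D binary mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]

def conv_mask(mask, weights, bias):
    """Apply C filters of odd size k x k to a 2D mask (stride 1, zero padding).

    weights has shape (C, k, k) and bias has shape (C,).
    Returns an attention map of shape (C, H, W).
    """
    depth, k, _ = weights.shape
    h, w = mask.shape
    padded = np.pad(mask.astype(np.float64), k // 2)
    out = np.zeros((depth, h, w))
    for i in range(k):
        for j in range(k):
            # Each filter tap scales a shifted copy of the padded mask.
            out += weights[:, i, j][:, None, None] * padded[i:i + h, j:j + w][None]
    return out + bias[:, None, None]

def merge(features, attention, mode):
    """Combine a feature layer with an attention map of the same shape."""
    if mode == "add":
        return features + attention
    if mode == "mul":
        return features * attention
    raise ValueError("mode must be 'add' or 'mul'")

# Hypothetical sizes: an 8x8 ROI mask and a target layer of depth 16 at 4x4.
rng = np.random.default_rng(0)
roi_mask = np.zeros((8, 8), dtype=np.uint8)
roi_mask[2:6, 2:6] = 1
features = rng.normal(size=(16, 4, 4))

small_mask = resize_mask_nearest(roi_mask, 4, 4)   # match the layer's spatial size
weights = rng.normal(size=(16, 3, 3)) * 0.1        # one filter per target channel
bias = np.zeros(16)
attention = conv_mask(small_mask, weights, bias)   # same shape as `features`
merged = merge(features, attention, "add")         # input to the next layer
```

When attention is merged with several layers, the same three steps would be repeated per layer with that layer's spatial size and its own `weights` and `bias`.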
Net Initialization
The convolution layer of the side branch was initialized as follows: if the attention map was to be merged by element-wise addition, both the weights and the bias were initialized to zero; if the attention map was to be merged by element-wise multiplication, the bias was set to one and the filter weights to zero.
This weight initialization ensures that the attention branch has no effect on the classification branch at the outset and that its effect increases gradually during training.
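A small NumPy check makes the reasoning concrete: with zero weights the attention map equals the bias everywhere, so adding a bias of zero or multiplying by a bias of one leaves the features unchanged. The helper names and the 1x1 filter size are assumptions made to keep the sketch short.

```python
import numpy as np

def init_side_branch(depth, mode):
    """Return (weights, bias) for a 1x1 side-branch convolution."""
    weights = np.zeros(depth)
    # Additive merge: bias 0. Multiplicative merge: bias 1.
    bias = np.zeros(depth) if mode == "add" else np.ones(depth)
    return weights, bias

def attention_map(mask, weights, bias):
    """1x1 convolution of a single-channel mask: one scale and offset per channel."""
    return weights[:, None, None] * mask[None, :, :] + bias[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))   # hypothetical target layer, depth 8
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

w_add, b_add = init_side_branch(8, "add")
w_mul, b_mul = init_side_branch(8, "mul")

out_add = features + attention_map(mask, w_add, b_add)
out_mul = features * attention_map(mask, w_mul, b_mul)

# Both merges are the identity at initialization.
unchanged_add = np.array_equal(out_add, features)
unchanged_mul = np.array_equal(out_mul, features)
```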
Datasets
The nets were also trained using the OpenSurfaces material classification dataset; in this case, the ROI was generated by taking a connected region of the image corresponding to a single material, and the output was the material type.
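One way to extract such a connected single-material region is a flood fill over a per-pixel material label map. The sketch below assumes a 2D integer label array, a seed pixel, and 4-connectivity; these choices and the function name are illustrative and are not taken from the paper or from the OpenSurfaces tooling.

```python
from collections import deque
import numpy as np

def connected_region_mask(labels, seed):
    """Binary mask of the 4-connected region sharing the seed pixel's label."""
    h, w = labels.shape
    target = labels[seed]
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[seed] = 1
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < h and 0 <= nx < w
            if inside and mask[ny, nx] == 0 and labels[ny, nx] == target:
                mask[ny, nx] = 1
                queue.append((ny, nx))
    return mask

# Toy label map: material 1 appears in two separate regions.
labels = np.array([
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 0, 1, 1],
])
roi = connected_region_mask(labels, (0, 0))
material = int(labels[0, 0])   # the classification target for this ROI
```

Only the region touching the seed is selected, so the second patch of material 1 in the bottom row stays outside the ROI.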
Results
It can be seen that methods based on generating an attention map and combining it with the main branch of the net gave considerably better accuracy than hard attention methods based on blacking out the background region.
The difference in accuracy is particularly large for the classification of small segments where background information is more important in classification.
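For comparison, the hard attention baseline mentioned above can be sketched in a few lines of NumPy. The function name and the (H, W, 3) image layout are assumptions for this example. Every background pixel is set to zero before classification, which is why contextual cues are lost.

```python
import numpy as np

def black_out_background(image, roi_mask):
    """Zero every pixel outside the ROI.

    image has shape (H, W, 3); roi_mask has shape (H, W) with values 0 or 1.
    """
    return image * roi_mask[:, :, None].astype(image.dtype)

image = np.full((4, 4, 3), 200, dtype=np.uint8)
roi_mask = np.zeros((4, 4), dtype=np.uint8)
roi_mask[1:3, 1:3] = 1

masked = black_out_background(image, roi_mask)
```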
Merging the attention map with the first layer of the net gave significantly better results than merging at higher layers.
This is probably due to the fact that higher layers of the net suffer from a loss of high-resolution information that is relevant to the classification of small objects.
Generating several attention maps and merging them with multiple layers of the net gave the same or worse results than generating a single attention map and merging it with the first layer.