Scaling down the residuals before adding them onto the shortcut of the residual connection stabilized the training (scaling factor: 0.1–0.3)
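A minimal sketch of this residual scaling, assuming a PyTorch-style module; the branch argument and the default scale of 0.2 are illustrative choices within the quoted 0.1–0.3 range, not the paper's exact configuration:

```python
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Adds a down-scaled residual branch onto the shortcut path."""

    def __init__(self, branch: nn.Module, scale: float = 0.2):
        super().__init__()
        self.branch = branch  # any convolutional sub-network (hypothetical)
        self.scale = scale    # scaling factor, chosen from roughly 0.1-0.3

    def forward(self, x):
        # Scale the residual activations down before the element-wise addition.
        return x + self.scale * self.branch(x)
```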
General Ideas
Parallel convolutions: Similar to the modules of the GoogLeNet architecture, the authors simultaneously apply multiple convolutional branches with different receptive field sizes to the same input activation maps and again concatenate the resulting activations for further processing.
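A rough sketch of such a module with parallel branches of different receptive fields, concatenated along the channel dimension; the branch widths are made up for illustration and are not the paper's exact module layout:

```python
import torch
import torch.nn as nn

class ParallelConvModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 branches on the same input, concatenated depth-wise."""

    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # Every branch sees the same activation maps; the outputs share H and W,
        # so they can be concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
```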
Reduction modules: Instead of simply applying a single max pooling or stride-2 convolution to downsize the spatial dimensions, the authors dedicate whole modules to this task, again employing parallel branches.
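A hypothetical reduction module in the same spirit, where every branch halves the spatial resolution; the branch layout and channel counts are illustrative, not the exact ones from the paper:

```python
import torch
import torch.nn as nn

class ReductionModule(nn.Module):
    """Downsamples with parallel stride-2 branches instead of a single pooling layer."""

    def __init__(self, in_ch: int, conv_ch: int = 64):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # Both branches reduce H and W by a factor of 2; their outputs are concatenated,
        # so the channel count grows while the spatial resolution shrinks.
        return torch.cat([self.pool(x), self.conv(x)], dim=1)
```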
Strong usage of small convolutional kernels (e.g. 3×3): Throughout the network the authors prefer smaller convolutional kernel sizes over larger ones, as this achieves the same receptive field with fewer parameters (e.g. a single 5×5 convolution [∼25 params] has the same receptive field as 2 consecutive 3×3 convolutions [∼18 params], but the latter has fewer parameters).
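The quoted parameter counts can be verified with a quick back-of-the-envelope check (per kernel, ignoring channels and biases):

```python
# Weights per (input channel, output channel) pair, ignoring biases:
params_single_5x5 = 5 * 5     # 25 weights, 5x5 receptive field
params_two_3x3 = 2 * (3 * 3)  # 18 weights, also a 5x5 receptive field when stacked
print(params_single_5x5, params_two_3x3)  # -> 25 18
```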
Factorization of convolutions: They factorize convolutions of filter size n×n into a combination of 1×n and n×1 convolutions in order to reduce the number of parameters even further (e.g. 7×7 [∼49 params] becomes 1×7 and 7×1 [∼14 params]!)
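A sketch of this factorization in PyTorch; the channel sizes are made up for illustration:

```python
import torch.nn as nn

in_ch, out_ch = 64, 64  # hypothetical channel sizes

# Standard 7x7 convolution: 49 weights per kernel.
conv_7x7 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)

# Factorized version: 1x7 followed by 7x1, i.e. 7 + 7 = 14 weights per kernel,
# while the stacked pair still covers a 7x7 receptive field.
conv_factorized = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
)
```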
Residual connections: In Inception-ResNet-v1 and Inception-ResNet-v2 the authors employ residual connections. Although the residual versions of the networks converge faster, the final accuracy seems to depend mainly on the model size.
Usage of bottleneck layers: In order to reduce the cost of the individual convolutional branches within their modules, they apply 1×1 convolutions at the beginning of each branch to reduce the depth of the input activation maps.
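A minimal sketch of such a bottleneck branch, assuming made-up channel sizes: the 1×1 convolution first shrinks the depth so that the expensive 3×3 convolution runs on fewer channels:

```python
import torch.nn as nn

# Hypothetical channel sizes: the 1x1 "bottleneck" shrinks 256 input channels to 64,
# so the expensive 3x3 convolution operates on far fewer channels.
bottleneck_branch = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # cheap depth reduction
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # costly conv, now on reduced depth
)
```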
Remarks
The authors disagree with the residual paper on some points
Residual connections are necessary for training deep convolutional models
They show that it is not hard to train very deep models which achieve high performance without residual connections
They argue that residual connections only speed up the training (greatly), rather than being necessary
“Warm up” phases (pre-training with a very low LR followed by a high LR) do not help to stabilize the training of very deep networks
the subsequent high LR still had the chance to destroy the already learnt features