SELU
- selu(x) = λ·x if x > 0, else λ·α·(e^x − 1)
- Paper: https://arxiv.org/pdf/1706.02515.pdf
- a scaled variant of the ELU activation function
- does internal normalization (“self-normalizing”)
- each layer preserves the mean and the variance from the previous one
- normalization happens within the activation function
- for self-normalization to work (see the Keras sketch after this list):
- input features must be standardized (mean 0, standard deviation 1)
- the architecture must be sequential (a plain stack of layers, no skip connections)
- SELU must be the activation of every hidden layer
- weights need a custom initialization (LeCun normal)
- all layers are dense in the paper, but other research showed that it also works for CNNs
- self-normalization is not guaranteed otherwise
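A minimal Keras sketch (not from the source) of a setup that satisfies these conditions; the layer sizes, the 20-feature input, the optimizer and the loss are placeholder choices:

```python
from tensorflow import keras

# Sequential, dense-only architecture: SELU activation + LeCun normal init,
# as required for the self-normalizing property to hold.
model = keras.Sequential([
    keras.Input(shape=(20,)),                      # 20 input features (placeholder size)
    keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(1, activation="sigmoid"),   # output layer, not part of the SELU stack
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
# Inputs passed to model.fit(...) must be standardized to mean 0 / std 1 beforehand.
```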
- has two fixed parameters, α and λ
- neither hyperparameters nor learned parameters
- derived analytically under the assumption of standardized inputs (μ = 0, σ = 1)
- α ≈ 1.6732, λ ≈ 1.0507 (used in the NumPy sketch below)
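A minimal NumPy sketch of the activation itself, using the fixed constants above; the quick mean/variance check at the end only illustrates the self-normalizing idea on standard-normal inputs, it is not the paper's derivation:

```python
import numpy as np

ALPHA = 1.6732632423543772   # fixed α from the paper
LAMBDA = 1.0507009873554805  # fixed λ (scale) from the paper

def selu(x: np.ndarray) -> np.ndarray:
    """selu(x) = λ·x for x > 0, λ·α·(e^x − 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Rough check: standardized (mean 0, std 1) inputs stay roughly
# zero-mean / unit-variance after the activation.
x = np.random.randn(1_000_000)
out = selu(x)
print(out.mean(), out.std())   # both stay close to 0 and 1
```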
- Pros:
- no vanishing/exploding gradients
- neurons cannot die the way ReLU units can
- tends to converge faster and to a better result than other activation functions
- significantly outperformed other activation functions for deep networks (in the paper's experiments)
- Cons:
- self-normalization only holds under the strict conditions listed above (standardized inputs, sequential dense architecture, LeCun normal initialization)