SELU
- selu(x) = λ·x if x > 0, else λ·α·(e^x − 1)
- Paper: https://arxiv.org/pdf/1706.02515.pdf
- a scaled variant of the ELU activation function
- does internal normalization (“self-normalizing”)
- each layer preserves the mean and the variance from the previous one
- normalization happens within the activation function
- for self-normalization to work (see the Keras sketch after this list):
- input features must be standardized (mean 0, standard deviation 1)
- the architecture must be sequential (a plain stack of layers, no skip connections)
- SELU must be the activation of every hidden layer
- weights need a custom initialization (LeCun normal)
- all layers are dense in the paper, but other research showed that it also works for CNNs
- self-normalization is not guaranteed otherwise
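A minimal Keras sketch (not from the source) of a setup that satisfies these conditions; the layer sizes, the 20-feature input, the optimizer and the loss are placeholder choices:

```python
from tensorflow import keras

# Sequential, dense-only architecture: SELU activation + LeCun normal init,
# as required for the self-normalizing property to hold.
model = keras.Sequential([
    keras.Input(shape=(20,)),                      # 20 input features (placeholder size)
    keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(1, activation="sigmoid"),   # output layer, not part of the SELU stack
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
# Inputs passed to model.fit(...) must be standardized to mean 0 / std 1 beforehand.
```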
- has two fixed parameters, α and λ
- neither hyperparameters nor learned parameters
- derived analytically under the assumption of standardized inputs (μ = 0, σ = 1)
- α ≈ 1.6732, λ ≈ 1.0507 (used in the NumPy sketch below)
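A minimal NumPy sketch of the activation itself, using the fixed constants above; the quick mean/variance check at the end only illustrates the self-normalizing idea on standard-normal inputs, it is not the paper's derivation:

```python
import numpy as np

ALPHA = 1.6732632423543772   # fixed α from the paper
LAMBDA = 1.0507009873554805  # fixed λ (scale) from the paper

def selu(x: np.ndarray) -> np.ndarray:
    """selu(x) = λ·x for x > 0, λ·α·(e^x − 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Rough check: standardized (mean 0, std 1) inputs stay roughly
# zero-mean / unit-variance after the activation.
x = np.random.randn(1_000_000)
out = selu(x)
print(out.mean(), out.std())   # both stay close to 0 and 1
```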
- Pros:
- no vanishing/exploding gradients
- neurons cannot die the way ReLU units can
- tends to converge faster and to a better result than other activation functions
- significantly outperformed other activation functions for deep networks (in the paper's experiments)
- Cons:
- self-normalization only holds under the strict conditions listed above (standardized inputs, sequential dense architecture, LeCun normal initialization)