SELU

Information

  • Paper: https://arxiv.org/pdf/1706.02515.pdf
  • scaled variant of the ELU function
  • does internal normalization (“self-normalizing”)
    • each layer preserves the mean and the variance from the previous one
    • normalization happens within the activation function
    • requirements for self-normalization to hold (see the sketch after this list):
      • input features must be standardized (mean 0, std 1)
      • architecture must be sequential
        • self-normalization is not guaranteed otherwise (e.g. with skip connections)
      • SELU as the activation function in every hidden layer
      • custom weight initialization (LeCun normal in the paper)
      • guaranteed in the paper only for dense layers, but other research showed that it also works for CNNs
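
A minimal sketch of these requirements (assuming tf.keras; layer sizes and the placeholder data are illustrative, not from the paper):

```python
import numpy as np
import tensorflow as tf

# 1) standardize the input features to mean 0 / std 1
X_train = np.random.randn(1000, 20).astype("float32")            # placeholder data
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# 2) purely sequential dense stack, SELU activation, LeCun normal init in every hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(1),                                     # e.g. regression head
])
model.compile(optimizer="adam", loss="mse")
```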
  • has two fixed parameters α and λ
    • neither hyperparameters nor learnt parameters
    • derived analytically so that activations converge to μ=0, std=1 (see the sketch after this list)
    • α≈1.6732, λ≈1.0507
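
For reference, a small NumPy sketch of the resulting function with those constants (definition as in the paper; the helper name is mine):

```python
import numpy as np

ALPHA = 1.6732632423543772   # α from the paper
SCALE = 1.0507009873554805   # λ from the paper

def selu(x):
    """SELU(x) = λ·x if x > 0, else λ·α·(exp(x) − 1)."""
    x = np.asarray(x, dtype=np.float64)
    return SCALE * np.where(x > 0.0, x, ALPHA * (np.exp(x) - 1.0))
```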
  • Pros:
    • no vanishing/exploding gradients
    • units cannot die, unlike ReLU
    • converges faster and to a better result than other activation functions
    • significantly outperformed other activation functions for deep networks
  • Cons:
    • computationally heavier (involves computing an exponential)