He Initialization

  • He initialization scales the randomly drawn weights so as to bring the variance of each layer's outputs to approximately one
  • However, Kumar (2017) proves mathematically that for the ReLU activation function, the best weight initialization strategy is to initialize the weights randomly with the following variance (a short code sketch follows this list):
    • \begin{equation} \sigma^{2} = 2/N \end{equation}
    • where N is the number of inputs to the layer
  • For sigmoid-based activation functions, Kumar derives a larger optimal variance:
    • \begin{equation} \sigma^{2} \approx 12.8/N \end{equation}
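
A minimal sketch of the ReLU rule above, assuming NumPy and a fully connected layer; the function name `he_init` and the layer sizes are illustrative, not from the source:

```python
import numpy as np

def he_init(n_in, n_out, seed=None):
    """Draw an (n_in, n_out) weight matrix from N(0, 2/n_in),
    the He/Kumar variance for ReLU layers."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)  # sigma^2 = 2/N, with N = number of inputs
    # For sigmoid layers, Kumar's analysis would substitute ~12.8/N here.
    return rng.normal(0.0, std, size=(n_in, n_out))

# Illustrative usage: weights for a 784 -> 256 fully connected layer.
W = he_init(784, 256, seed=0)
print(W.var())  # close to 2/784 ≈ 0.00255
```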