He Initialization
- The goal is to bring the variance of each layer's outputs to approximately one.
- However, Kumar proves mathematically that for the ReLU activation function, the best weight initialization strategy is to initialize the weights randomly with the following variance, where $N$ is the number of inputs feeding the layer (a short code sketch follows this list):
- \begin{equation} \sigma^{2} = \frac{2}{N} \end{equation}
- For sigmoid-based activation functions, a different optimal variance applies.
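To make the formula concrete, here is a minimal NumPy sketch of He initialization. The function name `he_init` and the layer sizes in the usage example are illustrative, not from the original text: weights are simply drawn from a zero-mean Gaussian whose variance is $2/N$, with $N$ the number of inputs to the layer.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """Draw a (fan_in, fan_out) weight matrix with He initialization.

    Weights come from a zero-mean Gaussian with variance 2 / fan_in,
    i.e. standard deviation sqrt(2 / N), where N = fan_in.
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)  # sigma^2 = 2 / N
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

# Hypothetical usage: a dense layer with 512 inputs and 256 ReLU units.
W = he_init(512, 256)
print(W.std())  # roughly sqrt(2 / 512) ≈ 0.0625
```

With this choice of variance, the pre-activations of a ReLU layer keep roughly unit variance as the signal propagates forward, which is exactly the goal stated at the top of this list.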