Backprop

  • Gradient
    • partial derivatives of the loss w.r.t. the weights
    • Forward pass
      • Store the result of each operation in an intermediate u_i; these cached values are reused by the backward pass (see the sketch after this outline)
    • Backward Pass
      • Traverse the graph backwards
        • Chain Rule:
        • \begin{align} \frac{\partial \hat y}{\partial \mathbf{W}_1} &= \frac{\partial \hat y}{\partial u_2} \frac{\partial u_2}{\partial h_1} \frac{\partial h_1}{\partial u_1} \frac{\partial u_1}{\partial \mathbf{W}_1} \\ &= \frac{\partial \sigma(u_2)}{\partial u_2} \frac{\partial \mathbf{W}_2^T h_1}{\partial h_1} \frac{\partial \sigma(u_1)}{\partial u_1} \frac{\partial \mathbf{W}_1^T x}{\partial \mathbf{W}_1} \end{align}
        • Computing the gradient w.r.t. parameters early in the architecture multiplies many chain-rule factors; when each factor is < 1, the product decreases exponentially with the depth of the network: Vanishing gradient (see the numeric demo after this outline)
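
A minimal NumPy sketch of the two-layer network ŷ = σ(W₂ᵀ σ(W₁ᵀ x)) behind the equation above: the forward pass stores every intermediate u_i, and the backward pass traverses them in reverse, multiplying the chain-rule factors to obtain ∂ŷ/∂W₁. Function and variable names (forward, backward, cache) are illustrative, not any library's API; the finite-difference loop at the end only checks the analytic gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    """Forward pass: compute y_hat and cache each intermediate result."""
    u1 = W1.T @ x            # u1 = W1ᵀ x
    h1 = sigmoid(u1)         # h1 = σ(u1)
    u2 = w2 @ h1             # u2 = w2ᵀ h1 (scalar output layer)
    y_hat = sigmoid(u2)      # ŷ = σ(u2)
    return y_hat, (x, u1, h1, u2)

def backward(cache, w2):
    """Backward pass: multiply the chain-rule factors from the output
    back to W1, exactly as in the equation above."""
    x, u1, h1, u2 = cache
    dy_du2 = sigmoid(u2) * (1.0 - sigmoid(u2))  # ∂σ(u2)/∂u2 (scalar)
    du2_dh1 = w2                                # ∂(w2ᵀ h1)/∂h1
    dh1_du1 = h1 * (1.0 - h1)                   # ∂σ(u1)/∂u1 (elementwise)
    delta1 = dy_du2 * du2_dh1 * dh1_du1         # gradient reaching u1
    return np.outer(x, delta1)                  # ∂u1/∂W1 contributes x

# Sanity check against finite differences.
rng = np.random.default_rng(0)
x, W1, w2 = rng.normal(size=3), rng.normal(size=(3, 4)), rng.normal(size=4)
y_hat, cache = forward(x, W1, w2)
dW1 = backward(cache, w2)

eps, num = 1e-6, np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        num[i, j] = (forward(x, Wp, w2)[0] - y_hat) / eps
print(np.allclose(dW1, num, atol=1e-4))  # True: analytic gradient matches
```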
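
The vanishing-gradient point follows from σ'(u) = σ(u)(1 − σ(u)) ≤ 0.25: every sigmoid layer contributes a factor of at most 0.25 (times a weight) to the gradient of the first layer, so the product shrinks roughly exponentially with depth. A toy demo, where the chain of scalar "layers", the depth, and the weight distribution are all arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Chain of scalar sigmoid "layers" h_{k+1} = σ(w_k · h_k): each layer
# multiplies the gradient reaching layer 1 by w_k · σ'(u_k).
rng = np.random.default_rng(0)
h, grad = 0.5, 1.0
for depth in range(1, 51):
    w = rng.normal()                             # arbitrary weight draw
    u = w * h
    grad *= w * sigmoid(u) * (1.0 - sigmoid(u))  # chain-rule factor ≤ 0.25·|w|
    h = sigmoid(u)
    if depth % 10 == 0:
        print(f"depth {depth:2d}: |grad| ≈ {abs(grad):.3e}")
# |grad| collapses toward zero: the vanishing gradient.
```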