Backprop
- Gradient $\nabla l(\theta) = \left[\frac{\partial l(\theta)}{\partial \theta_1}, \dots, \frac{\partial l(\theta)}{\partial \theta_L}\right]$
- partial derivatives of the loss with respect to the weights
Forward pass
- Store the result of each intermediate operation in $u_i$
Backward pass
- Traverse the graph backwards
- Chain Rule : $\frac{dl}{d\theta_i} = \sum_{k \in \text{parents}(l)} \frac{\partial l}{\partial u_k} \frac{\partial u_k}{\partial \theta_i}$
- \begin{align} &\frac{d\hat y}{d\mathbf{W}_1}\\ &= \frac{\partial \hat y}{\partial u_2} \frac{\partial u_2}{\partial h_1} \frac{\partial h_1}{\partial u_1} \frac{\partial u_1}{\partial \mathbf{W}_1} \\ &= \frac{\partial \sigma (u_2)}{\partial u_2} \frac{\partial \mathbf{W}^T_2 h_1}{\partial h_1} \frac{\partial \sigma (u_1)}{\partial u_1} \frac{\partial \mathbf{W}^T_1 x}{\partial \mathbf{W}_1} \end{align}
- Collecting all the $\frac{\partial \sigma(u_i)}{\partial u_i}$ factors along the architecture: their product shrinks exponentially with the depth of the network → Vanishing gradients
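The forward and backward passes above can be sketched numerically. This is a minimal illustration, assuming the two-layer network $\hat y = \sigma(\mathbf{W}_2^T \sigma(\mathbf{W}_1^T x))$ from the derivation; the shapes, random values, and finite-difference check are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)        # input (illustrative sizes)
W1 = rng.standard_normal((3, 4))  # first-layer weights
W2 = rng.standard_normal((4, 1))  # second-layer weights

# Forward pass: store the result of each operation in u_i
u1 = W1.T @ x          # pre-activation of layer 1
h1 = sigmoid(u1)       # h1 = sigma(u1)
u2 = W2.T @ h1         # pre-activation of layer 2
y_hat = sigmoid(u2)    # scalar output, shape (1,)

# Backward pass: traverse the graph backwards, chaining local derivatives
dy_du2 = y_hat * (1 - y_hat)      # d sigma(u2) / du2
dy_dh1 = W2[:, 0] * dy_du2        # chain through u2 = W2^T h1
dy_du1 = dy_dh1 * h1 * (1 - h1)   # chain through h1 = sigma(u1)
dy_dW1 = np.outer(x, dy_du1)      # chain through u1 = W1^T x

# Sanity check against central finite differences
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        yp = sigmoid(W2.T @ sigmoid(Wp.T @ x))
        ym = sigmoid(W2.T @ sigmoid(Wm.T @ x))
        num[i, j] = (yp - ym)[0] / (2 * eps)

print(np.max(np.abs(dy_dW1 - num)))  # ≈ 0, up to finite-difference error
```

Each line of the backward pass corresponds to one factor in the align-block derivation above.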
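The vanishing-gradient point can be made concrete: the sigmoid derivative $\sigma'(u) = \sigma(u)(1-\sigma(u))$ is at most $0.25$, so a chain-rule product of such factors over the depth of the network decays at least as fast as $0.25^{\text{depth}}$. A small sketch (the depth and pre-activation value are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

depth = 20   # illustrative network depth
h = 0.5      # an arbitrary fixed pre-activation value

# Multiply one sigma'(h) factor per layer, as the chain rule does
grad_mag = 1.0
for _ in range(depth):
    grad_mag *= sigmoid(h) * (1 - sigmoid(h))  # each factor <= 0.25

print(grad_mag)  # < 1e-10: the gradient has effectively vanished
```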