Large Batch Training
- With the number of epochs fixed, increasing the batch size alone generally slows down convergence and hurts final accuracy
- For convex problems, the convergence rate decreases as the batch size increases
- Learning rate scheduling: scale the learning rate linearly with the batch size, and warm it up from a small value over the first few epochs to avoid instability early in training
- Zero γ initialization: initialize γ = 0 for the last batch normalization layer of each residual block, so the block initially passes through only the identity branch; this mimics a network with fewer layers and is easier to train at the start
- No bias decay: apply weight decay only to weight tensors; leave biases and the batch-norm γ and β parameters unregularized
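The linear scaling rule and warmup from the scheduling bullet can be sketched as a single schedule function. This is a minimal illustration, not a framework API; the function name, the base learning rate of 0.1, and the reference batch size of 256 are assumptions (0.1 at batch 256 is a common ResNet default).

```python
def lr_at_epoch(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Learning rate under linear scaling plus linear warmup.

    Linear scaling rule: scale base_lr by batch_size / base_batch.
    Warmup: ramp the rate linearly up to the scaled value over the
    first `warmup_epochs` epochs, then hold it (a decay schedule such
    as cosine annealing would take over from there).
    """
    target_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Epochs are 0-indexed; epoch warmup_epochs - 1 reaches target_lr.
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr
```

For example, with batch size 1024 the target rate becomes 0.4, and the first five epochs ramp through 0.08, 0.16, 0.24, 0.32, 0.4.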
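The "no bias decay" bullet amounts to splitting the model's parameters into two optimizer groups. A minimal sketch, assuming parameters arrive as (name, tensor) pairs in the style of PyTorch's `model.named_parameters()` and that batch-norm layers are identifiable by a ".bn" substring in the name; both the function name and that naming convention are assumptions for illustration.

```python
def split_decay_groups(named_params):
    """Partition parameters for the 'no bias decay' heuristic.

    Weight decay should apply only to weight tensors; biases and the
    batch-norm scale/shift parameters (gamma/beta, typically stored as
    the batch-norm layer's .weight and .bias) stay unregularized.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith(".bias") or ".bn" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return decay, no_decay
```

The two lists would then be passed to the optimizer as separate parameter groups, e.g. one group with the chosen weight decay and one with weight decay set to 0.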