Gradient Accumulation
- PyTorch
- helps when the model cannot be trained with a large enough batch size
	- usually because of GPU memory limitations
- Accumulate the gradients (for each trainable parameter) over several forward/backward passes, and only after a set number of steps use the accumulated gradients to update the weights
- Is then approximately equivalent to training with a correspondingly larger batch size
- example: accumulate over N micro-batches, then call optimizer.step() once
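A minimal sketch of the pattern above in PyTorch. The tiny linear model, the synthetic data, and the choice of 4 accumulation steps are all illustrative assumptions, not from the source:

```python
import torch
from torch import nn

# Hypothetical tiny model and synthetic data, just to illustrate the pattern.
torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 4  # effective batch size = micro-batch size * 4
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient equals the mean
    # gradient over the full (large) batch.
    (loss / accumulation_steps).backward()  # gradients add up across calls
    if step % accumulation_steps == 0:
        optimizer.step()       # update weights with the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next window
```

Key detail: `backward()` accumulates into `.grad` by default, so the only changes versus a normal loop are scaling the loss and calling `optimizer.step()` / `zero_grad()` every `accumulation_steps` iterations instead of every iteration.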