Gradient Accumulation

  • PyTorch
  • helps when the model cannot be trained with a large enough batch size
  • often caused by memory limitations of the GPU
  • Accumulate the gradients (for each trainable parameter) over several forward/backward passes, then after a set number of steps use the accumulated gradients to update the weights
  • Is then equivalent to using a large batch size (if the loss is averaged per mini-batch, scale each loss by 1/accumulation_steps so the accumulated gradient matches that of one large batch)
  • example with PyTorch: see the sketch below
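
A minimal sketch of the loop, assuming a toy model, optimizer, loss, and data (all names below are illustrative placeholders, not a fixed recipe):

```python
import torch
import torch.nn as nn

# Illustrative setup: any model, optimizer, loss, and dataloader work here
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
dataloader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]

accumulation_steps = 4  # effective batch size = 4 * per-step batch size

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    # scale the loss so the accumulated gradient matches one large batch
    # (CrossEntropyLoss averages over the mini-batch by default)
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()  # backward() adds into param.grad, accumulating gradients

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update weights with the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next accumulation cycle
```

The accumulation happens because PyTorch adds into `param.grad` on every `backward()` call until `zero_grad()` clears it, so simply deferring `optimizer.step()` is enough.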