Gradient Clipping

  • Limit the value or the norm of a gradient to a fixed Hyperparameter λ.
  • mitigates Vanishing & Exploding Gradients, primarily the exploding ones
  • the idea is to CLIP the gradients during Backpropagation once they exceed a certain threshold (limit their magnitude)
  • most often used in RNNs or GANs, where Batch Normalisation is tricky to use
  • methods (see the sketch after this list)
    • CLIP by norm
      • CLIP the whole gradient if its L2 norm is greater than the threshold
      • preserves the orientation of the gradient
    • CLIP by value
      • CLIP the gradient by a fixed value
      • problem: orientation of the gradient may change due to clipping
        • example: [0.9, 100.0] → [0.9, 1.0]
        • however, this works well in practice
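
A minimal NumPy sketch of both methods, illustrating the orientation point above (the function names `clip_by_norm` and `clip_by_value` are illustrative, not a library API):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most `threshold`; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

def clip_by_value(grad, threshold):
    """Clamp each component into [-threshold, threshold]; direction may change."""
    return np.clip(grad, -threshold, threshold)

g = np.array([0.9, 100.0])
print(clip_by_norm(g, 1.0))   # ≈ [0.009, 1.0] -- same direction, smaller magnitude
print(clip_by_value(g, 1.0))  # [0.9, 1.0]     -- direction changed (the example above)
```

In PyTorch the built-in equivalents are `torch.nn.utils.clip_grad_norm_` and `torch.nn.utils.clip_grad_value_`, typically called between `loss.backward()` and `optimizer.step()`.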
  • pros:
    • enables stable training with larger batch sizes
  • cons:
    • sensitive to the choice of the Hyperparameter λ (needs tuning)
  • Adaptive Gradient Clipping
    • instead of a fixed global threshold, CLIP each parameter's gradient relative to the norm of that parameter, so the effective threshold scales per layer (Brock et al., 2021); see the sketch below
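
A minimal sketch of the Adaptive Gradient Clipping idea, simplified here to per-tensor norms rather than the unit-wise (row-wise) norms used in the paper; the function name and default values are illustrative assumptions:

```python
import numpy as np

def adaptive_clip(grad, weight, lam=0.01, eps=1e-3):
    """Clip grad so that ||grad|| / max(||weight||, eps) does not exceed lam.

    The threshold scales with the parameter norm, so no single global
    lambda has to fit every layer at once.
    """
    w_norm = max(np.linalg.norm(weight), eps)  # eps guards zero-initialised weights
    g_norm = np.linalg.norm(grad)
    max_norm = lam * w_norm
    if g_norm > max_norm:
        return grad * (max_norm / g_norm)  # rescale, preserving direction
    return grad
```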