Adaptive Gradient Clipping

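  • Clips each gradient according to the ratio of its norm to the norm of the corresponding weights (computed unit-wise), rather than by the gradient norm alone
  • The clipping threshold is less sensitive to tuning than in traditional gradient clipping (GC)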
  • Replacing Batch Normalisation with AGC
    • Faster training for equally sized models
    • Allows training with even larger batch sizes
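
The clipping rule in the first bullet can be sketched in NumPy as below. This is a minimal illustration, not a reference implementation: the function names, the default values for `clip` and `eps`, and the convention that the first axis indexes output units are all assumptions made for the example.

```python
import numpy as np


def unitwise_norm(x):
    """Norm per output unit (assumed to be the first axis); whole-tensor norm for 0-D/1-D."""
    if x.ndim <= 1:
        return np.linalg.norm(x)
    # Reduce over every axis except the first, keeping dims so the result broadcasts.
    axes = tuple(range(1, x.ndim))
    return np.sqrt(np.sum(x * x, axis=axes, keepdims=True))


def adaptive_grad_clip(grad, weight, clip=0.01, eps=1e-3):
    """Rescale each unit's gradient so that ||g|| / max(||w||, eps) <= clip.

    `clip` and `eps` are illustrative defaults (assumptions), not tuned values.
    """
    # eps keeps zero-initialised weights from freezing their gradients at zero.
    w_norm = np.maximum(unitwise_norm(weight), eps)
    g_norm = unitwise_norm(grad)
    max_norm = clip * w_norm
    # Units within the bound keep scale 1; the rest are shrunk onto the bound.
    scale = np.where(g_norm > max_norm, max_norm / np.maximum(g_norm, 1e-6), 1.0)
    return grad * scale


# Usage: the first row's gradient is small relative to its weights and is left
# untouched; the second row's gradient is large and is scaled down.
weights = np.ones((2, 4))
grads = np.array([[0.001] * 4, [1.0] * 4])
clipped = adaptive_grad_clip(grads, weights)
```

Because the threshold scales with the weight norm, the same `clip` value applies sensibly to layers whose weights differ in magnitude, which is one plausible reading of why the parameter is easier to tune than a fixed gradient-norm threshold.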