Learning Rate Warmup

  • Small learning rate at the start and then a larger learning rate when the training is stabilized
  • Linearly from 0 to initial rate
  • First m batches to warm up and if the initial learning rate is then at batch i, , learning rate is