Learning Rate Warmup

Use a small learning rate at the start of training, then switch to the larger initial learning rate once training has stabilized.
Increase the learning rate linearly from 0 to the initial learning rate.
If the first m batches are used for warmup and the initial learning rate is η, then at batch i, 1 ≤ i ≤ m, the learning rate is iη/m.
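
The linear warmup rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `warmup_lr` and its arguments are hypothetical, and holding the rate at η once batch m is reached is an assumption, since the notes only specify the rate during the warmup batches.

```python
def warmup_lr(i: int, m: int, eta: float) -> float:
    """Learning rate for batch i (1-indexed) with linear warmup over m batches.

    During warmup (1 <= i <= m) the rate is i * eta / m, so it grows
    linearly and reaches the initial rate eta at batch m.
    """
    if i < 1:
        raise ValueError("batch index i is 1-indexed and must be >= 1")
    if m <= 0 or i >= m:
        # Assumption: no warmup requested, or warmup finished; use eta as-is.
        return eta
    return i * eta / m


# Example: 5 warmup batches with an initial learning rate of 0.1.
schedule = [warmup_lr(i, m=5, eta=0.1) for i in range(1, 8)]
for batch, lr in enumerate(schedule, start=1):
    print(f"batch {batch}: lr = {lr:.3f}")
```

In a real training loop the returned value would be assigned to the optimizer's learning rate before each update; after warmup, any decay schedule can take over from η.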