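Cosine Learning Rate Decay

Instead of learning rate warmup followed by decay:

$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta$$

Here $\eta$ is the initial learning rate, $t$ is the current step, and $T$ is the total number of steps, so $\eta_0 = \eta$ and $\eta_T = 0$.

The rate decreases slowly at first, falls almost linearly in the middle, and slows down again near the end.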
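The schedule can be sketched in a few lines of plain Python. This is a minimal illustration of the formula only, not a definitive implementation: the function name `cosine_lr` and the example values (`eta = 0.1`, `T = 100`) are assumptions chosen for demonstration and do not come from the source.

```python
import math


def cosine_lr(t: int, T: int, eta: float) -> float:
    """Cosine-decayed learning rate at step t out of T total steps.

    eta is the initial learning rate (hypothetical parameter names,
    chosen to mirror the symbols in the formula).
    """
    return 0.5 * (1.0 + math.cos(math.pi * t / T)) * eta


if __name__ == "__main__":
    # Illustrative values only: initial rate 0.1 over 100 steps.
    eta, T = 0.1, 100
    for t in (0, 10, 45, 55, 90, 100):
        print(f"t={t:3d}  lr={cosine_lr(t, T, eta):.5f}")
```

Comparing the printed values for steps 0 to 10, 45 to 55, and 90 to 100 shows the shape described above: the drop over ten steps is small at both ends and largest around the midpoint.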