Learning Rate Scheduling

Learning rate decay is one of the standard tricks for gradient descent: start training with a relatively large learning rate and reduce it over time as the loss flattens out.

Increasing the batch size reduces noise in the gradient estimate, so a larger learning rate is okay. This is the rationale behind linear learning rate scaling: when the batch size is multiplied by k, multiply the learning rate by k as well. Because a large learning rate can be unstable in the first few updates, it is typically paired with learning rate warmup: ramp the learning rate up from a small value over the first stretch of training before the decay schedule takes over.
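To make the shape of such a schedule concrete, here is a minimal sketch in plain Python combining linear warmup with step decay. The function name and all default values (base_lr, warmup_steps, decay_every, decay_factor) are illustrative assumptions, not taken from the source:

```python
def lr_at_step(step, base_lr=0.1, warmup_steps=500,
               decay_every=10_000, decay_factor=0.1):
    """Illustrative schedule: linear warmup, then step decay.

    All names and default values are hypothetical, chosen only to
    show the shape of a warmup-plus-decay schedule.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: shrink by decay_factor every decay_every steps.
    num_decays = (step - warmup_steps) // decay_every
    return base_lr * (decay_factor ** num_decays)


# Sample the schedule at a few points to see the ramp and the drops.
for s in [0, 250, 500, 5_000, 10_500, 20_500]:
    print(s, lr_at_step(s))
```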
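The linear scaling rule itself is a one-liner. Again a sketch under stated assumptions: the reference batch size of 256 is a common convention for this rule, not a value given in the source:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    # Linear scaling rule: if the batch size grows by a factor k,
    # grow the learning rate by the same factor k.
    # (base_lr and base_batch_size here are illustrative values.)
    return base_lr * batch_size / base_batch_size


print(scaled_lr(1024))  # 4x the batch size -> 4x the learning rate: 0.4
```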