Linear Learning Rate Scaling If [He Initialization ] is used, 0.1 is a good learning rate for batch size 256 and for a larger b, 0.1×256b is okay