FP16 Training
- @micikeviciusMixedPrecisionTraining2018
- Reduced precision has a narrower range that might make the results more out of range and worsen the training progress
- Can store all parameters and activations in FP16 and then use that for gradients.
- Also copy to FP32 for parameter updates
- Multiply scalar to loss to align range of FP16