DeepNorm
- Modifies the residual connection in the Transformer architecture: the residual (skip) input is up-scaled before Layer Normalization is applied
- Theoretical justification: the model update is bounded by a constant, which makes stable training possible in a principled way
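
The up-scaled residual can be sketched in a few lines of plain Python. This is only an illustration, not a reference implementation: `layer_norm`, `deepnorm_residual`, and `toy_sublayer` are hypothetical names, the layer norm has no learned gain or bias, and the constant `alpha = (2N)^(1/4)` is an assumption taken from the encoder-only setting reported for DeepNet, since the notes above only say the residual is up-scaled.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer, alpha):
    """Post-LN residual with the skip input up-scaled: LN(alpha * x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([alpha * xi + oi for xi, oi in zip(x, out)])

# Hypothetical stand-in for an attention or feed-forward sub-layer.
def toy_sublayer(x):
    return [0.5 * v + 0.1 for v in x]

num_layers = 8                    # assumed depth, for illustration only
alpha = (2 * num_layers) ** 0.25  # assumed constant; alpha > 1 up-scales the skip path

x = [1.0, -2.0, 0.5, 3.0]
y = deepnorm_residual(x, toy_sublayer, alpha)
```

Setting `alpha = 1` recovers the standard Post-LN residual `LN(x + sublayer(x))`, so the only change in this sketch is the weight given to the skip path relative to the sub-layer output.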