DeepNorm

  • DeepNorm is a normalization method that modifies the residual connection in Transformers
  • It has a theoretical justification: the model update is bounded by a constant, which makes stable training possible in a principled way
  • Concretely, DeepNorm up-scales the residual connection in the Transformer architecture before performing Layer Normalization
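
The up-scaling step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. The function names (`layer_norm`, `deepnorm_residual`, `encoder_alpha`) are hypothetical, the Layer Normalization has no learned gain or bias, and the choice alpha = (2N)^(1/4) for an N-layer encoder-only model is taken from the DeepNet paper as an assumption, since these notes do not state a value.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance.

    Simplification: no learned gain or bias parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer_out, alpha):
    """Up-scale the residual input by alpha, add the sublayer output,
    then apply Layer Normalization: LN(alpha * x + G(x))."""
    return layer_norm([alpha * xi + si for xi, si in zip(x, sublayer_out)])


def encoder_alpha(num_layers):
    """Assumed scaling constant for an encoder-only model with
    num_layers layers: alpha = (2N)^(1/4)."""
    return (2 * num_layers) ** 0.25


# Usage: one residual step for a toy 4-dimensional hidden state.
x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.25, -0.25]  # stand-in for attention/FFN output
y = deepnorm_residual(x, sublayer_out, encoder_alpha(8))
```

With `alpha = 1` the function reduces to the ordinary Post-LN residual `LN(x + G(x))`. A larger `alpha` makes the residual input dominate the sum, so each sublayer contributes a relatively smaller change to the hidden state.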