allows training extremely deep Transformers with 1,000+ layers
fundamental, effective and simple
can be used in any Transformer architecture (encoder, decoder, encoder-decoder), so it applies to almost all tasks across AI areas (language, vision, speech, multimodal, and beyond)
These two tricks lead to greater stability during training, which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1,000 layers.
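
As a rough illustration, here is a minimal sketch of what the two tricks could look like in code. It assumes they are the ones described in the DeepNet paper: up-weighting the residual branch by a constant alpha before layer normalization, and down-scaling selected weights at initialization by a constant beta. The function names are hypothetical, the constants are those the paper gives for encoder-only or decoder-only stacks, and plain Python lists stand in for tensors.

```python
import math


def deepnorm_constants(num_layers):
    """Return (alpha, beta) for an encoder-only or decoder-only stack.

    Assumption: these are the constants from the DeepNet paper,
    alpha = (2N)^(1/4) and beta = (8N)^(-1/4) for N layers.
    """
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


def layer_norm(x, eps=1e-5):
    """Layer normalization over one vector, without learned gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer_out, alpha):
    """Trick 1: scale the residual input by alpha, add the sublayer
    output, then apply layer normalization (Post-LN order)."""
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer_out)])


def scale_init(weights, beta):
    """Trick 2: shrink freshly initialized weights by beta.
    Which weights are scaled is an assumption left to the caller."""
    return [[beta * w for w in row] for row in weights]


alpha, beta = deepnorm_constants(1000)
hidden = deepnorm_residual([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.25, 0.0], alpha)
```

For a 1,000-layer stack this gives alpha of roughly 6.69 and beta of roughly 0.106, so the residual path dominates each sublayer's update early in training.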