DeepNet

  • DeepNet: Scaling Transformers to 1,000 Layers
  • allows training extremely deep Transformers with 1,000+ layers
  • fundamental, effective, and simple
  • can be used in any Transformer architecture (encoder-only, decoder-only, or encoder-decoder), covering almost all tasks across AI areas (language, vision, speech, multimodal, and beyond)
  • DeepNorm: a newly proposed normalization function
  • It works alongside a dedicated initialization scheme based on Xavier initialization.
  • These two techniques lead to greater training stability, which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1,000 layers
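
The core idea above can be sketched in a few lines. DeepNorm replaces the usual Post-LN residual LN(x + G(x)) with LN(α·x + G(x)), where α > 1 up-weights the residual stream, and selected sublayer weights are initialized with Xavier scaled by a gain β < 1. A minimal NumPy sketch, using the decoder-only constants from the paper, α = (2N)^(1/4) and β = (8N)^(-1/4) for an N-layer model; `layer_norm` and `deepnorm_residual` are illustrative helper names, not the authors' code:

```python
import numpy as np

def deepnorm_constants(num_layers):
    # Decoder-only setting from the DeepNet paper:
    # alpha scales the residual branch, beta is the Xavier gain
    # for selected sublayer weights at initialization.
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm over the last axis (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer_out, alpha):
    # DeepNorm: LN(alpha * x + G(x)) instead of Post-LN's LN(x + G(x)).
    return layer_norm(alpha * x + sublayer_out)
```

For a 1,000-layer decoder this gives α = 2000^(1/4) ≈ 6.69, so the residual stream dominates each sublayer's contribution early in training, which is what keeps optimization stable at extreme depth.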