DeepNet
- DeepNet: Scaling Transformers to 1,000 Layers
- allows training extremely deep Transformers with 1,000+ layers
- fundamental, effective and simple
- can be used in any Transformer architecture (encoder, decoder, encoder-decoder), covering tasks across AI areas (language, vision, speech, multimodal, and beyond)
- newly proposed normalization function
- DeepNorm
- It works alongside a dedicated initialization scheme based on Xavier initialization.
- These two techniques lead to greater stability during training, allowing the authors to scale their modified Transformer architecture (DeepNet) up to 1,000 layers (a minimal sketch follows this list).
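
As a rough illustration, here is a minimal PyTorch sketch of a single feed-forward sub-layer wrapped with the DeepNorm residual, x_{l+1} = LayerNorm(α·x_l + F(x_l)), together with β-scaled Xavier initialization. The constants α = (2N)^(1/4) and β = (8N)^(-1/4) are the paper's encoder-only / decoder-only values; the class and function names (`DeepNormFFN`, `deepnorm_constants`) are made up for this example and are not from the official codebase.

```python
import torch
import torch.nn as nn


def deepnorm_constants(num_layers: int):
    # Encoder-only / decoder-only case from the paper:
    # alpha = (2N)^(1/4), beta = (8N)^(-1/4)
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


class DeepNormFFN(nn.Module):
    """One feed-forward sub-layer using the DeepNorm residual:
    x_{l+1} = LayerNorm(alpha * x_l + FFN(x_l))."""

    def __init__(self, d_model: int, d_ff: int, alpha: float, beta: float):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        # Xavier initialization down-scaled by beta (the paper applies this to
        # the FFN weights and the attention value/output projections).
        nn.init.xavier_normal_(self.fc1.weight, gain=beta)
        nn.init.xavier_normal_(self.fc2.weight, gain=beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.fc2(torch.relu(self.fc1(x)))
        # DeepNorm: up-scale the residual connection by alpha before Post-LN.
        return self.norm(self.alpha * residual + x)


if __name__ == "__main__":
    alpha, beta = deepnorm_constants(num_layers=1000)
    block = DeepNormFFN(d_model=512, d_ff=2048, alpha=alpha, beta=beta)
    out = block(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```

Scaling the residual branch by α keeps the magnitude of model updates bounded as depth grows, which is what lets the 1,000-layer stack train without the divergence seen in plain Post-LN Transformers.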