DeepNet

  • DeepNet: Scaling Transformers to 1,000 Layers
  • allows training extremely deep Transformers with 1,000+ layers
  • fundamental, effective, and simple
  • can be used in any Transformer architecture (encoder-only, decoder-only, or encoder-decoder), covering almost all tasks across AI areas (language, vision, speech, multimodal, and beyond)
  • DeepNorm: a newly proposed normalization function
  • It works alongside a dedicated initialization scheme based on Xavier initialization.
  • These two techniques lead to greater training stability, which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1,000 layers
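
The core idea above can be sketched in a few lines. DeepNorm replaces the usual Post-LN residual LN(x + G(x)) with LN(α·x + G(x)), where α > 1 up-weights the residual stream, and selected sublayer weights are initialized with Xavier scaled by a gain β < 1. A minimal NumPy sketch, using the decoder-only constants from the paper, α = (2N)^(1/4) and β = (8N)^(-1/4) for an N-layer model; `layer_norm` and `deepnorm_residual` are illustrative helper names, not the authors' code:

```python
import numpy as np

def deepnorm_constants(num_layers):
    # Decoder-only setting from the DeepNet paper:
    # alpha scales the residual branch, beta is the Xavier gain
    # for selected sublayer weights at initialization.
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm over the last axis (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer_out, alpha):
    # DeepNorm: LN(alpha * x + G(x)) instead of Post-LN's LN(x + G(x)).
    return layer_norm(alpha * x + sublayer_out)
```

For a 1,000-layer decoder this gives α = 2000^(1/4) ≈ 6.69, so the residual stream dominates each sublayer's contribution early in training, which is what keeps optimization stable at extreme depth.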