ConvBERT

  • Convolutional BERT (ConvBERT) improves on the original BERT by replacing some of the self-attention heads in each multi-head attention layer with cheaper, naturally local operations called span-based dynamic convolutions. The convolution heads and the remaining self-attention heads together form a mixed attention mechanism: the self-attention heads capture global patterns, while the convolutions handle the local patterns that many attention heads would otherwise spend their capacity learning. As a result, ConvBERT reduces the computational cost of training compared with BERT.
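
The mixed attention idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: it uses one self-attention head and one convolution head, a plain windowed projection in place of the depthwise-separable convolution that produces the span-aware key, and separate (rather than shared) query projections. The names `init_params`, `span_dynamic_conv_head`, and `mixed_attention` are hypothetical and chosen only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(x, wq, wk, wv):
    """Global branch: every position attends to every other position."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def span_dynamic_conv_head(x, wq, wv, w_span, w_kernel, kernel_size):
    """Local branch: a kernel generated per position mixes only nearby values."""
    n = x.shape[0]
    pad = kernel_size // 2  # assumes an odd kernel_size
    q, v = x @ wq, x @ wv
    # Span-aware key: summarise the window of inputs around each position.
    # Assumption: a flat windowed projection stands in for the
    # depthwise-separable convolution described in the ConvBERT paper.
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    windows = np.stack([x_pad[i:i + kernel_size].reshape(-1) for i in range(n)])
    k_span = windows @ w_span                      # (n, d_head)
    # Dynamic kernel: one softmax-normalised weight per window slot.
    kernel = softmax((q * k_span) @ w_kernel)      # (n, kernel_size)
    # Apply each position's kernel to its own window of values.
    v_pad = np.pad(v, ((pad, pad), (0, 0)))
    v_windows = np.stack([v_pad[i:i + kernel_size] for i in range(n)])
    return np.einsum("nk,nkd->nd", kernel, v_windows)

def init_params(rng, d_model, d_head, kernel_size):
    """Random weights for illustration only (hypothetical helper)."""
    def w(*shape):
        return rng.normal(scale=0.25, size=shape)
    return {
        "attn": (w(d_model, d_head), w(d_model, d_head), w(d_model, d_head)),
        "conv": (w(d_model, d_head), w(d_model, d_head),
                 w(kernel_size * d_model, d_head), w(d_head, kernel_size)),
    }

def mixed_attention(x, params, kernel_size=3):
    """Concatenate the global (attention) and local (convolution) head outputs."""
    attn = self_attention_head(x, *params["attn"])
    conv = span_dynamic_conv_head(x, *params["conv"], kernel_size)
    return np.concatenate([attn, conv], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))                   # 8 tokens, model width 16
    params = init_params(rng, d_model=16, d_head=4, kernel_size=3)
    print(mixed_attention(x, params).shape)
```

In this sketch the convolution head's output at a position depends only on tokens within `kernel_size // 2` steps of it, whereas the attention head's output depends on the whole sequence. The cost difference follows from the shapes: the attention branch builds an n-by-n score matrix, while the convolution branch builds only an n-by-`kernel_size` kernel.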