Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively
integrates components from both CNNs and Transformers for end-to-end [speech recognition](speech recognition.md), modeling both local and global dependencies of an audio sequence in a parameter-efficient way
ablation studies of each component demonstrated that the convolution module is critical to the Conformer model's performance
propose the convolution-augmented Transformer for [speech recognition](speech recognition.md), named Conformer
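the Conformer block sandwiches self-attention and a convolution module between two half-step feed-forward layers (Macaron style), with a final layer norm. A minimal NumPy sketch of that composition, assuming simplified stand-ins for each sub-module (toy dimensions, random weights, no relative positional encoding or gating as in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # model dimension (illustrative, far smaller than the paper's)

def layer_norm(x, eps=1e-5):
    # normalize each time step over the feature dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def feed_forward(x, w1, w2):
    # pointwise feed-forward with a swish activation
    h = x @ w1
    h = h * (1.0 / (1.0 + np.exp(-h)))  # swish
    return h @ w2

def self_attention(x, wq, wk, wv):
    # single-head scaled dot-product attention (global context)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a /= a.sum(-1, keepdims=True)
    return a @ v

def depthwise_conv(x, kernel):
    # per-channel 1-D convolution over time, same padding (local context)
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out

def conformer_block(x, p):
    # Macaron structure: two half-step FFNs around attention and convolution
    x = x + 0.5 * feed_forward(layer_norm(x), *p["ffn1"])
    x = x + self_attention(layer_norm(x), *p["att"])
    x = x + depthwise_conv(layer_norm(x), p["kernel"])
    x = x + 0.5 * feed_forward(layer_norm(x), *p["ffn2"])
    return layer_norm(x)

params = {
    "ffn1": (rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))),
    "ffn2": (rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))),
    "att": tuple(rng.normal(0, 0.1, (d, d)) for _ in range(3)),
    "kernel": rng.normal(0, 0.1, (7, d)),  # depthwise kernel, width 7 per channel
}

x = rng.normal(size=(20, d))  # (time steps, features)
y = conformer_block(x, params)
print(y.shape)  # same shape as the input: (20, 16)
```

the residual structure keeps every sub-module shape-preserving, so blocks can be stacked; attention supplies the global interactions and the depthwise convolution the local ones.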