Transformer-XL
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling
- enables learning dependencies beyond a fixed length without disrupting temporal coherence
- introduces a segment-level recurrence mechanism and a novel relative positional encoding scheme (see the sketch after this list)
- resolves the context fragmentation problem
- enwik8
- WikiText-103
- One Billion Word
- Penn Treebank
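
A minimal sketch of the segment-level recurrence idea, assuming a PyTorch-style implementation: hidden states from the previous segment are cached, detached from the gradient graph, and prepended as extra keys/values when the current segment is attended over. The class and parameter names (`SegmentRecurrentAttention`, `mem_len`) are illustrative assumptions, not names from the paper's code, and the sketch omits the paper's relative positional encoding, which is what makes cached states reusable across segments without positional confusion.

```python
import torch
import torch.nn as nn

class SegmentRecurrentAttention(nn.Module):
    """Causal self-attention over [cached memory; current segment]."""

    def __init__(self, d_model: int, n_heads: int, mem_len: int):
        super().__init__()
        self.n_heads, self.d_head, self.mem_len = n_heads, d_model // n_heads, mem_len
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mem=None):
        # x:   (batch, seg_len, d_model) -- current segment
        # mem: (batch, mem_len, d_model) -- hidden states cached from the previous segment
        if mem is None:
            mem = x.new_zeros(x.size(0), 0, x.size(2))
        d_model = x.size(2)
        # Keys/values see [memory; segment]; queries come only from the current
        # segment, and the memory is detached so no gradient flows into it.
        context = torch.cat([mem.detach(), x], dim=1)
        q = self.qkv(x)[..., :d_model]
        k, v = self.qkv(context)[..., d_model:].chunk(2, dim=-1)

        def heads(t):  # (batch, len, d_model) -> (batch, n_heads, len, d_head)
            return t.view(t.size(0), t.size(1), self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        seg_len, ctx_len = x.size(1), context.size(1)
        # Causal mask: segment position i attends to all memory slots and to
        # segment positions <= i.
        mask = torch.tril(torch.ones(seg_len, ctx_len, device=x.device),
                          diagonal=ctx_len - seg_len).bool()
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        y = self.out((attn @ v).transpose(1, 2).reshape(x.size(0), seg_len, d_model))
        # Cache the newest mem_len hidden states (detached) for the next segment.
        new_mem = torch.cat([mem, y], dim=1)[:, -self.mem_len:].detach()
        return y, new_mem

# Usage: carry `mem` across consecutive segments of a long sequence.
layer = SegmentRecurrentAttention(d_model=64, n_heads=4, mem_len=32)
mem = None
for segment in torch.randn(3, 2, 16, 64):  # three segments, batch 2, length 16
    y, mem = layer(segment, mem)
```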