Transformer-XL

  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the language-modeling setting
  • learns dependencies beyond a fixed length without disrupting temporal coherence
  • introduces a segment-level recurrence mechanism and a novel relative positional encoding scheme (see the sketch after this list)
  • resolves the context fragmentation problem
  • enwik8
  • WikiText-103
  • One Billion Word
  • Penn Treebank
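
A minimal sketch of the segment-level recurrence idea: each layer attends over the current segment plus a cached memory of hidden states from previous segments, and gradients are stopped at the memory boundary. This is an illustrative PyTorch approximation, not the paper's implementation; names like `SimpleAttentionLayer`, `RecurrentEncoder`, and `mem_len` are invented here, and the relative positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class SimpleAttentionLayer(nn.Module):
    """One attention layer whose keys/values span memory + current segment."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())

    def forward(self, x, mem):
        # Queries come from the current segment only; keys/values also
        # cover the cached hidden states of earlier segments.
        kv = torch.cat([mem, x], dim=1) if mem is not None else x
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return self.ff(out) + x

class RecurrentEncoder(nn.Module):
    def __init__(self, n_layers=2, d_model=64, n_heads=4, mem_len=32):
        super().__init__()
        self.layers = nn.ModuleList(
            SimpleAttentionLayer(d_model, n_heads) for _ in range(n_layers))
        self.mem_len = mem_len

    def forward(self, x, mems=None):
        if mems is None:
            mems = [None] * len(self.layers)
        new_mems = []
        for layer, mem in zip(self.layers, mems):
            # Cache this layer's input as memory for the next segment;
            # detach() stops gradients flowing into previous segments.
            cached = x if mem is None else torch.cat([mem, x], dim=1)
            new_mems.append(cached[:, -self.mem_len:].detach())
            x = layer(x, mem)
        return x, new_mems

# Usage: process consecutive segments, carrying the memory forward.
enc = RecurrentEncoder()
mems = None
for segment in torch.randn(4, 3, 16, 64).unbind(1):  # 3 segments of length 16
    out, mems = enc(segment, mems)
```

Because the memory is carried across segments, the effective context grows beyond a single segment's length, which is how the recurrence avoids the context fragmentation that a fixed-length window causes.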