XLM-R
- Unsupervised Cross-lingual Representation Learning at Scale
- pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks
- Transformer-based masked language model trained on one hundred languages, using more than two terabytes of filtered CommonCrawl data
- significantly outperforms multilingual BERT (mBERT) on cross-lingual benchmarks such as XNLI, MLQA, and NER
- gains are particularly large for low-resource languages (e.g. Swahili, Urdu)
- analyzes the trade-off between positive transfer and capacity dilution (the "curse of multilinguality")
- studies the performance of high- and low-resource languages at scale
- demonstrates the possibility of multilingual modeling without sacrificing per-language performance
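
The masked-language-model objective mentioned above can be sketched in a few lines. This is a minimal illustration of BERT/XLM-R-style token masking (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged), not the actual XLM-R preprocessing code; the function and token names are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"  # illustrative placeholder for the tokenizer's mask token

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~mask_prob of positions as prediction
    targets. Of the selected positions, 80% become MASK_TOKEN, 10% a
    random vocabulary token, and 10% keep the original token.
    Returns (masked_tokens, target_positions)."""
    rng = rng or random.Random()
    masked = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: position is a target but the token is left unchanged
    return masked, targets
```

During pretraining, the model is trained to recover the original tokens at the target positions only; all other positions contribute no loss.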