XLM-R

  • Unsupervised Cross-lingual Representation Learning at Scale
  • pretraining multilingual language models at scale leads to significant performance gains on a wide range of cross-lingual transfer tasks
  • Transformer-based masked language model trained on one hundred languages, using more than two terabytes of filtered CommonCrawl data
  • significantly outperforms multilingual BERT (mBERT) on cross-lingual benchmarks
  • gains are especially large for low-resource languages
  • analyzes the trade-off between positive transfer and capacity dilution as more languages are added
  • studies the performance of high- and low-resource languages at scale
  • shows the possibility of multilingual modeling without sacrificing per-language performance
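The masked-language-model objective behind XLM-R can be sketched as BERT-style token masking: roughly 15% of positions are selected as prediction targets, and of those 80% are replaced with a mask token, 10% with a random token, and 10% left unchanged. A minimal sketch (the `MASK_ID` and `VOCAB_SIZE` values below are placeholders, not XLM-R's actual SentencePiece vocabulary):

```python
import random

MASK_ID = 0       # placeholder mask-token id for this sketch
VOCAB_SIZE = 100  # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=42):
    """BERT-style masking: select ~mask_prob of positions as targets;
    of those, 80% become MASK_ID, 10% a random token, 10% unchanged.
    Returns (inputs, labels); labels are -100 at unselected positions
    (the convention for positions ignored by the loss)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)          # this position is a prediction target
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_ID)  # 80%: replace with mask token
            elif r < 0.9:
                inputs.append(rng.randrange(1, VOCAB_SIZE))  # 10%: random token
            else:
                inputs.append(tok)      # 10%: keep original token
        else:
            inputs.append(tok)
            labels.append(-100)         # not a target; ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens(list(range(1, 21)))
```

The model is then trained to predict the original token at every position where the label is not -100; XLM-R applies this same objective over monolingual CommonCrawl text in all one hundred languages, with no parallel data.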