BART
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Denoising Autoencoder
- a pre-training scheme for sequence-to-sequence models
- trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text
- generalizing BERT (bidirectional encoder) and GPT (left-to-right decoder)
- finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token (a toy sketch of this corruption appears after these notes)
- With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.
- With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation.
- applies noising schemes to the input document, corrupting it by replacing spans of text with mask symbols
- particularly effective when fine-tuned for text generation, but also works well for comprehension tasks (see the generation and classification sketches below)
- matches the performance of RoBERTa, with comparable training resources, on GLUE and SQuAD
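
Below is a toy sketch of the corruption described above: sentence permutation plus text infilling, where each sampled span (with Poisson-distributed length, lambda = 3 in the paper) is replaced by a single `<mask>` token. It assumes only the Python standard library and uses naive whitespace/period splitting purely for illustration; it is not the paper's actual preprocessing code.

```python
import math
import random


def poisson(lam: float) -> int:
    """Sample a span length from a Poisson distribution (Knuth's method)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1


def text_infilling(tokens: list[str], mask_ratio: float = 0.3) -> list[str]:
    """Replace spans of tokens with a single <mask> until ~mask_ratio of tokens are masked."""
    tokens = tokens[:]
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and tokens:
        span = min(poisson(3.0), budget, len(tokens))
        start = random.randrange(len(tokens) - span + 1)
        tokens[start:start + span] = ["<mask>"]  # a whole span (possibly empty) becomes one mask
        budget -= max(span, 1)
    return tokens


def sentence_permutation(document: str) -> str:
    """Randomly shuffle the order of sentences in a document."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."


doc = "BART is a denoising autoencoder. It corrupts text and learns to reconstruct it."
corrupted = " ".join(text_infilling(sentence_permutation(doc).split()))
print(corrupted)  # e.g. "It corrupts <mask> reconstruct it. BART is <mask> autoencoder."
```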
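
For the generation side, a quick way to see the denoising behaviour is to give a pretrained checkpoint a corrupted sentence and let the decoder regenerate it auto-regressively. This sketch assumes the HuggingFace `transformers` library and the public `facebook/bart-base` checkpoint; it is an inference-time illustration, not the pre-training procedure itself.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A span of the input has been collapsed into the single <mask> token.
corrupted = "BART is a denoising <mask> for pretraining sequence-to-sequence models."
inputs = tokenizer(corrupted, return_tensors="pt")

# The decoder reconstructs the full sequence left to right, token by token.
output_ids = model.generate(**inputs, max_length=40, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```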
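
For comprehension-style tasks such as GLUE, the paper feeds the same input to the encoder and decoder and classifies from the final decoder representation. A minimal fine-tuning-style forward pass, again assuming HuggingFace `transformers` (whose `BartForSequenceClassification` mirrors this setup), might look like:

```python
import torch
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)

batch = tokenizer(
    ["the movie was great", "the movie was terrible"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

# Forward pass returns the classification loss and per-class logits;
# in a real fine-tuning loop this loss would be backpropagated.
outputs = model(**batch, labels=labels)
print(outputs.loss.item(), outputs.logits.argmax(dim=-1))
```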