ELECTRA
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- Pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
- proposes a more sample-efficient alternative pre-training task called replaced token detection
- self-supervised task for language representation learning
- Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network
- Then, instead of training a model that predicts the original identities of the corrupted tokens, the key idea is to train a discriminative text encoder that predicts, for every token, whether it is an original input token or a replacement sampled from the generator (see the sketch after this list)
- more compute-efficient than masked language modeling and yields better downstream performance, since the model learns from all input positions rather than just the small masked subset
- gains are particularly strong for small models
- on the GLUE benchmark, ELECTRA performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute, and outperforms them when using the same amount of compute
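Below is a minimal PyTorch sketch of the replaced-token-detection setup described above. The module names (`GeneratorMLM`, `Discriminator`) and all sizes are hypothetical toy stand-ins, not the official ELECTRA implementation; the real models are Transformer encoders. The sketch only illustrates the data flow: the generator samples replacements at masked positions, and the discriminator is trained with a per-token binary loss over every position.

```python
# Minimal sketch of ELECTRA-style replaced token detection.
# GeneratorMLM / Discriminator are toy placeholders (embedding + linear head);
# a real ELECTRA uses Transformer encoders for both.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, HIDDEN, MASK_ID = 1000, 64, 0

class GeneratorMLM(nn.Module):
    """Small masked language model that proposes plausible replacement tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, ids):
        return self.out(self.embed(ids))  # (batch, seq, vocab) logits

class Discriminator(nn.Module):
    """Placeholder text encoder predicting, per token, original vs. replaced."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, ids):
        return self.head(self.embed(ids)).squeeze(-1)  # (batch, seq) logits

generator, discriminator = GeneratorMLM(), Discriminator()

# Toy batch of token ids and a ~15% masking pattern.
input_ids = torch.randint(1, VOCAB_SIZE, (2, 16))
mask = torch.rand(input_ids.shape) < 0.15
masked_ids = input_ids.masked_fill(mask, MASK_ID)

# 1) Generator fills masked positions by sampling from its MLM distribution.
gen_logits = generator(masked_ids)
sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted_ids = torch.where(mask, sampled, input_ids)

# 2) A position counts as "replaced" only if the sampled token differs from
#    the original (if the generator happens to sample the right token, the
#    label stays "original").
labels = (corrupted_ids != input_ids).float()

# 3) Discriminator loss: binary classification over *every* position, which
#    is why the task provides signal from all tokens, not just the masked ones.
disc_logits = discriminator(corrupted_ids)
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, labels)

# Generator keeps the usual MLM loss on the masked positions only.
gen_loss = F.cross_entropy(gen_logits[mask], input_ids[mask]) if mask.any() else torch.tensor(0.0)

# Combined objective; the paper up-weights the discriminator term (lambda = 50).
loss = gen_loss + 50.0 * disc_loss
print(float(loss))
```

After pre-training, the generator is discarded and only the discriminator is fine-tuned on downstream tasks, which is consistent with the compute-efficiency claims in the notes above.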