Pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
As a more sample-efficient alternative, the authors propose replaced token detection, a self-supervised pre-training task for language representation learning.
Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the key idea is to train a discriminative text encoder that distinguishes original input tokens from the high-quality negative samples produced by the generator.
This approach is more compute-efficient than masked language modeling and results in better performance on downstream tasks.
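To make the generator/discriminator setup concrete, the sketch below shows a minimal PyTorch version of a replaced-token-detection objective. It is an illustration under assumptions, not the original implementation: the `Encoder` class, layer sizes, vocabulary size, mask id, mask rate, and loss weighting are all placeholders chosen for a runnable toy example.

```python
# Minimal sketch of replaced token detection (illustrative sizes and ids;
# VOCAB, HIDDEN, MASK_ID, MASK_PROB, and the loss weight are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MASK_ID, MASK_PROB = 30522, 256, 103, 0.15

class Encoder(nn.Module):
    """A toy Transformer encoder standing in for either network."""
    def __init__(self, hidden, layers):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))

generator = Encoder(HIDDEN // 2, layers=2)   # small generator network
gen_head = nn.Linear(HIDDEN // 2, VOCAB)     # fills in masked positions
discriminator = Encoder(HIDDEN, layers=4)    # the encoder that gets pre-trained
disc_head = nn.Linear(HIDDEN, 1)             # per-token original-vs-replaced logit

def replaced_token_detection_loss(input_ids):
    # 1. Corrupt the input by masking a random subset of positions.
    masked = torch.rand(input_ids.shape) < MASK_PROB
    gen_input = input_ids.masked_fill(masked, MASK_ID)

    # 2. The generator is trained as a masked language model on those positions.
    gen_logits = gen_head(generator(gen_input))
    mlm_loss = F.cross_entropy(gen_logits[masked], input_ids[masked])

    # 3. Sample plausible replacements from the generator (no gradient through sampling).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked, sampled, input_ids)

    # 4. The discriminator predicts, for every token, whether it was replaced.
    is_replaced = (corrupted != input_ids).float()
    disc_logits = disc_head(discriminator(corrupted)).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # Joint objective: generator MLM loss plus weighted discriminator loss.
    return mlm_loss + 50.0 * disc_loss

# Toy usage on a random batch of token ids (8 sequences of length 128).
loss = replaced_token_detection_loss(torch.randint(0, VOCAB, (8, 128)))
loss.backward()
```

Note that the discriminator's binary loss is computed over every input position rather than only the small masked subset, which is one intuition for why the task can be more sample-efficient than masked language modeling.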