evaluates a number of design decisions made when pretraining BERT models.
They find that BERT was significantly undertrained and that, with an improved training procedure, it can match or exceed the performance of every model published after it.
Specifically, performance can be substantially improved by training the model longer, with bigger batches, over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
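Of these changes, dynamic masking is the easiest to illustrate in a few lines. The following is a minimal sketch, not the authors' implementation: it contrasts a mask drawn once and reused with a mask redrawn every time a sequence is used. The function names, the `[MASK]` placeholder, the toy vocabulary, and the 15% masking rate with an 80/10/10 replacement split are assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol (assumed for this sketch)


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None where nothing is predicted."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # replace with a random token
            else:
                masked.append(tok)                # keep the token unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def static_epochs(tokens, vocab, num_epochs, seed=0):
    """Static masking: mask once up front and reuse the result every epoch."""
    fixed = mask_tokens(tokens, vocab, random.Random(seed))
    return [fixed for _ in range(num_epochs)]


def dynamic_epochs(tokens, vocab, num_epochs, seed=0):
    """Dynamic masking: draw a fresh mask each time the sequence is used."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(num_epochs)]


if __name__ == "__main__":
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    for masked, _ in dynamic_epochs(sentence, vocab, num_epochs=3):
        print(" ".join(masked))
```

With static masking the model sees the same masked positions for a given sequence in every epoch, whereas dynamic masking exposes it to different prediction targets as training runs longer. A real pipeline would work on token IDs and typically exclude special tokens from masking.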