MLM ([Masked Language Modeling](Masked Language Modeling.md))
- from BERT's pretraining
- 15% of the words in each sequence are replaced by [MASK]
- model tries to predict original values of the masked words
- uses the context provided by the other, non-masked words in the sequence
- loss function only considers the predictions of the masked words and ignores the non-masked ones
- leads to slower convergence than with directional models (only 15% of the tokens contribute to the loss in each batch)
- additions to standard architecture:
- classification layer on top of the encoder output
- multiplying the encoder's output vectors with the embedding matrix → transforms them into the vocabulary dimension
- calculating the probability of each word in the vocabulary using softmax
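The steps above (mask ~15% of positions, project encoder outputs to the vocabulary via the embedding matrix, softmax, loss on masked positions only) can be sketched in plain Python; a toy sketch, assuming illustrative sizes, a hypothetical `MASK_ID`, and random vectors standing in for real encoder output:

```python
import math
import random

random.seed(0)

VOCAB_SIZE, HIDDEN, SEQ_LEN = 20, 8, 10
MASK_ID = 0  # hypothetical id for the [MASK] token (assumption, not BERT's real id)

# toy embedding matrix, reused as the output projection into the vocabulary dimension
embedding = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]

def mask_tokens(token_ids, mask_prob=0.15):
    """Replace ~15% of positions with MASK_ID; remember which positions were masked."""
    masked, positions = [], []
    for i, t in enumerate(token_ids):
        if random.random() < mask_prob:
            masked.append(MASK_ID)
            positions.append(i)
        else:
            masked.append(t)
    return masked, positions

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mlm_head(encoder_output):
    """Multiply each encoder output vector with the embedding matrix, then softmax."""
    probs = []
    for vec in encoder_output:
        logits = [sum(v * e for v, e in zip(vec, row)) for row in embedding]
        probs.append(softmax(logits))  # probability of each word in the vocabulary
    return probs

def mlm_loss(probs, targets, masked_positions):
    """Cross-entropy over masked positions only; non-masked ones are ignored."""
    if not masked_positions:
        return 0.0
    return -sum(math.log(probs[i][targets[i]]) for i in masked_positions) / len(masked_positions)

tokens = [random.randrange(1, VOCAB_SIZE) for _ in range(SEQ_LEN)]
masked_ids, masked_positions = mask_tokens(tokens)
encoder_out = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]  # stand-in for the encoder
probs = mlm_head(encoder_out)
loss = mlm_loss(probs, tokens, masked_positions)
```

Because the loss sums only over `masked_positions`, gradients would flow only from the ~15% of predictions per batch, which is why convergence is slower than in directional models that predict every token.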