MLM ([Masked Language Modeling](Masked Language Modeling.md))

  • 15% of the tokens in each sequence are replaced with a [MASK] token
  • model tries to predict original values of the masked words
  • uses the context provided by the other, non-masked words in the sequence
  • loss function only considers the predictions for the masked words and ignores the non-masked ones
    • leads to slower convergence than with unidirectional (left-to-right) models, since only ~15% of positions contribute to the loss
  • additions to standard architecture:
    • classification layer on top of the encoder output
    • multiplying the encoder's output vectors by the (transposed) embedding matrix transforms them into the vocabulary dimension
    • calculating the probability of each word in the vocabulary with softmax
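The steps above can be sketched in NumPy. This is a minimal illustration, not BERT's actual implementation: the encoder is replaced by a plain embedding lookup (a real model would run transformer layers there), and `MASK_ID` and all sizes are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, seq_len = 100, 16, 12
MASK_ID = 3  # hypothetical id of the [MASK] token

# masking step: pick ~15% of positions and replace them with [MASK]
tokens = rng.integers(5, vocab_size, size=seq_len)
mask = rng.random(seq_len) < 0.15
inputs = np.where(mask, MASK_ID, tokens)

# stand-in for the encoder: an embedding lookup
embedding = rng.standard_normal((vocab_size, d_model))
hidden = embedding[inputs]            # (seq_len, d_model)

# classification layer: multiply by the transposed embedding matrix
# to map encoder outputs back into the vocabulary dimension
logits = hidden @ embedding.T         # (seq_len, vocab_size)

# softmax over the vocabulary gives per-position word probabilities
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

# loss considers only the masked positions
loss = -np.log(probs[mask, tokens[mask]]).mean() if mask.any() else 0.0
```

Reusing the input embedding matrix as the output projection (weight tying) is what the note's "multiplying by the embedding matrix" refers to; it keeps the head small and ties input and output representations together.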