CBOW tries to predict the current target word (the center word) from the source context words (the surrounding words).
For the sentence “the quick brown fox jumps over the lazy dog”, the training data can be built as (context_window, target_word) pairs. With a context window of size 2 (one word on each side), we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on; a sketch of this pairing follows below.
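As a quick illustration, here is a minimal Python sketch of how such pairs could be generated (the variable names and the `half_window` convention are assumptions for illustration, not from the source):

```python
# Build CBOW (context_window, target_word) pairs from a sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
half_window = 1  # one word on each side -> a context window of size 2 in total

pairs = []
for i, target in enumerate(sentence):
    left = sentence[max(0, i - half_window):i]        # words before the target
    right = sentence[i + 1:i + 1 + half_window]       # words after the target
    pairs.append((left + right, target))

print(pairs[2])  # (['quick', 'fox'], 'brown')
```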
CBOW is several times faster to train than skip-gram and achieves slightly better accuracy on frequent words.
CBOW is prone to overfitting frequent words, because they appear many times with the same contexts.
CBOW tends to estimate the probability of a word occurring in a given context.
Because it averages the context words, it generalizes over all the different contexts in which a word can be used (see the formula sketch below).
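In the standard word2vec formulation (the notation here is an assumption, not taken from the source: $v$ are input/projection vectors, $u$ are output vectors, $m$ is the half-window, $V$ the vocabulary), the averaged hidden vector and the predicted distribution over the center word $w_t$ are:

$$
h = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} v_{w_{t+j}},
\qquad
P(w_t \mid w_{t-m}, \dots, w_{t+m}) = \frac{\exp\!\left(u_{w_t}^{\top} h\right)}{\sum_{w \in V} \exp\!\left(u_{w}^{\top} h\right)}
$$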
Like skip-gram, CBOW is a neural network with a single hidden layer.
The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word.
Again, the projection weights that map one-hot words to dense vectors (of the same width as the hidden layer, so they can be averaged) are interpreted as the word embeddings, as sketched below.
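A minimal NumPy sketch of this forward pass, where the rows of `W_in` are the embeddings described above (the vocabulary size, embedding width, and word ids are toy assumptions, not values from the source):

```python
import numpy as np

vocab_size, embed_dim = 9, 5
rng = np.random.default_rng(0)

W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # projection weights = the word embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output (softmax) weights

def cbow_forward(context_ids):
    h = W_in[context_ids].mean(axis=0)   # hidden layer: average of the context embeddings
    scores = h @ W_out                   # score every vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()               # P(center word | context)

# e.g. hypothetical ids for ["quick", "fox"] predicting "brown"
probs = cbow_forward(np.array([1, 3]))
print(probs.shape)  # (9,) -- a distribution over the vocabulary
```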