the distributed representation of the input word is used to predict the context.
tries to predict the neighbors of a word
works well with a small amount of training data, and represents even rare words or phrases well.
Skip-gram relies on single-word inputs, so it is less prone to overfitting frequent words: even though frequent words are presented more often than rare words during training, they still appear individually.
tends to learn the different contexts of a word separately
needs more data to train, but contains more knowledge about the context.
takes in pairs (word1, word2) generated by sliding a window across the text, and trains a neural network with a single hidden layer on a synthetic task: given an input word, predict a probability distribution over the words near it.
A virtual [one hot](one hot.md) encoding of words goes through a ‘projection layer’ to the hidden layer; these projection weights are later interpreted as the word embeddings.
So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.
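A minimal NumPy sketch of this setup, assuming a toy corpus and illustrative window/size choices (only the 300-dimensional hidden layer comes from the note above; everything else is made up for illustration):

```python
import numpy as np

# Hypothetical toy corpus; window size is an illustrative choice.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)   # vocabulary size
D = 300          # hidden-layer size = embedding dimensionality
WINDOW = 2       # how many words on each side count as "nearby"

# Slide a window across the text and collect (input word, nearby word) pairs.
pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j != i:
            pairs.append((word_to_id[center], word_to_id[corpus[j]]))

# 1-hidden-layer network: projection weights W_in (V x D) and output weights W_out (D x V).
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, D))   # rows are the word embeddings
W_out = rng.normal(scale=0.01, size=(D, V))

def predict_context_distribution(word_id):
    """Given an input word, return a softmax distribution over nearby words."""
    h = W_in[word_id]          # 'projection layer': the one-hot input just selects a row
    scores = h @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# After training, W_in[word_id] is the 300-dimensional embedding of that word.
print(predict_context_distribution(word_to_id["fox"]).shape)  # (V,)
```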
also uses [Negative Sampling](Negative Sampling.md), so each training pair updates only a few output weights instead of the full softmax over the vocabulary
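A hedged sketch of one such update, under assumptions not stated in the note (sizes, learning rate, and a uniform noise distribution are placeholders; word2vec actually samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K, LR = 10, 50, 5, 0.025   # vocab size, embedding dim, negatives per pair, learning rate
W_in = rng.normal(scale=0.01, size=(V, D))    # input-side embeddings
W_out = rng.normal(scale=0.01, size=(V, D))   # output-side vectors, one row per word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """One SGD step on a single (input word, nearby word) pair with negative sampling."""
    # One true context word plus K sampled noise words (uniform here for simplicity).
    negatives = rng.integers(0, V, size=K)
    targets = np.concatenate(([context], negatives))
    labels = np.concatenate(([1.0], np.zeros(K)))   # 1 = real neighbor, 0 = noise

    h = W_in[center]                 # embedding of the input word
    scores = W_out[targets] @ h      # only K+1 dot products, not a full softmax over V
    grad = sigmoid(scores) - labels  # gradient of the logistic loss per target

    W_in[center] -= LR * (grad @ W_out[targets])
    np.add.at(W_out, targets, -LR * np.outer(grad, h))  # handles repeated sampled ids

train_pair(center=3, context=7)   # hypothetical word ids
```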