Factorized Embedding Parameters

  • Factorization of these parameters is achieved by taking the word-embedding weight matrix and decomposing it into two smaller matrices. Instead of projecting the one-hot encoded vectors directly into the hidden space, they are first projected into a lower-dimensional embedding space, which is then projected up to the hidden space (Lan et al., 2019); a sketch of this two-step mapping follows the list below. By itself, this decomposition should not produce a different result, but the next point shows where the benefit comes from.
  • What actually ensures that this change reduces the number of parameters is that the authors suggest shrinking the embedding size, making it much smaller than the hidden size.
  • In BERT, the embedding size E is tied to the hidden size H, so the vocabulary/embedding matrix has shape V × H.
  • First of all, theoretically, the embedding matrix E captures context-independent information,
  • whereas the hidden representations H capture context-dependent information, so there is no reason for the two sizes to be tied together.
  • ALBERT addresses this by decomposing the embedding parameters into two smaller matrices, allowing a two-step mapping from the original one-hot word vectors to the space of the hidden state. In terms of parameter count, this no longer means O(V × H) but rather O(V × E + E × H), which brings a significant reduction when H ≫ E.
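
A minimal PyTorch sketch of the idea, assuming BERT-base-like sizes (V = 30000, H = 768) and a reduced embedding size E = 128 as an example; the layer names and sizes are illustrative, not ALBERT's actual implementation:

```python
import torch
import torch.nn as nn

# Illustrative sizes: vocabulary V, hidden size H, reduced embedding size E.
V, H, E = 30000, 768, 128

# BERT-style embedding: one-hot vectors projected directly into the hidden space.
standard = nn.Embedding(V, H)  # V * H parameters

# ALBERT-style factorized embedding: first project into a small embedding space,
# then project up to the hidden space.
factorized = nn.Sequential(
    nn.Embedding(V, E),               # V * E parameters
    nn.Linear(E, H, bias=False),      # E * H parameters
)

# Both variants map token ids to hidden-size vectors of the same shape.
token_ids = torch.randint(0, V, (2, 16))
assert standard(token_ids).shape == factorized(token_ids).shape == (2, 16, H)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # 23,040,000  (V * H)
print(count(factorized))  #  3,938,304  (V * E + E * H)
```

With these example numbers, the factorized version uses roughly 6x fewer embedding parameters, which is where most of the savings claimed above come from when H ≫ E.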