Factorization of these parameters is achieved by taking the matrix representing the word-embedding weights and decomposing it into two smaller matrices. Instead of projecting the one-hot encoded vectors directly onto the hidden space of size H, they are first projected onto a lower-dimensional embedding space of size E, which is then projected up to the hidden space (Lan et al., 2019). By itself, this decomposition shouldn't change the result much, but wait for the next step.
What actually ensures that this change reduces the number of parameters is that the authors also shrink the embedding matrix: the embedding size E is chosen much smaller than the hidden size H.
In BERT, the embedding size E is tied to the hidden size H, so the vocabulary/embedding matrix has shape V x H.
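For a sense of scale (assuming BERT-base-like values of V ≈ 30,000 WordPiece tokens and H = 768), this single matrix already accounts for roughly 30,000 × 768 ≈ 23M parameters.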
First of all, theoretically, the embedding matrix captures context-independent information, whereas the hidden representations capture context-dependent information, so there is no real reason to force both to have the same dimensionality.
ALBERT solves this issue by decomposing the embedding parameters into two smaller matrices, allowing a two-step mapping between the original one-hot word vectors and the hidden space. In terms of parameter count, the embeddings no longer cost O(V×H) but rather O(V×E + E×H), which brings a significant reduction when H >> E.
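To make this concrete, here is a minimal PyTorch sketch of the two-step mapping (not the actual ALBERT implementation), using illustrative sizes V = 30,000, E = 128 and H = 768:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, roughly BERT-base / ALBERT-base scale)
V, H, E = 30_000, 768, 128   # vocabulary size, hidden size, embedding size

# BERT-style embedding: one-hot token vectors go straight into the hidden space.
bert_embedding = nn.Embedding(V, H)          # V x H parameters

# ALBERT-style factorized embedding: first map tokens into a small embedding
# space of size E, then project that embedding up to the hidden space of size H.
albert_embedding = nn.Sequential(
    nn.Embedding(V, E),                      # V x E parameters
    nn.Linear(E, H, bias=False),             # E x H parameters
)

token_ids = torch.randint(0, V, (2, 16))     # dummy batch of token ids
# Both variants produce representations of the same shape: (2, 16, H)
assert bert_embedding(token_ids).shape == albert_embedding(token_ids).shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"BERT-style:   {count(bert_embedding):,} parameters")    # 23,040,000
print(f"ALBERT-style: {count(albert_embedding):,} parameters")  #  3,938,304
```

With these sizes the factorized version uses about 3.9M embedding parameters instead of roughly 23M, matching the O(V×E + E×H) vs. O(V×H) comparison above.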