Gated Attention Unit (GAU): a generalization of the Gated Linear Unit (GLU)
replaces multi-head attention with a weaker single-head attention at minimal quality loss, because the gating compensates for the simpler attention; the result approximates multi-head attention better and more cheaply than many other efficient-attention methods
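
To make the relation to GLU concrete, here is a minimal NumPy sketch of one way such a unit can be written. It assumes the formulation O = (U ⊙ (A V)) W_o, where A is a single-head attention matrix built from a small shared projection using a squared-ReLU score instead of softmax. All names (`init_params`, `relu2_attention`, `glu`, `gau`) and the dimensions `d`, `e`, `s` are illustrative assumptions, not taken from any particular library, and the sketch leaves out normalization, position bias, masking, and the residual connection.

```python
import numpy as np


def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def init_params(d, e, s, seed=0):
    """Random weights for the sketch (hypothetical layout).

    d: model width, e: expanded (gated) width, s: small attention width.
    """
    rng = np.random.default_rng(seed)
    return {
        "w_u": rng.normal(0.0, d ** -0.5, (d, e)),  # gate projection
        "w_v": rng.normal(0.0, d ** -0.5, (d, e)),  # value projection
        "w_z": rng.normal(0.0, d ** -0.5, (d, s)),  # shared base for q and k
        "w_o": rng.normal(0.0, e ** -0.5, (e, d)),  # output projection
        "gamma": rng.normal(0.0, 1.0, (2, s)),      # per-dim scales for q, k
        "beta": np.zeros((2, s)),                   # per-dim offsets for q, k
    }


def relu2_attention(x, p):
    """Single-head attention weights using relu(.)**2 instead of softmax."""
    n = x.shape[0]
    z = silu(x @ p["w_z"])                    # (n, s) shared representation
    q = z * p["gamma"][0] + p["beta"][0]      # cheap scale-and-offset query
    k = z * p["gamma"][1] + p["beta"][1]      # cheap scale-and-offset key
    return np.maximum(q @ k.T / n, 0.0) ** 2  # (n, n), non-negative


def glu(x, p):
    """Gated linear unit: O = (U * V) W_o."""
    u = silu(x @ p["w_u"])
    v = silu(x @ p["w_v"])
    return (u * v) @ p["w_o"]


def gau(x, p, attn=None):
    """Gated attention unit: O = (U * (A V)) W_o.

    Passing attn=identity recovers the GLU above, which is the sense in
    which this unit generalizes GLU.
    """
    u = silu(x @ p["w_u"])
    v = silu(x @ p["w_v"])
    a = relu2_attention(x, p) if attn is None else attn
    return (u * (a @ v)) @ p["w_o"]


x = np.random.default_rng(1).normal(size=(6, 8))  # 6 tokens, width 8
params = init_params(d=8, e=16, s=4)
out = gau(x, params)                              # shape (6, 8)
```

In this sketch the attention itself is deliberately simple (one head, a small width `s`, no softmax); the elementwise gate `u` is what lets the unit modulate the attended values per token and per channel. Setting the attention matrix to the identity makes `gau` coincide with `glu`.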