Scaled Dot Product Attention

  • Vaswani et al., 2017
  • Q is the query, K is the key, V is the value. All share the same embedding dimension k
  • Attention(Q, K, V) = softmax(QKᵀ / √k) V
  • Softmax is sensitive to large values, which is bad for the architecture: large inputs saturate softmax, so gradients vanish
  • The average magnitude of the dot product grows with the embedding dimension k, so scale it back down by √k
    • Example: a vector in ℝᵏ with all values equal to c
    • Its Euclidean length is √(c² · k) = c√k, so dot products grow on the order of √k
  • Scaled dot-product attention is a generalization of soft attention
  • The entries of QKᵀ are the attention alignment scores between each query and each key
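
A minimal numpy sketch of the notes above (function name, shapes, and the random-vector demo are my own illustration, not from the source):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(k)) V."""
    k = Q.shape[-1]                            # embedding dimension
    scores = Q @ K.T / np.sqrt(k)              # alignment scores, scaled back by sqrt(k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # each output row is a convex mix of V's rows

# Length of an all-c vector in R^k is c * sqrt(k), as in the notes
c, k = 2.0, 64
assert np.isclose(np.linalg.norm(np.full(k, c)), c * np.sqrt(k))
```

Without the 1/√k scaling, the alignment scores for large k push softmax into its saturated region, which is exactly the gradient problem the notes mention.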