Scaled Dot Product Attention

Vaswani et al., 2017
Q is the query, K is the key, V is the value. All have the same dims.
$q_{i} = W_{q} x_{i}$, $k_{i} = W_{k} x_{i}$, $v_{i} = W_{v} x_{i}$
- $w'_{ij} = q_{i}^{T} k_{j}$
- $w_{ij} = \text{softmax}_{j}(w'_{ij})$
- $y_{i} = \sum_{j} w_{ij} v_{j}$
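A minimal NumPy sketch of the unscaled steps above. The sizes `t` and `k`, the random seed, and the function names are illustrative assumptions, and the softmax over $j$ is assumed as the step that turns $w'_{ij}$ into $w_{ij}$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Unscaled self-attention over a sequence X of shape (t, k)."""
    # Rows of X are the x_i, so X @ W.T stacks W x_i for every i.
    Q = X @ W_q.T
    K = X @ W_k.T
    V = X @ W_v.T
    raw = Q @ K.T          # raw[i, j] = q_i^T k_j, i.e. w'_ij
    weights = softmax(raw) # weights[i, j] = w_ij, each row sums to 1
    Y = weights @ V        # Y[i] = sum_j w_ij v_j
    return Y, weights

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
t, k = 4, 8                     # illustrative sequence length and embedding dim
X = rng.normal(size=(t, k))
W_q, W_k, W_v = (rng.normal(size=(k, k)) for _ in range(3))
Y, weights = self_attention(X, W_q, W_k, W_v)
```

`Y` has one output vector per input token, and row `i` of `weights` shows how much token `i` attends to every token `j`.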
Softmax is sensitive to large input values: they push it into a saturated region with tiny gradients, which hurts training of the architecture.
The average magnitude of the dot product grows with the embedding dimension $k$, so scale it back down.
- Why $\sqrt{k}$? Take a vector in $\mathbb{R}^{k}$ with every value equal to $c$
- Its Euclidean length is $\sqrt{k} c$
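A small standard-library check of both points, as a sketch: the length of a constant vector grows like $\sqrt{k}$, and softmax saturates when its inputs are large. The value `c = 0.5`, the dimensions, and the example logits are arbitrary choices for illustration.

```python
import math

def constant_vector_length(k, c):
    """Euclidean length of a vector in R^k whose entries all equal c."""
    return math.sqrt(sum(c * c for _ in range(k)))

def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

c = 0.5
for k in (4, 64, 1024):
    # The two printed numbers agree: length == sqrt(k) * c
    print(k, constant_vector_length(k, c), math.sqrt(k) * c)

logits = [1.0, 2.0, 3.0]
small = softmax(logits)                    # fairly spread out
large = softmax([10 * z for z in logits])  # nearly one-hot
print(small)
print(large)
```

With the inputs scaled up by 10, almost all of the probability mass lands on the largest entry, which is the saturation that dividing by $\sqrt{k}$ is meant to avoid.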
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
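The matrix form can be sketched in NumPy as follows. This assumes 2-D inputs with no batching, masking, or multiple heads; the function name and the example shapes are illustrative, not taken from any library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k) scaled similarities
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)  # arbitrary seed
Q = rng.normal(size=(3, 4))     # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))     # 5 keys
V = rng.normal(size=(5, 2))     # 5 values, d_v = 2
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)
```

Each row of `attn` is a probability distribution over the keys, and each row of `out` is the corresponding weighted average of the rows of `V`.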
Generalization of Soft Attention
Attention alignment score: $\alpha_{t, i} = \frac{s_{t}^{T} h_{i}}{\sqrt{n}}$, where $n$ is the dimension of the hidden states

Subhaditya's KB