Attention Alignment

  • Given an input sequence $\mathbf{x} = (x_1, \dots, x_n)$ and an output sequence $\mathbf{y} = (y_1, \dots, y_m)$
    • The encoder is a bidirectional recurrent network with a forward hidden state $\overrightarrow{h}_i$ and a backward one $\overleftarrow{h}_i$
    • Concatenating them gives the annotation $h_i$, which represents both the preceding and following words
      • $h_i = [\overrightarrow{h}_i^\top; \overleftarrow{h}_i^\top]^\top$, $i = 1, \dots, n$
      • The decoder has hidden state $s_t = f(s_{t-1}, y_{t-1}, c_t)$ for the output word at position $t$, for $t = 1, \dots, m$
        • The context vector $c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$ is a sum of the hidden states of the input sequence, weighted by alignment scores
        • How well the two words $y_t$ and $x_i$ are aligned is given by $\mathrm{score}(s_{t-1}, h_i) = v_a^\top \tanh\!\left(W_a [s_{t-1}; h_i]\right)$
        • Taking a softmax over these scores gives the alignment weights: $\alpha_{t,i} = \dfrac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{i'=1}^{n} \exp(\mathrm{score}(s_{t-1}, h_{i'}))}$
  • $v_a$ and $W_a$ are the learned attention parameters
  • $h_i$ is the hidden state of the encoder
  • $s_t$ is the hidden state of the decoder
  • The matrix of alignment scores $\alpha_{t,i}$ shows how strongly each source word $x_i$ aligns with each target word $y_t$
    • Final scores are calculated with a softmax
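The steps above can be sketched numerically. This is a minimal NumPy illustration of one decoder step of additive (Bahdanau-style) attention, assuming randomly initialised parameters and made-up dimensions; in a real model $W_a$ and $v_a$ would be learned and $h$, $s_{t-1}$ would come from the encoder and decoder RNNs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: n source words, encoder/decoder/attention sizes.
n, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4
rng = np.random.default_rng(0)

# Encoder annotations h_i (stand-ins for the concatenated fwd/bwd states)
h = rng.normal(size=(n, enc_dim))
# Previous decoder hidden state s_{t-1}
s_prev = rng.normal(size=(dec_dim,))

# Learned attention parameters (randomly initialised here)
W_a = rng.normal(size=(attn_dim, dec_dim + enc_dim))
v_a = rng.normal(size=(attn_dim,))

# Alignment scores: score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i])
scores = np.array(
    [v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in h]
)

# Softmax turns scores into weights alpha_{t,i} that sum to 1
alpha = softmax(scores)

# Context vector c_t: weighted sum of the encoder annotations
c_t = alpha @ h
```

The weights `alpha` form one row of the alignment matrix; stacking the rows over all decoder steps $t$ gives the full source-target alignment matrix.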