Multi Head Attention

ZihangDai et al., 2019
which computes self-Attention over the inputs, then adds back the residual and layer normalizes everything. The Attention head can be split into multiple segments, hence the name multi-head
Multiple Attention instances, each focusing on a different part of the input
Words can mean different things in context
- If using Self Attention, then this just gets summed up. Which is not very nice
- Several Attention heads → different output vectors
- Concatenate them and pass through a linear transform → dimension back to k
$M u lt i He a d (Q, K, V) = C o n c a t (h e a d_{1}, h e a d_{2}, \dots ., h e a d_{h}) W^{O}$
- $h e a d_{i} = A tt e n t i o n (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$
W is learnable projections for Attention params
To improve efficiency
- Cut the incoming vector into chunks → no of Attention heads

class MultiHeadAttentionNew(nn.Module):
    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()
        self.n_head = n_head
        
        self.w_qs = nn.Linear(d_model, n_head * d_k)
        self.w_ks = nn.Linear(d_model, n_head * d_k)
        self.w_vs = nn.Linear(d_model, n_head * d_v)
        
        nn.init.normal_(self.w_qs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k)))
        nn.init.normal_(self.w_ks.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k)))
        nn.init.normal_(self.w_vs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_v)))
        
        self.fc = nn.Linear(n_head * d_v, d_model)
        nn.init.xavier_normal_(self.fc.weight)
        self.dropout = nn.Dropout(p=dropout)
        self.layer_norm = nn.LayerNorm(d_model)
 
    def forward(self, q, k, v, mask=None):
        residual = q
        q = rearrange(self.w_qs(q), 'b l (head k) -> head b l k', head=self.n_head)
        k = rearrange(self.w_ks(k), 'b t (head k) -> head b t k', head=self.n_head)
        v = rearrange(self.w_vs(v), 'b t (head v) -> head b t v', head=self.n_head)
        attn = torch.einsum('hblk,hbtk->hblt', [q, k]) / np.sqrt(q.shape[-1])
        if mask is not None:
            attn = attn.masked_fill(mask[None], -np.inf)
        attn = torch.softmax(attn, dim=3)
        output = torch.einsum('hblt,hbtv->hblv', [attn, v])
        output = rearrange(output, 'head b l v -> b l (head v)')
        output = self.dropout(self.fc(output))
        output = self.layer_norm(output + residual)
        return output, attn

Subhaditya's KB

Multi Head Attention

Multi Head Attention

Graph View

Backlinks