Strided Attention
- paper
- Sparse factorizations of the Attention matrix
- Reduce attention cost from O(n^2) to O(n√n)
- Recompute Attention matrices during the backward pass to save memory
- Fast Attention kernels
- Works well for data with a periodic structure, such as images or music
- For data without a periodic structure (e.g. text), the Strided pattern works less well: an element's spatial coordinates do not necessarily correlate with the positions where that element may be most relevant in the future
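
The factorization noted above can be sketched as two causal masks: one head attends to a local window of recent positions, the other attends to every `stride`-th earlier position. This is a minimal illustration, not the paper's kernels. The function name `strided_attention_masks`, the helper `count`, and the choice of a window covering the previous `stride` positions plus the current one are assumptions made for this example.

```python
def strided_attention_masks(n, stride):
    """Build two n x n causal boolean masks (lists of lists).

    local[i][j]   is True when j is within the previous `stride` positions of i (or j == i).
    strided[i][j] is True when j <= i and (i - j) is a multiple of `stride`.
    """
    local = [[False] * n for _ in range(n)]
    strided = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: a position only sees j <= i
            if i - j <= stride:
                local[i][j] = True
            if (i - j) % stride == 0:
                strided[i][j] = True
    return local, strided


def count(mask):
    """Number of attended (query, key) pairs in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, stride = 16, 4  # stride chosen near sqrt(n)
    local, strided = strided_attention_masks(n, stride)
    dense = n * (n + 1) // 2  # pairs in a full causal mask
    print("dense causal pairs:", dense)
    print("local pairs:", count(local), "strided pairs:", count(strided))
```

With `stride` near sqrt(n), each row of either mask has on the order of sqrt(n) entries, which is where the saving over dense attention comes from. Any earlier position j stays reachable from i in two steps: the strided head jumps to a position within `stride` of j, and the local head covers the rest.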