Fixed Factorization Attention

  • Paper: Generating Long Sequences with Sparse Transformers (Child et al., 2019)
  • Specific cells summarize previous locations and propagate that information to all future cells (see the mask sketch after this list).
  • Part of the Sparse Transformer architecture
  • A fixed attention pattern with $c = 1$ limits expressivity significantly, since many representations in the network are used for only one block while a small number of locations are used by all blocks.
  • Choosing $c \in \{8, 16, 32\}$ for typical values of $\ell \in \{128, 256\}$ performs well, although it increases the computational cost of the method by a factor of $c$ compared to Strided Attention.
  • When using multiple heads, having them attend to distinct sub-blocks of length $c$ within the block of size $\ell$ was preferable to having them attend to the same sub-block.
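
A minimal NumPy sketch of the two fixed patterns, assuming the paper's definitions $A^{(1)}_i = \{j : \lfloor j/\ell \rfloor = \lfloor i/\ell \rfloor\}$ and $A^{(2)}_i = \{j : j \bmod \ell \in \{\ell - c, \ldots, \ell - 1\}\}$ together with a causal constraint. The function name `fixed_attention_masks` and the dense boolean-mask formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fixed_attention_masks(n, ell, c):
    """Boolean masks for the two fixed factorized attention patterns.

    n   : sequence length
    ell : block size (l in the paper)
    c   : number of "summary" positions at the end of each block

    Returns (a1, a2), each of shape (n, n), where a[i, j] is True when
    output position i may attend to input position j.
    (Hypothetical helper for illustration only.)
    """
    i = np.arange(n)[:, None]   # query positions (column vector)
    j = np.arange(n)[None, :]   # key positions (row vector)
    causal = j <= i             # autoregressive constraint

    # A^(1): attend to positions within the same block of size ell.
    a1 = (j // ell == i // ell) & causal

    # A^(2): attend to the last c positions of every block; these
    # "summary" cells propagate information to all future positions.
    a2 = (j % ell >= ell - c) & causal

    return a1, a2

if __name__ == "__main__":
    a1, a2 = fixed_attention_masks(n=16, ell=4, c=1)
    # Print the combined pattern: '#' where either pattern may attend.
    for row in (a1 | a2):
        print("".join("#" if x else "." for x in row))
```

In practice each head would apply one of the two patterns (or attend to a distinct sub-block of length $c$, per the last bullet) with a sparse kernel rather than materializing dense $n \times n$ masks; the dense masks here are only meant to visualize the connectivity.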