FLASH
- Paper: "Transformer Quality in Linear Time" (Hua et al., ICML 2022)
- Motivation: self-attention cost grows quadratically with sequence length, and existing efficient-attention methods tend to sacrifice quality or practical (wall-clock) speed on long sequences
- FLASH: first designs a performant layer, the gated attention unit (GAU), then combines it with an accelerator-efficient approximation strategy, mixed chunk attention
- GAU: a gated linear unit augmented with single-head attention; gate U and values V are expanded projections of the input, queries/keys are cheap per-dimension affine transforms of one shared low-dimensional projection Z, and softmax is replaced with squared ReLU (see the first sketch after this list)
- Mixed chunk attention: split the sequence into non-overlapping chunks; exact quadratic attention within each chunk plus a linear-attention summary across chunks, so overall cost is linear in sequence length (see the second sketch after this list)
- outperforms three baselines (vanilla Transformer, Performer, Combiner) in both quality and efficiency, with the gap widening at longer context lengths
- Wiki-40B (auto-regressive language modeling)
- PG-19 (long-document language modeling)
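
A minimal NumPy sketch of a single GAU forward pass, assuming SiLU gating and squared-ReLU attention as described in the paper; the parameter names, the logit scaling, and the scalar `bias` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gau(x, Wu, Wv, Wz, Wo, gamma_q, beta_q, gamma_k, beta_k, bias=0.0):
    """Gated Attention Unit, forward pass only.

    x: (n, d) token representations.
    Wu, Wv: (d, e) expansion projections for gate U and values V.
    Wz: (d, s) shared low-dim projection; queries and keys are cheap
        per-dimension affine transforms of Z (s is small, e.g. 128).
    Wo: (e, d) output projection.
    """
    n = x.shape[0]
    u = silu(x @ Wu)             # gate, (n, e)
    v = silu(x @ Wv)             # values, (n, e)
    z = silu(x @ Wz)             # shared basis, (n, s)
    q = z * gamma_q + beta_q     # per-dim scale/offset -> queries
    k = z * gamma_k + beta_k     # per-dim scale/offset -> keys
    # squared-ReLU attention; the exact normalization is an assumption here
    a = relu(q @ k.T / np.sqrt(q.shape[-1]) + bias) ** 2 / n
    return (u * (a @ v)) @ Wo    # gate the attended values, project back

# tiny smoke test with random weights
rng = np.random.default_rng(0)
n, d, e, s = 8, 16, 32, 4
x = rng.normal(size=(n, d))
W = lambda a, b: 0.1 * rng.normal(size=(a, b))
y = gau(x, W(d, e), W(d, e), W(d, s), W(e, d),
        np.ones(s), np.zeros(s), np.ones(s), np.zeros(s))
print(y.shape)  # (8, 16)
```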
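
A companion sketch of mixed chunk attention, non-causal for brevity; in FLASH its output plays the role of `a @ v` inside the GAU above. The chunk size, squared-ReLU scoring, and normalization constants are assumptions; the causal variant would replace the single global summary with a cumulative sum over preceding chunks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mixed_chunk_attention(q_quad, k_quad, q_lin, k_lin, v, chunk=4):
    """Non-causal mixed chunk attention (sketch).

    Split the length-n sequence into non-overlapping chunks:
      - local: exact squared-ReLU attention within each chunk
        (quadratic only in the chunk size),
      - global: linear attention across chunks via one (s, e) k^T v summary.
    q_*, k_*: (n, s); v: (n, e); n must be divisible by `chunk`.
    """
    n, s = q_quad.shape
    g = n // chunk
    Qq, Kq = q_quad.reshape(g, chunk, s), k_quad.reshape(g, chunk, s)
    Ql, Kl = q_lin.reshape(g, chunk, s), k_lin.reshape(g, chunk, s)
    V = v.reshape(g, chunk, -1)

    # local part: per-chunk quadratic attention
    scores = relu(np.einsum('gcs,gds->gcd', Qq, Kq) / chunk) ** 2
    local = np.einsum('gcd,gde->gce', scores, V)

    # global part: one summary shared by all chunks -> linear in n
    summary = np.einsum('gcs,gce->se', Kl, V) / n
    glob = np.einsum('gcs,se->gce', Ql, summary)

    return (local + glob).reshape(n, -1)

# smoke test
rng = np.random.default_rng(1)
n, s, e = 16, 4, 8
out = mixed_chunk_attention(rng.normal(size=(n, s)), rng.normal(size=(n, s)),
                            rng.normal(size=(n, s)), rng.normal(size=(n, s)),
                            rng.normal(size=(n, e)))
print(out.shape)  # (16, 8)
```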