FLASH

Sep 18, 2024

  • architecture


  • Paper: Transformer Quality in Linear Time (Hua et al., 2022)
  • Motivated by the vanilla Transformer's weakness on long sequences: full self-attention costs quadratic time and memory in the sequence length
  • FLASH is built from a more performant layer, the Gated Attention Unit (GAU), combined with an accelerator-efficient approximation strategy, mixed chunk attention
  • GAU: fuses a gated linear unit with a weak single-head attention, replacing both the multi-head attention and the FFN blocks of a Transformer layer (see the first sketch below)
  • Mixed chunk attention: exact quadratic attention within fixed-size chunks plus linear attention across chunks, giving linear overall complexity (see the second sketch below)
  • Outperforms three baselines, the vanilla Transformer, Performer, and Combiner, in both quality and efficiency
  • Evaluated on Wiki-40B and PG-19
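
A minimal PyTorch sketch of a GAU layer, assuming the paper's defaults (expansion e = 2d, shared head size s = 128, SiLU activations, relu² attention scores); the module and parameter names here are illustrative, not the reference implementation, and the relative position bias is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Gated Attention Unit: a gated linear unit fused with a weak
    single-head attention that scores with relu^2 instead of softmax."""
    def __init__(self, d, e=None, s=128):
        super().__init__()
        e = e or 2 * d
        self.to_u = nn.Linear(d, e)   # gate branch
        self.to_v = nn.Linear(d, e)   # value branch
        self.to_z = nn.Linear(d, s)   # shared low-dim projection for q and k
        # cheap per-dim scale/offset that turns z into queries and keys
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))
        self.to_out = nn.Linear(e, d)

    def forward(self, x):             # x: (batch, n, d)
        n = x.shape[1]
        u = F.silu(self.to_u(x))      # (b, n, e)
        v = F.silu(self.to_v(x))      # (b, n, e)
        z = F.silu(self.to_z(x))      # (b, n, s)
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # relu^2 attention normalized by sequence length (no softmax)
        a = F.relu(torch.einsum('bns,bms->bnm', q, k) / n) ** 2
        return self.to_out(u * torch.einsum('bnm,bme->bne', a, v))
```

Because the gate u multiplies the attention output, a single weak attention head suffices where a vanilla block needs multi-head attention plus an FFN.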
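And a sketch of the non-causal case of mixed chunk attention, assuming the sequence length is divisible by the chunk size; the function name and the chunk size of 256 are assumptions. Each chunk attends to itself quadratically, while one small (s × e) key-value summary per chunk carries global context at linear cost:

```python
import torch
import torch.nn.functional as F

def mixed_chunk_attention(q_quad, k_quad, q_lin, k_lin, v, chunk=256):
    """Non-causal mixed chunk attention sketch.
    q/k tensors: (batch, n, s); values v: (batch, n, e);
    n is assumed divisible by `chunk`."""
    b, n, s = q_quad.shape
    e = v.shape[-1]
    g = n // chunk
    # reshape everything into (batch, num_chunks, chunk, dim)
    qq = q_quad.view(b, g, chunk, s)
    kq = k_quad.view(b, g, chunk, s)
    ql = q_lin.view(b, g, chunk, s)
    kl = k_lin.view(b, g, chunk, s)
    vc = v.view(b, g, chunk, e)
    # quadratic relu^2 attention inside each chunk: O(n * chunk)
    local = F.relu(torch.einsum('bgns,bgms->bgnm', qq, kq) / chunk) ** 2
    v_quad = torch.einsum('bgnm,bgme->bgne', local, vc)
    # linear attention across chunks via a pooled (s, e) summary: O(n)
    kv = torch.einsum('bgms,bgme->bse', kl, vc) / n
    v_lin = torch.einsum('bgns,bse->bgne', ql, kv)
    return (v_quad + v_lin).view(b, n, e)
```

The causal case replaces the global sum with a cumulative sum over preceding chunks, which keeps the same linear complexity.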
