FLASH

Sep 18, 2024

  • architecture


  • Paper: Transformer Quality in Linear Time (Hua et al., 2022)
  • Motivated by the vanilla Transformer's weakness on long sequences: full self-attention costs quadratic time and memory in the sequence length
  • FLASH is built from a more performant layer, the Gated Attention Unit (GAU), combined with an accelerator-efficient approximation strategy, mixed chunk attention
  • GAU: fuses a gated linear unit with a weak single-head attention, replacing both the multi-head attention and the FFN blocks of a Transformer layer (see the first sketch below)
  • Mixed chunk attention: exact quadratic attention within fixed-size chunks plus linear attention across chunks, giving linear overall complexity (see the second sketch below)
  • Outperforms three baselines, the vanilla Transformer, Performer, and Combiner, in both quality and efficiency
  • Evaluated on Wiki-40B and PG-19
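
A minimal PyTorch sketch of a GAU layer, assuming the paper's defaults (expansion e = 2d, shared head size s = 128, SiLU activations, relu² attention scores); the module and parameter names here are illustrative, not the reference implementation, and the relative position bias is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Gated Attention Unit: a gated linear unit fused with a weak
    single-head attention that scores with relu^2 instead of softmax."""
    def __init__(self, d, e=None, s=128):
        super().__init__()
        e = e or 2 * d
        self.to_u = nn.Linear(d, e)   # gate branch
        self.to_v = nn.Linear(d, e)   # value branch
        self.to_z = nn.Linear(d, s)   # shared low-dim projection for q and k
        # cheap per-dim scale/offset that turns z into queries and keys
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))
        self.to_out = nn.Linear(e, d)

    def forward(self, x):             # x: (batch, n, d)
        n = x.shape[1]
        u = F.silu(self.to_u(x))      # (b, n, e)
        v = F.silu(self.to_v(x))      # (b, n, e)
        z = F.silu(self.to_z(x))      # (b, n, s)
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # relu^2 attention normalized by sequence length (no softmax)
        a = F.relu(torch.einsum('bns,bms->bnm', q, k) / n) ** 2
        return self.to_out(u * torch.einsum('bnm,bme->bne', a, v))
```

Because the gate u multiplies the attention output, a single weak attention head suffices where a vanilla block needs multi-head attention plus an FFN.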
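And a sketch of the non-causal case of mixed chunk attention, assuming the sequence length is divisible by the chunk size; the function name and the chunk size of 256 are assumptions. Each chunk attends to itself quadratically, while one small (s × e) key-value summary per chunk carries global context at linear cost:

```python
import torch
import torch.nn.functional as F

def mixed_chunk_attention(q_quad, k_quad, q_lin, k_lin, v, chunk=256):
    """Non-causal mixed chunk attention sketch.
    q/k tensors: (batch, n, s); values v: (batch, n, e);
    n is assumed divisible by `chunk`."""
    b, n, s = q_quad.shape
    e = v.shape[-1]
    g = n // chunk
    # reshape everything into (batch, num_chunks, chunk, dim)
    qq = q_quad.view(b, g, chunk, s)
    kq = k_quad.view(b, g, chunk, s)
    ql = q_lin.view(b, g, chunk, s)
    kl = k_lin.view(b, g, chunk, s)
    vc = v.view(b, g, chunk, e)
    # quadratic relu^2 attention inside each chunk: O(n * chunk)
    local = F.relu(torch.einsum('bgns,bgms->bgnm', qq, kq) / chunk) ** 2
    v_quad = torch.einsum('bgnm,bgme->bgne', local, vc)
    # linear attention across chunks via a pooled (s, e) summary: O(n)
    kv = torch.einsum('bgms,bgme->bse', kl, vc) / n
    v_lin = torch.einsum('bgns,bse->bgne', ql, kv)
    return (v_quad + v_lin).view(b, n, e)
```

The causal case replaces the global sum with a cumulative sum over preceding chunks, which keeps the same linear complexity.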
