Mixed Chunk Attention

  • An efficient linear-approximation method that combines the benefits of partial and linear attention mechanisms; it is accelerator-friendly and highly competitive in quality.
  • The method operates on chunks of tokens and combines local (within-chunk) and global (between-chunk) attention spans.
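
The two spans above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes exact causal softmax attention inside each chunk (the "local" span), kernelized linear attention over all earlier chunks via a running state (the "global" span), the feature map phi(x) = elu(x) + 1, and a simple additive mix of the two parts; all of these choices are assumptions.

```python
import numpy as np

def mixed_chunk_attention(q, k, v, chunk_size):
    """Sketch of mixed chunk attention (illustrative, single head).

    Local span:  exact causal softmax attention within each chunk.
    Global span: linear attention over all previous chunks, carried
                 as a running (key-feature, value) state so cost stays
                 linear in sequence length.
    """
    T, d = q.shape
    assert T % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    # Non-negative feature map: phi(x) = elu(x) + 1 (a common linear-attention choice)
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))

    out = np.zeros_like(v)
    S = np.zeros((d, v.shape[1]))   # running sum of phi(k)^T v over past chunks
    z = np.zeros(d)                 # running sum of phi(k) for normalization

    for start in range(0, T, chunk_size):
        sl = slice(start, start + chunk_size)
        qc, kc, vc = q[sl], k[sl], v[sl]

        # Local: causal softmax attention restricted to this chunk
        scores = qc @ kc.T / np.sqrt(d)
        mask = np.triu(np.ones((chunk_size, chunk_size), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        local = weights @ vc

        # Global: linear attention over all earlier chunks via the running state
        qf = phi(qc)
        denom = qf @ z + 1e-6
        global_part = (qf @ S) / denom[:, None]

        # Combine spans (the real combination scheme is an assumption here)
        out[sl] = local + global_part

        # Fold this chunk's keys/values into the global state for later chunks
        kf = phi(kc)
        S += kf.T @ vc
        z += kf.sum(axis=0)
    return out
```

Because the global state only grows between chunks, each token attends exactly (softmax) to its own chunk and approximately (linearly) to everything before it, which is what makes the scheme both linear-cost and hardware-friendly: the per-chunk work is dense matrix multiplies of fixed size.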