Long Short-Term Memory (LSTM)

  • Smaller chance of exploding or vanishing gradients
  • Better ability to model long term dependencies
  • Gated connections
  • Gates that learn to forget some aspects of the state and remember others
  • Splitting the state into two parts: a hidden state for output prediction and a cell state for feature learning
  • Ultimately, LSTMs still cannot handle very long sequences, which motivated the Transformer
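In practice one rarely writes the cell by hand; below is a minimal usage sketch with PyTorch's built-in nn.LSTM (all sizes here are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
batch, seq_len, input_size, hidden_size = 4, 50, 16, 32

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)  # dummy input sequences

# output holds h_t for every timestep; (h_n, c_n) are the final hidden and cell states
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([4, 50, 32])
print(h_n.shape)     # torch.Size([1, 4, 32]) -- single layer, so leading dim is 1
```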

The Math

  • Gates (each applied via a component-wise product)
    • Forget
      • How much of the previous cell state is kept: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
    • Input
      • How much of the proposal is added to the state: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
    • Output
      • How much of the cell state is exposed as the hidden state: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
  • Cell state
    • Models cross-timestep dependencies
      • Cell state proposal: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$
      • Final cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  • Hidden state
    • Used to predict the output: $h_t = o_t \odot \tanh(c_t)$
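These equations map one-to-one onto code. A minimal NumPy sketch of a single LSTM step; the stacked-weight layout and all sizes are illustrative assumptions rather than anything the notes prescribe:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W maps [h_{t-1}; x_t] to the four gate pre-activations.

    W: shape (4 * hidden, hidden + input), b: shape (4 * hidden,).
    Stacking the forget/input/output/proposal weights into one matrix is
    an implementation convenience (assumption), not part of the math above.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b

    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate
    o_t = sigmoid(z[2 * hidden:3 * hidden])      # output gate
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])  # cell state proposal

    c_t = f_t * c_prev + i_t * c_tilde           # final cell state (component-wise)
    h_t = o_t * np.tanh(c_t)                     # hidden state, used for the output
    return h_t, c_t

# Usage with illustrative sizes
rng = np.random.default_rng(0)
input_size, hidden = 16, 32
W = rng.normal(size=(4 * hidden, hidden + input_size)) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
x = rng.normal(size=input_size)
h, c = lstm_step(x, h, c, W, b)
```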