Vanishing Gradient

  • Deltas become smaller initially. using [Sigmoid] [ill conditioning](Sigmoid] [ill conditioning.md)
  • Saturation and prevent Backprop
  • Weight matrices are usually initialized with random values
    • gradient magnitueds decay exponentially max eigenvalue