FTSwish

  • ReLU + Sigmoid, shifted by a threshold T
  • \begin{equation} \text{FTSwish: } f(x) = \begin{cases} T, & \text{if } x < 0 \\ \frac{x}{1 + e^{-x}} + T, & \text{otherwise} \end{cases} \end{equation}
  • As we can see, the sparsity principle still holds: neurons that produce negative values are flattened to the constant T, so they are effectively taken out.
  • We also see that the derivative of FTSwish is smooth for positive inputs, which is what made Swish theoretically better than ReLU in terms of the loss landscape.
  • However, I must note that this function does not protect us from the dying ReLU problem: the gradients for negative inputs are zero, just as with ReLU (the sketch after this list illustrates both points).
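
To make the definition concrete, here is a minimal PyTorch sketch of FTSwish. The module name `FTSwish` and the default threshold `T = -0.20` are illustrative assumptions, not taken from any reference implementation. The short check at the end demonstrates the last two bullets: negative inputs are flattened to T and receive zero gradient.

```python
import torch
import torch.nn as nn


class FTSwish(nn.Module):
    """Flatten-T Swish: f(x) = T for x < 0, x * sigmoid(x) + T otherwise."""

    def __init__(self, T: float = -0.20):
        super().__init__()
        self.T = T  # the (negative) constant the negative half is flattened to

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Positive half: the Swish term x * sigmoid(x), shifted by T.
        # Negative half: the flat constant T.
        return torch.where(x >= 0, x * torch.sigmoid(x) + self.T,
                           torch.full_like(x, self.T))


# Check the two claims above: negative inputs are mapped to T,
# and their gradients are zero, just like with ReLU.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = FTSwish()(x)
y.sum().backward()
print(y)       # approx. [-0.2000, -0.2000, 0.1112, 1.5616]
print(x.grad)  # approx. [0.0000, 0.0000, 0.7400, 1.0908]  (zero for x < 0)
```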