Label Smoothing

  • The dense layer is generally the last layer of the network; combined with a softmax it outputs a probability distribution
  • Assume the true label is $y$; then the truth probability distribution is $q_i = 1$ if $i = y$ and $q_i = 0$ otherwise
  • During training, minimize the negative cross entropy loss $\ell(p, q) = -\sum_{i=1}^{K} q_i \log p_i$ to make these distributions similar
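A minimal sketch of the setup above, with illustrative logits and a one-hot target (all values are examples, not from the notes):

```python
import math

def softmax(z):
    # Turn raw logits z from the dense layer into a probability distribution.
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    # Negative cross entropy loss: -sum_i q_i * log(p_i).
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

z = [2.0, 0.5, -1.0]   # example logits
q = [1.0, 0.0, 0.0]    # one-hot truth distribution, y = 0
p = softmax(z)
loss = cross_entropy(p, q)  # reduces to -log p_y for a one-hot target
```

With a one-hot target the sum collapses to a single term, which is why the loss is just $-\log p_y$.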
  • We know $p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$, so with the one-hot target the loss becomes $\ell(p, q) = -\log p_y = \log\left(\sum_{j=1}^{K} \exp(z_j)\right) - z_y$
  • The optimal solution is $z_y^* = \infty$ while keeping the other logits small enough
    • The output scores are encouraged to be dramatically distinctive
    • Leads to overfitting
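A quick numeric check (with illustrative logits) that the one-hot loss keeps decreasing as the correct-class logit grows, so the minimizer is unbounded:

```python
import math

def ce_onehot(z, y):
    # One-hot cross entropy: -log softmax(z)[y] = logsumexp(z) - z_y.
    m = max(z)
    logsum = m + math.log(sum(math.exp(v - m) for v in z))
    return logsum - z[y]

# Growing z_y lowers the loss without bound: the optimum is z_y = infinity,
# which is exactly the overfitting incentive described above.
losses = [ce_onehot([zy, 0.0, 0.0], 0) for zy in (1.0, 5.0, 10.0, 50.0)]
```

The loss approaches zero only in the limit, so gradient descent keeps pushing the correct-class logit upward forever.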
  • Instead, label smoothing changes the truth distribution to $q_i = \begin{cases} 1-\epsilon & \text{if } i = y \\ \frac{\epsilon}{K-1} & \text{otherwise} \end{cases}$
  • The optimal solution is $z_i^* = \begin{cases} \log\frac{(K-1)(1-\epsilon)}{\epsilon} + \alpha & \text{if } i = y \\ \alpha & \text{otherwise} \end{cases}$
    • where $\alpha$ is any real number
    • This gives a finite output from the last layer that generalizes well
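The smoothed optimum can be checked numerically: assuming the optimal logits are $z_y^* = \log\frac{(K-1)(1-\epsilon)}{\epsilon} + \alpha$ and $z_i^* = \alpha$ otherwise, their softmax reproduces the smoothed target exactly ($K$, $\epsilon$, $\alpha$ below are illustrative values):

```python
import math

K, eps, alpha = 10, 0.1, 0.0
gap = math.log((K - 1) * (1 - eps) / eps)   # z*_y - z*_i for i != y
z_star = [gap + alpha] + [alpha] * (K - 1)  # optimal logits, taking y = 0

# softmax of the optimal logits; alpha cancels because softmax is
# invariant to adding a constant to every logit
m = max(z_star)
exps = [math.exp(v - m) for v in z_star]
p = [e / sum(exps) for e in exps]
# p[0] should equal 1 - eps, and every other entry eps / (K - 1)
```

Because these logits are finite, training can actually reach them instead of diverging.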
  • If $\epsilon = 0$, the gap $\log\frac{(K-1)(1-\epsilon)}{\epsilon}$ between $z_y^*$ and the other logits is $\infty$
  • As $\epsilon$ increases, the gap decreases
  • If $\epsilon = \frac{K-1}{K}$, all optimal $z_i^*$ are identical
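The three claims about the gap can be verified directly (K and the ε values are illustrative):

```python
import math

def gap(eps, K):
    # Gap between the correct-class logit and the rest at the optimum.
    return math.log((K - 1) * (1 - eps) / eps)

K = 10
# The gap shrinks monotonically as epsilon grows ...
gaps = [gap(e, K) for e in (0.01, 0.1, 0.5)]
# ... and vanishes entirely at eps = (K-1)/K, where all z*_i coincide.
zero_gap = gap((K - 1) / K, K)
```

At ε = (K-1)/K the smoothed target becomes the uniform distribution, which is why every optimal logit is the same.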