Label Smoothing
- The dense layer is generally the last one in the network; combined with softmax, it produces a predicted probability distribution $p$ over the $K$ classes
- Assume the true label is $y$; then the ground-truth distribution is $q_i = 1$ if $i = y$ and $0$ otherwise
- During training, we minimize the cross-entropy loss to make these two distributions similar
- With hard labels the loss reduces to $\ell(p,q) = -\log p_y = -z_y + \log\left(\sum_{i=1}^{K} \exp(z_i)\right)$, where $z_i$ are the logits from the last layer (see the first sketch after this list)
- The optimal solution is $z_y^* = \infty$ while keeping all other logits small enough, i.e. the true-class logit is pushed arbitrarily far above the rest
- The output scores are thus encouraged to be dramatically distinctive, which can lead to overfitting
- Instead, label smoothing changes the true distribution to $q_i = \begin{cases} 1-\epsilon & \text{if } i = y \\ \epsilon/(K-1) & \text{otherwise} \end{cases}$ (see the second sketch after this list)
- The optimal solution is now $z_i^* = \begin{cases} \log\left((K-1)(1-\epsilon)/\epsilon\right) + \alpha & \text{if } i = y \\ \alpha & \text{otherwise} \end{cases}$ where $\alpha$ can be any real number
- The last layer therefore has a finite output, which tends to generalize better
- If $\epsilon = 0$, the gap $\log\left((K-1)(1-\epsilon)/\epsilon\right)$ between $z_y^*$ and the other logits is $\infty$ (see the third sketch after this list)
- As ϵ increases, the gap decreases
- If $\epsilon = (K-1)/K$, all the optimal $z_i^*$ become identical
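
A minimal numpy sketch (my own illustration, not from the notes) of the hard-label cross-entropy written in the $-z_y + \log\sum_i \exp(z_i)$ form. It also shows that scaling the logits up keeps lowering the loss, which is why the optimum pushes $z_y^*$ toward infinity.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    """-log softmax(z)[y], written in the -z_y + logsumexp(z) form."""
    return -z[y] + np.log(np.sum(np.exp(z)))

z = np.array([2.0, 0.5, -1.0])   # example logits for K = 3 classes
y = 0                            # true class index

# Same value as -log(softmax(z)[y])
p = np.exp(z) / np.sum(np.exp(z))
print(cross_entropy_from_logits(z, y), -np.log(p[y]))

# Scaling the logits makes the scores more "distinctive" and keeps lowering
# the loss, so it is only minimized in the limit z_y -> infinity.
for scale in (1, 10, 100):
    print(scale, cross_entropy_from_logits(scale * z, y))
```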
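The second sketch (again an illustration under the notation above, not code from any particular library) builds the smoothed target distribution $q_i$ and evaluates the corresponding cross-entropy $-\sum_i q_i \log p_i$.

```python
import numpy as np

def smoothed_targets(y, K, eps):
    """q_i = 1 - eps for the true class, eps / (K - 1) for every other class."""
    q = np.full(K, eps / (K - 1))
    q[y] = 1.0 - eps
    return q

def smoothed_cross_entropy(z, y, eps):
    """Cross-entropy between the smoothed targets and softmax(z)."""
    log_p = z - np.log(np.sum(np.exp(z)))   # log softmax of the logits
    q = smoothed_targets(y, len(z), eps)
    return -np.sum(q * log_p)

z = np.array([2.0, 0.5, -1.0])
print(smoothed_targets(0, K=3, eps=0.1))    # [0.9, 0.05, 0.05]
print(smoothed_cross_entropy(z, 0, eps=0.1))
```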
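As a quick numeric check (illustrative only), the third sketch evaluates the gap $\log\left((K-1)(1-\epsilon)/\epsilon\right)$ for a few values of $\epsilon$ with $K = 10$: it is infinite at $\epsilon = 0$, shrinks as $\epsilon$ grows, and reaches $0$ at $\epsilon = (K-1)/K$.

```python
import numpy as np

K = 10
for eps in (0.0, 0.01, 0.1, 0.5, (K - 1) / K):
    # Gap between the optimal true-class logit and the others.
    gap = np.inf if eps == 0 else np.log((K - 1) * (1 - eps) / eps)
    print(f"eps = {eps:.2f}  gap = {gap:.3f}")
```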