Label Smoothing

  • The dense layer is generally the last layer of the network; combined with a softmax it outputs a probability distribution
  • Assume the true label is $y$; then the truth probability distribution is $q_i = 1$ if $i = y$ and $q_i = 0$ otherwise
  • During training, minimize the negative cross entropy loss $\ell(p, q) = -\sum_{i=1}^{K} q_i \log p_i$ to make these distributions similar
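A minimal sketch of the setup above, with illustrative logits and a one-hot target (all values are examples, not from the notes):

```python
import math

def softmax(z):
    # Turn raw logits z from the dense layer into a probability distribution.
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    # Negative cross entropy loss: -sum_i q_i * log(p_i).
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

z = [2.0, 0.5, -1.0]   # example logits
q = [1.0, 0.0, 0.0]    # one-hot truth distribution, y = 0
p = softmax(z)
loss = cross_entropy(p, q)  # reduces to -log p_y for a one-hot target
```

With a one-hot target the sum collapses to a single term, which is why the loss is just $-\log p_y$.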
  • We know $p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$, so with the one-hot target the loss becomes $\ell(p, q) = -\log p_y = \log\left(\sum_{j=1}^{K} \exp(z_j)\right) - z_y$
  • The optimal solution is $z_y^* = \infty$ while keeping the other logits small enough
    • The output scores are encouraged to be dramatically distinctive
    • Leads to overfitting
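A quick numeric check (with illustrative logits) that the one-hot loss keeps decreasing as the correct-class logit grows, so the minimizer is unbounded:

```python
import math

def ce_onehot(z, y):
    # One-hot cross entropy: -log softmax(z)[y] = logsumexp(z) - z_y.
    m = max(z)
    logsum = m + math.log(sum(math.exp(v - m) for v in z))
    return logsum - z[y]

# Growing z_y lowers the loss without bound: the optimum is z_y = infinity,
# which is exactly the overfitting incentive described above.
losses = [ce_onehot([zy, 0.0, 0.0], 0) for zy in (1.0, 5.0, 10.0, 50.0)]
```

The loss approaches zero only in the limit, so gradient descent keeps pushing the correct-class logit upward forever.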
  • Instead, label smoothing changes the truth distribution to $q_i = \begin{cases} 1-\epsilon & \text{if } i = y \\ \frac{\epsilon}{K-1} & \text{otherwise} \end{cases}$
  • The optimal solution is $z_i^* = \begin{cases} \log\frac{(K-1)(1-\epsilon)}{\epsilon} + \alpha & \text{if } i = y \\ \alpha & \text{otherwise} \end{cases}$
    • where $\alpha$ is any real number
    • This gives a finite output from the last layer that generalizes well
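The smoothed optimum can be checked numerically: assuming the optimal logits are $z_y^* = \log\frac{(K-1)(1-\epsilon)}{\epsilon} + \alpha$ and $z_i^* = \alpha$ otherwise, their softmax reproduces the smoothed target exactly ($K$, $\epsilon$, $\alpha$ below are illustrative values):

```python
import math

K, eps, alpha = 10, 0.1, 0.0
gap = math.log((K - 1) * (1 - eps) / eps)   # z*_y - z*_i for i != y
z_star = [gap + alpha] + [alpha] * (K - 1)  # optimal logits, taking y = 0

# softmax of the optimal logits; alpha cancels because softmax is
# invariant to adding a constant to every logit
m = max(z_star)
exps = [math.exp(v - m) for v in z_star]
p = [e / sum(exps) for e in exps]
# p[0] should equal 1 - eps, and every other entry eps / (K - 1)
```

Because these logits are finite, training can actually reach them instead of diverging.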
  • If $\epsilon = 0$, the gap $\log\frac{(K-1)(1-\epsilon)}{\epsilon}$ between $z_y^*$ and the other logits is $\infty$
  • As $\epsilon$ increases, the gap decreases
  • If $\epsilon = \frac{K-1}{K}$, all optimal $z_i^*$ are identical
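The three claims about the gap can be verified directly (K and the ε values are illustrative):

```python
import math

def gap(eps, K):
    # Gap between the correct-class logit and the rest at the optimum.
    return math.log((K - 1) * (1 - eps) / eps)

K = 10
# The gap shrinks monotonically as epsilon grows ...
gaps = [gap(e, K) for e in (0.01, 0.1, 0.5)]
# ... and vanishes entirely at eps = (K-1)/K, where all z*_i coincide.
zero_gap = gap((K - 1) / K, K)
```

At ε = (K-1)/K the smoothed target becomes the uniform distribution, which is why every optimal logit is the same.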