NADAM

  • extension of Adam that applies the Nesterov momentum technique to further accelerate convergence
  • combines the momentum of Nesterov’s method with the adaptive learning rates of Adam
  • here beta1 and beta2 are the two exponential-decay hyperparameters, m_t and v_t are moving averages of the gradient and the squared gradient, g_t is the gradient at time t, and learning_rate and epsilon have the same meaning as in Adam (see the update sketch below)
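
A minimal sketch of a single NAdam update step, assuming NumPy arrays for the parameters and gradient; the function and variable names here are illustrative, not a library API:

```python
import numpy as np

def nadam_step(theta, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One NAdam update: Adam's adaptive step with a Nesterov-style look-ahead on the momentum."""
    # Moving averages of the gradient and squared gradient (same as Adam).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias-corrected estimates for step t (t starts at 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov look-ahead: blend the corrected momentum with the current (corrected) gradient.
    m_nesterov = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)
    # Parameter update with Adam-style per-coordinate scaling.
    theta = theta - lr * m_nesterov / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Compared with plain Adam, the only change is that the numerator of the update uses the blended momentum term instead of m_hat alone, which is what gives NAdam its look-ahead behaviour.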