NAdam is an extension of Adam that uses the Nesterov momentum technique to further accelerate convergence. It combines the momentum of Nesterov's method with the adaptive learning rates of Adam:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\text{parameter} = \text{parameter} - \text{learning\_rate} \cdot \frac{\hat{m}_t + (1 - \beta_1) \cdot g_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\beta_1$ and $\beta_2$ are two hyperparameters, $m_t$ and $v_t$ are moving averages of the gradient and the squared gradient, $g_t$ is the gradient at time $t$, and learning_rate and $\epsilon$ are the same as before.
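As a minimal sketch of these update equations (not a reference implementation; the function name, default hyperparameter values, and the use of NumPy arrays are illustrative assumptions), one NAdam step for a single parameter tensor could look like this:

```python
import numpy as np

def nadam_update(param, grad, m, v, t,
                 learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One illustrative NAdam step; param, grad, m, v are NumPy arrays, t starts at 1."""
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style look-ahead: mix the current gradient into the corrected momentum
    param = param - learning_rate * (m_hat + (1 - beta1) * grad) / (np.sqrt(v_hat) + epsilon)
    return param, m, v

# Example usage on a toy quadratic loss 0.5 * w**2 (gradient is w itself)
w = np.array([1.0])
m_state, v_state = np.zeros_like(w), np.zeros_like(w)
for step in range(1, 101):
    w, m_state, v_state = nadam_update(w, w, m_state, v_state, step)
```

The extra $(1 - \beta_1) \cdot g_t$ term in the numerator is what distinguishes this update from plain Adam: it partially applies the current gradient on top of the corrected momentum, mimicking Nesterov's look-ahead step.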