NADAM

  • extension of Adam that applies the Nesterov momentum technique to further accelerate convergence
  • combines the momentum of Nesterov’s method with the adaptive learning rates of Adam
  • here beta1 and beta2 are the two exponential-decay hyperparameters, m_t and v_t are moving averages of the gradient and the squared gradient, g_t is the gradient at time t, and learning_rate and epsilon have the same meaning as in Adam (see the update sketch below)
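
A minimal sketch of a single NAdam update step, assuming NumPy arrays for the parameters and gradient; the function and variable names here are illustrative, not a library API:

```python
import numpy as np

def nadam_step(theta, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One NAdam update: Adam's adaptive step with a Nesterov-style look-ahead on the momentum."""
    # Moving averages of the gradient and squared gradient (same as Adam).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias-corrected estimates for step t (t starts at 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov look-ahead: blend the corrected momentum with the current (corrected) gradient.
    m_nesterov = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)
    # Parameter update with Adam-style per-coordinate scaling.
    theta = theta - lr * m_nesterov / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Compared with plain Adam, the only change is that the numerator of the update uses the blended momentum term instead of m_hat alone, which is what gives NAdam its look-ahead behaviour.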