Amsgrad

Modified Adam

By analyzing the proof of convergence for the Adam optimizer, the authors of Amsgrad spotted a mistake in the update rule that could cause the algorithm to converge to a sub-optimal point.

Update rule of Adam

avg_grads = beta1 * avg_grads + (1-beta1) * w.grad 
avg_squared = beta2 * (avg_squared) + (1-beta2) * (w.grad ** 2) 
w = w - lr * avg_grads / sqrt(avg_squared)

We’ve skipped the bias correction (useful at the beginning of training) to focus on the important point.
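For reference, here is a minimal sketch of the same update with the bias correction put back in. It works on a single scalar weight in plain Python; the function name adam_step, the state dictionary, and the eps term are assumptions made for illustration, not code from the original note.

```python
import math

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight, with bias correction (illustrative sketch)."""
    state["t"] += 1
    # Exponential moving averages of the gradient and of its square.
    state["avg_grads"] = beta1 * state["avg_grads"] + (1 - beta1) * grad
    state["avg_squared"] = beta2 * state["avg_squared"] + (1 - beta2) * grad ** 2
    # Both averages start at zero, so early values are biased toward zero;
    # dividing by (1 - beta ** t) compensates for that.
    m_hat = state["avg_grads"] / (1 - beta1 ** state["t"])
    v_hat = state["avg_squared"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "avg_grads": 0.0, "avg_squared": 0.0}
w_next = adam_step(1.0, grad=2.0, state=state, lr=0.1)
```

With the correction, the very first step has a magnitude close to lr whatever the scale of the gradient, which is why it matters mostly at the beginning of training.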
The error in the proof of Adam the authors spotted is that it requires the quantity

lr / sqrt(avg_squared)

which is the step we take in the direction of our average gradients, to be decreasing over training.
Since the learning rate is often taken constant or decreasing (except for crazy people like us trying to obtain super-convergence), the fix the authors proposed was to force the avg_squared quantity to be increasing by adding another variable that keeps track of its running maximum.
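To make the issue concrete, the sketch below tracks lr / sqrt(avg_squared) on a made-up gradient sequence (a few large gradients followed by small ones) and compares it with the same quantity computed from the running maximum. The sequence, the hyperparameters, and the helper name effective_steps are illustrative assumptions.

```python
import math

def effective_steps(grads, lr=0.1, beta2=0.9, eps=1e-8):
    """Return lr / sqrt(.) per step, for plain avg_squared and for its running maximum."""
    avg_squared = 0.0
    max_squared = 0.0
    adam_steps, amsgrad_steps = [], []
    for g in grads:
        avg_squared = beta2 * avg_squared + (1 - beta2) * g ** 2
        # The extra variable: never lets the denominator shrink.
        max_squared = max(max_squared, avg_squared)
        adam_steps.append(lr / (math.sqrt(avg_squared) + eps))
        amsgrad_steps.append(lr / (math.sqrt(max_squared) + eps))
    return adam_steps, amsgrad_steps

# Hypothetical gradients: three large values, then twenty small ones.
grads = [10.0] * 3 + [0.1] * 20
adam_steps, amsgrad_steps = effective_steps(grads)
```

Once the gradients become small, avg_squared decays and the plain Adam step grows again, while the step computed from the maximum never increases.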
Implementing amsgrad
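This changes the weight update code from the previous section: a new variable stores the running maximum of avg_squared, and the update divides by the square root of that maximum instead of sqrt(avg_squared).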
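A minimal sketch of that change, assuming a single scalar weight and plain Python rather than the tensor code the note refers to. The names amsgrad_step, state, and max_squared, the eps term, and the toy loss w ** 2 are all assumptions for illustration. As in the Adam rule above, the bias correction is left out.

```python
import math

def amsgrad_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Amsgrad-style update on a scalar weight, without bias correction (sketch)."""
    state["avg_grads"] = beta1 * state["avg_grads"] + (1 - beta1) * grad
    state["avg_squared"] = beta2 * state["avg_squared"] + (1 - beta2) * grad ** 2
    # The only difference from Adam: keep the largest avg_squared seen so far
    # and divide by its square root.
    state["max_squared"] = max(state["max_squared"], state["avg_squared"])
    return w - lr * state["avg_grads"] / (math.sqrt(state["max_squared"]) + eps)

state = {"avg_grads": 0.0, "avg_squared": 0.0, "max_squared": 0.0}
w = 5.0
for _ in range(2000):
    grad = 2 * w  # gradient of the toy loss w ** 2
    w = amsgrad_step(w, grad, state, lr=0.05)
```

On this toy problem the gradients shrink as w approaches zero, so avg_squared decays while max_squared stays at its peak and the effective step size stops growing.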

Results

Amsgrad turns out to be very disappointing. In none of our experiments did we find that it helped the slightest bit, and even if it’s true that the minimum found by amsgrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (accuracy, f1 score…) always end up worse.
The proof of convergence for the Adam optimizer does not apply in deep learning (since it’s for convex problems), and the mistake they found in it mattered for synthetic experiments that have nothing to do with real-life problems. Actual tests show that when those avg_squared gradients want to decrease, it’s best for the final result to let them do so.
