Adam: A Method for Stochastic Optimization

Original paper · Kingma & Ba, 2014

Adam is an algorithm for first-order, gradient-based optimization used extensively in deep learning. Originally published in 2014, the paper combines the advantages of two then-recent optimization methods: AdaGrad's ability to deal with sparse gradients and RMSProp's ability to deal with non-stationary objectives. The method is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. Given today's vast landscape of ML libraries, layering endless abstractions on top of each other, it is very easy to use an optimizer like Adam without lifting a finger. Nonetheless, I find it interesting and important to understand these algorithms; a developer without an understanding of a field's history is like a tree without roots.

Adam computes individual adaptive learning rates for different parameters based on estimates of the first and second moments of the gradients. These moments are exponential moving averages that track the mean and the (uncentered) variance of the gradient. Two hyperparameters, β1 and β2, control the exponential decay rates of these moving averages. There is still a global learning rate, α, but it acts only as a scaling factor that bounds how large the individual parameter updates can be. Because the moving averages are initialized at zero, the moment estimates are biased towards zero, especially during the initial timesteps; Adam counteracts this with bias-corrected first and second moment estimates. Compared with RMSProp, Adam bases its updates on running averages of both the first and second moments of the gradients, whereas RMSProp merely rescales the gradient by a running average of its squared magnitude and lacks a bias-correction term.
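To make the update rule concrete, here is a minimal NumPy sketch of a single Adam step following the structure described above. The function name `adam_step`, the toy quadratic objective in the usage loop, and the default hyperparameters (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8, which are the defaults suggested in the paper) are my own choices for illustration, not code from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and updated moment estimates."""
    # Update biased first moment estimate (moving average of the gradient)
    m = beta1 * m + (1 - beta1) * grad
    # Update biased second raw moment estimate (moving average of the squared gradient)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct both estimates to counteract the zero initialization
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step, scaled by the global learning rate alpha
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.ones(5)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):  # timestep t starts at 1 so the bias correction is well-defined
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note that the bias-correction factors 1 - β1^t and 1 - β2^t approach 1 as t grows, so the correction mainly matters in the early timesteps, exactly where the zero-initialized averages would otherwise underestimate the true moments.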

The authors perform an empirical evaluation of the proposed method using popular machine learning models: logistic regression, multilayer neural networks, and convolutional neural networks. All experiments show a considerable improvement over previous state-of-the-art methods, both in convergence speed and in final performance. Overall, Adam has proven to be robust and well-suited to a wide range of non-convex optimization problems in machine learning.