How Does Batch Normalization Help Optimization?

Original paper · Santurkar et al. 2019

In this follow-up to the original batch normalization paper, the authors argue that BatchNorm’s effectiveness has little to do with reducing the so-called internal covariate shift. Instead, they identify a more fundamental effect of BatchNorm on the training process: it makes the optimization landscape significantly smoother. In a smoother landscape the gradient changes less over the course of a step, so it is a more predictive and stable guide to where the loss will decrease, which allows for faster training.
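To make "predictive gradient" concrete, here is a minimal sketch of one way to probe it: take a step along the negative gradient and measure how much the gradient itself changes. This is only an illustration on a toy quadratic loss, not the paper's experiment. The helper names (`gradient_change_along_step`, `make_quadratic_grad`) and the curvature values are assumptions chosen for the example, with the curvature `a` standing in for how sharp the landscape is.

```python
import numpy as np

def gradient_change_along_step(grad_fn, w, step_sizes):
    """L2 change in the gradient after stepping from w along -gradient.

    A small change means the gradient at w is still a good predictor
    of the gradient at the new point (a smoother landscape).
    """
    g = grad_fn(w)
    return [float(np.linalg.norm(grad_fn(w - eta * g) - g)) for eta in step_sizes]

def make_quadratic_grad(a):
    """Gradient of the toy loss f(w) = 0.5 * a * ||w||^2, which is a * w."""
    return lambda w: a * w

w0 = np.array([3.0, 4.0])   # hypothetical starting point, ||w0|| = 5
steps = [0.01, 0.1]         # hypothetical step sizes

# Higher curvature plays the role of a sharper, less smooth landscape.
sharp = gradient_change_along_step(make_quadratic_grad(10.0), w0, steps)
smooth = gradient_change_along_step(make_quadratic_grad(1.0), w0, steps)

for eta, s, m in zip(steps, sharp, smooth):
    print(f"step={eta}: sharp change={s:.3f}, smooth change={m:.3f}")
```

For this quadratic the change works out to `eta * a**2 * ||w0||`, so the lower-curvature loss yields a gradient that varies far less for the same step size. That is the sense in which a smoother landscape tolerates larger steps.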