Abstract
Batch Normalization (BatchNorm) is a widely adopted technique that enables
faster and more stable training of deep neural networks (DNNs). Despite its
pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly
understood. The popular belief is that this effectiveness stems from
controlling the change of the layers' input distributions during training to
reduce the so-called "internal covariate shift". In this work, we demonstrate
that such distributional stability of layer inputs has little to do with the
success of BatchNorm. Instead, we uncover a more fundamental impact of
BatchNorm on the training process: it makes the optimization landscape
significantly smoother. This smoothness induces a more predictive and stable
behavior of the gradients, allowing for faster training.
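
For reference, the BatchNorm operation discussed in the abstract is the standard per-feature transform of Ioffe & Szegedy (2015): normalize each activation by the mini-batch mean and variance, then apply a learned affine scale and shift. The sketch below is an illustrative NumPy version for a 2-D mini-batch; the function name, shapes, and example values are ours, not part of the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Illustrative BatchNorm forward pass over a mini-batch.

    x: array of shape (batch, features).
    gamma, beta: learned per-feature scale and shift, shape (features,).
    """
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to zero mean, unit variance
    return gamma * x_hat + beta               # learned affine transform

# Example: a random mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))          # approximately 0 and 1 per feature
```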