Abstract
The stochastic gradient descent (SGD) method and its variants are algorithms
of choice for many Deep Learning tasks. These methods operate in a small-batch
regime wherein a fraction of the training data, say $32$ to $512$ data points, is
sampled to compute an approximation to the gradient. It has been observed in
practice that when using a larger batch there is a degradation in the quality
of the model, as measured by its ability to generalize. We investigate the
cause for this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods tend to
converge to sharp minimizers of the training and testing functions, and, as is
well known, sharp minima lead to poorer generalization. In contrast,
small-batch methods consistently converge to flat minimizers, and our
experiments support a commonly held view that this is due to the inherent noise
in the gradient estimation. We discuss several strategies that attempt to help
large-batch methods eliminate this generalization gap.