Abstract
Increasing the batch size is a popular way to speed up neural network
training, but beyond some critical batch size, larger batch sizes yield
diminishing returns. In this work, we study how the critical batch size changes
based on properties of the optimization algorithm, including acceleration and
preconditioning, through two different lenses: large-scale experiments and
analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate
that optimization algorithms that employ preconditioning, specifically Adam and
K-FAC, result in much larger critical batch sizes than stochastic gradient
descent with momentum. We also demonstrate that the NQM captures many of the
essential features of real neural network training, despite being drastically
simpler to work with. The NQM predicts our results with preconditioned
optimizers, previous results with accelerated gradient descent, and other
results on optimal learning rates and large-batch training, making it a
useful tool to generate testable predictions about neural network optimization.
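To make the setup concrete, below is a minimal sketch of what a noisy quadratic model looks like in practice: plain SGD on a diagonal quadratic whose gradient noise shrinks as 1/sqrt(batch size). The eigenvalue spectrum, noise scale, and learning-rate rule here are illustrative assumptions for this sketch, not values from the paper, and the paper's actual NQM analysis (including preconditioned and accelerated variants) is more involved.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 100
    h = 1.0 / np.arange(1, d + 1)   # assumed curvature eigenvalues (hypothetical spectrum)
    sigma2 = h                      # assumed gradient-noise variance, matched to curvature

    def steps_to_target(batch_size, target=0.1, max_steps=200_000):
        # Illustrative rule: the learning rate grows linearly with batch size
        # until it hits the stability cap set by the largest eigenvalue (1.0).
        lr = min(1.0, 0.01 * batch_size)
        w = np.ones(d)
        for t in range(1, max_steps + 1):
            # Averaging a minibatch shrinks gradient-noise variance by 1/batch_size.
            noise = rng.normal(0.0, np.sqrt(sigma2 / batch_size))
            w = w - lr * (h * w + noise)        # noisy gradient step on 0.5*sum(h*w**2)
            if 0.5 * np.sum(h * w**2) < target:
                return t
        return max_steps

    for b in [1, 4, 16, 64, 256, 1024]:
        print(f"batch size {b:5d}: {steps_to_target(b):7d} SGD steps to loss < 0.1")

Under these assumptions, the printed step counts fall roughly linearly with batch size at first and then plateau once the learning rate hits its stability cap; that plateau is the critical batch size the abstract refers to.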
Description
[1907.04164] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model