Abstract
It is common practice to decay the learning rate. Here we show one can
usually obtain the same learning curve on both training and test sets by
instead increasing the batch size during training. This procedure is successful
for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum,
and Adam. It reaches equivalent test accuracies after the same number of
training epochs, but with fewer parameter updates, leading to greater
parallelism and shorter training times. We can further reduce the number of
parameter updates by increasing the learning rate $\epsilon$ and scaling the
batch size $B \propto \epsilon$. Finally, one can increase the momentum
coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly
reduce the test accuracy. Crucially, our techniques allow us to repurpose
existing training schedules for large batch training with no hyper-parameter
tuning. We train Inception-ResNet-V2 on ImageNet to $77\%$ validation accuracy
in under 2500 parameter updates, efficiently utilizing training batches of
65536 images.
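To make the scheduling idea concrete, here is a minimal illustrative sketch (not from the paper) in plain Python/NumPy: wherever a conventional step schedule would divide the learning rate $\epsilon$ by a factor, we instead multiply the batch size $B$ by that factor and keep $\epsilon$ fixed. The toy problem, function name, and all hyper-parameter values are hypothetical choices for demonstration only.

```python
import numpy as np

# Hypothetical toy least-squares problem: minimize ||Xw - y||^2 over w.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10000)

def sgd_with_batch_growth(epochs=30, lr=0.05, batch0=128,
                          boundaries=(10, 20), factor=5):
    """Illustrative helper (not from the paper): at each boundary
    epoch, instead of dividing the learning rate by `factor`,
    multiply the batch size by `factor` (capped at the dataset
    size). The learning rate stays constant throughout."""
    w = np.zeros(X.shape[1])
    batch = batch0
    for epoch in range(epochs):
        if epoch in boundaries:          # where a step schedule would decay lr
            batch = min(batch * factor, len(X))
        perm = rng.permutation(len(X))
        for i in range(0, len(X), batch):
            idx = perm[i:i + batch]
            # Mini-batch gradient of the squared error.
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad               # lr is never decayed
    return w

w = sgd_with_batch_growth()
print("final loss:", np.mean((X @ w - y) ** 2))
```

Because later epochs use larger batches, the total number of parameter updates is smaller than under the equivalent learning-rate-decay schedule, which is the source of the parallelism gains described above.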