Abstract
The stochastic gradient descent (SGD) method and its variants are algorithms
of choice for many Deep Learning tasks. These methods operate in a small-batch
regime wherein a fraction of the training data, say $32$-$512$ data points, is
sampled to compute an approximation to the gradient. It has been observed in
practice that when using a larger batch there is a degradation in the quality
of the model, as measured by its ability to generalize. We investigate the
cause of this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods tend to
converge to sharp minimizers of the training and testing functions; as is
well known, sharp minima lead to poorer generalization. In contrast,
small-batch methods consistently converge to flat minimizers, and our
experiments support a commonly held view that this is due to the inherent noise
in the gradient estimation. We discuss several strategies that may help
large-batch methods eliminate this generalization gap.
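
As a rough illustration of the batched gradient estimation described above, the following minimal NumPy sketch compares a noisy small-batch gradient with a nearly deterministic large-batch one. The toy least-squares data, the loss, and the specific batch sizes are illustrative assumptions for this sketch, not the paper's experimental setup.

```python
# Minimal sketch of mini-batch gradient estimation (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: 10,000 points, 20 features (assumed setup).
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

def batch_gradient(w, batch_size):
    """Gradient of the mean squared error on a randomly sampled batch.

    A small batch_size (e.g. 32-512, as in the abstract) yields a noisy
    estimate of the full gradient; a large batch_size yields a nearly
    deterministic one.
    """
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    residual = Xb @ w - yb
    return 2.0 * Xb.T @ residual / batch_size

# One SGD step in each regime (batch sizes are illustrative).
w = np.zeros(20)
lr = 0.01
g_small = batch_gradient(w, batch_size=64)     # small-batch regime: noisy
g_large = batch_gradient(w, batch_size=8_192)  # large-batch regime: near-exact
w = w - lr * g_small
```

Both estimators are unbiased; averaging many small-batch draws recovers the large-batch gradient. What differs is the per-step noise, which is the mechanism the abstract credits for steering small-batch methods toward flat minimizers.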