The stochastic gradient descent (SGD) method and its variants are algorithms
of choice for many Deep Learning tasks. These methods operate in a small-batch
regime wherein a fraction of the training data, say $32$-$512$ data points, is
sampled to compute an approximation to the gradient. It has been observed in
practice that when using a larger batch there is a degradation in the quality
of the model, as measured by its ability to generalize. We investigate the
cause for this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods tend to
converge to sharp minimizers of the training and testing functions - and as is
well known, sharp minima lead to poorer generalization. In contrast,
small-batch methods consistently converge to flat minimizers, and our
experiments support a commonly held view that this is due to the inherent noise
in the gradient estimation. We discuss several strategies to attempt to help
large-batch methods eliminate this generalization gap.
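As a rough illustration of the mini-batch gradient estimation described in the abstract (a generic sketch, not the authors' code; the synthetic least-squares problem and all names below are assumptions made for the example), the following Python snippet estimates the gradient from random mini-batches and shows how the estimate's noise shrinks as the batch size grows:

# A minimal sketch (not the paper's code) of mini-batch gradient estimation:
# the gradient of a simple least-squares loss is estimated from a random
# subset of the data, and the estimate's noise shrinks as the batch grows.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: loss(w) = mean_i 0.5 * (x_i @ w - y_i)^2
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def batch_gradient(w, batch_size):
    """Gradient of the loss estimated from a random mini-batch of the data."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
full_grad = X.T @ (X @ w - y) / n  # exact full-data gradient for comparison

for b in (32, 512, 8192):
    errs = [np.linalg.norm(batch_gradient(w, b) - full_grad) for _ in range(100)]
    print(f"batch {b:5d}: mean gradient-estimate error {np.mean(errs):.4f}")

On this toy problem the estimation error decays roughly like one over the square root of the batch size; this is the "inherent noise in the gradient estimation" that the abstract conjectures helps small-batch methods settle in flat, better-generalizing minimizers.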
@article{keskar2016largebatch,
abstract = {The stochastic gradient descent (SGD) method and its variants are algorithms
of choice for many Deep Learning tasks. These methods operate in a small-batch
regime wherein a fraction of the training data, say $32$-$512$ data points, is
sampled to compute an approximation to the gradient. It has been observed in
practice that when using a larger batch there is a degradation in the quality
of the model, as measured by its ability to generalize. We investigate the
cause for this generalization drop in the large-batch regime and present
numerical evidence that supports the view that large-batch methods tend to
converge to sharp minimizers of the training and testing functions - and as is
well known, sharp minima lead to poorer generalization. In contrast,
small-batch methods consistently converge to flat minimizers, and our
experiments support a commonly held view that this is due to the inherent noise
in the gradient estimation. We discuss several strategies to attempt to help
large-batch methods eliminate this generalization gap.},
author = {Keskar, Nitish Shirish and Mudigere, Dheevatsa and Nocedal, Jorge and Smelyanskiy, Mikhail and Tang, Ping Tak Peter},
biburl = {https://www.bibsonomy.org/bibtex/2d27090a1cbf1b2fc5d737f7f543f0aac/kirk86},
keywords = {deep-learning generalization},
note = {cite arxiv:1609.04836. Comment: Accepted as a conference paper at ICLR 2017},
title = {On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima},
url = {http://arxiv.org/abs/1609.04836},
year = 2016
}