Abstract
How does a 110-layer ResNet learn a high-complexity classifier using
relatively few training examples and a short training time? We present a theory
towards explaining this in terms of hierarchical learning. By hierarchical
learning we mean that the learner learns to represent a complicated target
function by decomposing it into a sequence of simpler functions, thereby reducing
sample and time complexity. This paper formally analyzes how multi-layer neural
networks can perform such hierarchical learning efficiently and automatically,
simply by applying stochastic gradient descent (SGD). On the conceptual side,
we present, to the best of our knowledge, the FIRST theory result indicating
how very deep neural networks can still be sample and time efficient on certain
hierarchical learning tasks when NO KNOWN non-hierarchical algorithms (such as
kernel methods, linear regression over feature mappings, tensor decomposition,
or sparse coding) are efficient. We establish a new principle called "backward
feature correction", which we believe is the key to understanding hierarchical
learning in multi-layer neural networks. On the technical side, we show that, for
regression and even for binary classification, for every input dimension $d >
0$, there is a concept class consisting of degree-$\omega(1)$ multivariate
polynomials such that, using $\omega(1)$-layer neural networks as learners, SGD
can learn any target function from this class in $poly(d)$ time using
$poly(d)$ samples to any $1/poly(d)$ error, through
learning to represent it as a composition of $\omega(1)$ layers of quadratic
functions. In contrast, we present lower bounds stating that several
non-hierarchical learners, including any kernel method and the neural tangent
kernel, must suffer $d^{\omega(1)}$ sample or time complexity to learn
functions in this concept class even to $d^{-0.01}$ error.
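To make the compositional structure concrete, the following schematic (an illustrative sketch, not the paper's exact construction) shows how stacking hypothetical quadratic maps $q_1, \dots, q_L$ yields a high-degree target even though each individual layer is simple:
$$ F(x) \;=\; q_L\bigl(q_{L-1}(\cdots q_1(x)\cdots)\bigr), \qquad \deg(q_\ell) = 2, \quad L = \omega(1), $$
so $\deg F$ can grow as large as $2^L = d^{\omega(1)}$ for suitable $L$, yet no single layer ever has to fit more than a quadratic function, which is what makes layer-by-layer (hierarchical) learning plausible for a depth-$L$ network while flat learners must handle the full degree at once.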
Description
[2001.04413] Backward Feature Correction: How Deep Learning Performs Deep Learning