Abstract
To analyze deep ReLU networks, we adopt a student-teacher setting in which an
over-parameterized student network learns from the output of a fixed teacher
network of the same depth via Stochastic Gradient Descent (SGD). Our
contributions are two-fold. First, we prove that when the gradient is small at
every training sample, student nodes specialize to teacher nodes in the
lowest layer under mild conditions. Second, an analysis of noisy recovery and
training dynamics in a 2-layer network shows that strong teacher nodes (those
with large fan-out weights) are learned first, while subtle teacher nodes are
left unlearned until a late stage of training. As a result, it can take a long
time to converge to these small-gradient critical points. Our analysis shows
that over-parameterization is a necessary condition for specialization to
happen at the critical points, and that it helps student nodes cover more
teacher nodes in fewer iterations. Both improve generalization. Unlike the
Neural Tangent Kernel and statistical-mechanics approaches, our approach
operates with finite width, mild over-parameterization (as long as there are
more student nodes than teacher nodes), and finite input dimension. Experiments
corroborate our findings. The code is released at
https://github.com/facebookresearch/luckmatters.
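As a rough illustration of the student-teacher setting described above, the following is a minimal sketch assuming PyTorch; the widths, learning rate, batch size, and Gaussian input distribution are illustrative choices of ours, not taken from the paper or the released code.

import torch
import torch.nn as nn

# Illustrative dimensions: input dim, teacher width, student width.
# The student is over-parameterized: more student nodes than teacher nodes.
d, m_teacher, m_student = 20, 5, 50

def two_layer_relu(width):
    # A 2-layer ReLU network with scalar output.
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

teacher = two_layer_relu(m_teacher)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is fixed throughout training

student = two_layer_relu(m_student)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

for step in range(10000):
    x = torch.randn(64, d)  # fresh mini-batch of inputs each step
    # The student is trained to match the teacher's output.
    loss = ((student(x) - teacher(x)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

In this sketch, specialization would mean that, after training, first-layer student weight vectors align with first-layer teacher weight vectors; the over-parameterized width (m_student > m_teacher) corresponds to the mild over-parameterization condition stated above.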