Abstract
To analyze deep ReLU networks, we adopt a student-teacher setting in which an
over-parameterized student network learns from the output of a fixed teacher
network of the same depth via Stochastic Gradient Descent (SGD). Our
contributions are two-fold. First, we prove that when the gradient is small at
every training sample, student nodes specialize to teacher nodes in the
lowest layer under mild conditions. Second, an analysis of noisy recovery and
training dynamics in a 2-layer network shows that strong teacher nodes (those
with large fan-out weights) are learned first, while subtle teacher nodes are
left unlearned until a late stage of training. As a result, it can take a long
time to converge to these small-gradient critical points. Our analysis shows
that over-parameterization is a necessary condition for specialization to
happen at the critical points, and that it helps student nodes cover more
teacher nodes in fewer iterations. Both improve generalization. Unlike the
Neural Tangent Kernel and statistical-mechanics approaches, our approach
operates with finite width, mild over-parameterization (as long as there are
more student nodes than teacher nodes), and finite input dimension. Experiments
corroborate our findings. The code is released at
https://github.com/facebookresearch/luckmatters.
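As a rough illustration of the student-teacher setting described above, the following is a minimal sketch assuming PyTorch; the widths, learning rate, batch size, and Gaussian input distribution are illustrative choices of ours, not taken from the paper or the released code.

import torch
import torch.nn as nn

# Illustrative dimensions: input dim, teacher width, student width.
# The student is over-parameterized: more student nodes than teacher nodes.
d, m_teacher, m_student = 20, 5, 50

def two_layer_relu(width):
    # A 2-layer ReLU network with scalar output.
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

teacher = two_layer_relu(m_teacher)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is fixed throughout training

student = two_layer_relu(m_student)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

for step in range(10000):
    x = torch.randn(64, d)  # fresh mini-batch of inputs each step
    # The student is trained to match the teacher's output.
    loss = ((student(x) - teacher(x)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

In this sketch, specialization would mean that, after training, first-layer student weight vectors align with first-layer teacher weight vectors; the over-parameterized width (m_student > m_teacher) corresponds to the mild over-parameterization condition stated above.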