Abstract
The local geometry of high-dimensional neural network loss landscapes can
both challenge our cherished theoretical intuitions and dramatically
impact the practical success of neural network training. Indeed, recent works
have observed four striking local properties of neural loss landscapes on
classification tasks: (1) the landscape exhibits exactly $C$ directions of high
positive curvature, where $C$ is the number of classes; (2) gradient directions
are largely confined to this extremely low-dimensional subspace of positive
Hessian curvature, leaving the vast majority of directions in weight space
unexplored; (3) gradient descent transiently explores intermediate regions of
higher positive curvature before eventually finding flatter minima; (4)
training can be successful even when confined to low-dimensional random
affine hyperplanes, as long as these hyperplanes intersect a Goldilocks zone of
higher-than-average curvature. We develop a simple theoretical model of
gradients and Hessians, justified by numerical experiments on architectures and
datasets used in practice, that simultaneously accounts for all four of
these surprising and seemingly unrelated properties. Our unified model provides
conceptual insights into the emergence of these properties and makes
connections with diverse topics in neural networks, random matrix theory, and
spin glasses, including the neural tangent kernel, BBP phase transitions, and
Derrida's random energy model.