The generalization error of random features regression: Precise asymptotics and double descent curve

S. Mei and A. Montanari.
(2019). arXiv:1908.05355. Comment: We added two sections in version 3. One section provides the precise asymptotics of the training error. The other section describes a Gaussian covariate model, which gives the same asymptotic test error as the random features model.

Abstract

Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called 'double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb{S}^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol{x}_i, y_i) \in \mathbb{S}^{d-1} \times \mathbb{R}$, $i \le n$. We perform ridge regression on $N$ random features of the form $\sigma(\boldsymbol{w}_a^{\mathsf T} \boldsymbol{x})$, $a \le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit $N, n, d \to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.
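The setup described in the abstract (ridge regression on $N$ random features $\sigma(\boldsymbol{w}_a^{\mathsf T} \boldsymbol{x})$ with fixed random first-layer weights) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the ReLU activation, the radius-$\sqrt{d}$ normalization of inputs and weights, the $1/\sqrt{d}$ scaling inside the activation, the noisy linear target, and the names `sample_sphere`, `features`, `ridge_fit`, and `run_trial` are all assumptions chosen for the example.

```python
import numpy as np

def sample_sphere(rng, m, d):
    """Draw m points uniformly on the sphere of radius sqrt(d) in R^d (assumed scaling)."""
    g = rng.standard_normal((m, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

def relu(t):
    # Assumed activation; the abstract only speaks of a generic sigma.
    return np.maximum(t, 0.0)

def features(X, W, sigma=relu):
    """Random features sigma(w_a^T x / sqrt(d)); rows of W are the fixed first-layer weights."""
    d = X.shape[1]
    return sigma(X @ W.T / np.sqrt(d))

def ridge_fit(Z, y, lam):
    """Second-layer coefficients minimizing ||y - Z a||^2 + lam * ||a||^2."""
    N = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)

def run_trial(n, N, d, lam=1e-3, noise=0.5, n_test=2000, seed=0):
    """Fit one random features model and return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)   # hypothetical linear target f(x) = <beta, x>
    W = sample_sphere(rng, N, d)                 # random first layer, never trained
    X = sample_sphere(rng, n, d)
    X_test = sample_sphere(rng, n_test, d)
    y = X @ beta + noise * rng.standard_normal(n)  # noisy training labels
    a = ridge_fit(features(X, W), y, lam)
    train_mse = np.mean((features(X, W) @ a - y) ** 2)
    test_mse = np.mean((features(X_test, W) @ a - X_test @ beta) ** 2)
    return train_mse, test_mse

if __name__ == "__main__":
    # Sweep the number of features N at fixed n and d.
    n, d = 100, 20
    for N in (20, 50, 90, 100, 110, 200, 800):
        train_mse, test_mse = run_trial(n, N, d)
        print(f"N/n = {N / n:5.2f}  train MSE = {train_mse:.4f}  test MSE = {test_mse:.4f}")
```

Sweeping `N` at fixed `n` and `d` with a small `lam`, as in the `__main__` block, is one way to look for the peak near the interpolation threshold and the second descent beyond it. A single seed at these small sizes is noisy, so averaging over several seeds would give a cleaner curve.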

BibTeX key: mei2019generalization
Entry type: article
Year: 2019
URL: http://arxiv.org/abs/1908.05355
Note: arXiv:1908.05355. Comment: We added two sections in version 3. One section provides the precise asymptotics of the training error. The other section describes a Gaussian covariate model, which gives the same asymptotic test error as the random features model.

Cite this publication

%0 Journal Article
%1 mei2019generalization
%A Mei, Song
%A Montanari, Andrea
%D 2019
%K generalization interpolation readings
%T The generalization error of random features regression: Precise asymptotics and double descent curve
%U http://arxiv.org/abs/1908.05355
%X Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb S^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol x_i, y_i) \in \mathbb S^{d-1} \times \mathbb R$, $i \le n$. We perform ridge regression on $N$ random features of the form $\sigma(\boldsymbol w_a^{\mathsf T}\boldsymbol x)$, $a \le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit $N, n, d \to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.

@article{mei2019generalization,
  abstract = {Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb S^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol x_i, y_i) \in \mathbb S^{d-1} \times \mathbb R$, $i \le n$. We perform ridge regression on $N$ random features of the form $\sigma(\boldsymbol w_a^{\mathsf T}\boldsymbol x)$, $a \le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit $N, n, d \to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.},
  added-at = {2020-02-20T18:42:52.000+0100},
  author = {Mei, Song and Montanari, Andrea},
  biburl = {https://www.bibsonomy.org/bibtex/2e4b7057208a5236c9169acdd14f2ce9f/kirk86},
  description = {[1908.05355] The generalization error of random features regression: Precise asymptotics and double descent curve},
  interhash = {996c6fd3db9ef82cf46c4885033ba9e8},
  intrahash = {e4b7057208a5236c9169acdd14f2ce9f},
  keywords = {generalization interpolation readings},
  note = {cite arxiv:1908.05355Comment: We added two sections in version 3. One section provides the precise asymptotics of the training error. The other section describes a Gaussian covariate model, which gives the same asymptotic test error as the random features model},
  timestamp = {2020-02-22T03:10:41.000+0100},
  title = {The generalization error of random features regression: Precise asymptotics and double descent curve},
  url = {http://arxiv.org/abs/1908.05355},
  year = 2019
}
