Dropout is a special case of the stochastic delta rule: faster and more
accurate deep learning
N. Frazier-Logue and S. Hanson (2018). arXiv:1808.03578. Comment: 6 pages, 7 figures; submitted to ICML.
Abstract
Multi-layer neural networks have led to remarkable performance on many kinds
of benchmark tasks in text, speech and image processing. Nonlinear parameter
estimation in hierarchical models is known to be subject to overfitting and
misspecification. One approach to these estimation and related problems (local
minima, collinearity, feature discovery, etc.) is called Dropout (Hinton et al.,
2012; Baldi et al., 2016). The Dropout algorithm removes hidden units according
to a Bernoulli random variable with probability $p$ prior to each update,
creating random "shocks" to the network that are averaged over updates. In this
paper we show that Dropout is a special case of a more general model
published originally in 1990 called the Stochastic Delta Rule, or SDR (Hanson,
1990). SDR redefines each weight in the network as a random variable with mean
$\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. Each weight random
variable is sampled on each forward activation, consequently creating an
exponential number of potential networks with shared weights. Both parameters
are updated according to prediction error, resulting in weight noise
injections that reflect a local history of prediction error and local model
averaging. SDR therefore implements a more sensitive, local, gradient-dependent
simulated annealing per weight, converging in the limit to a Bayes-optimal
network. Tests on standard benchmarks (CIFAR) using a modified version of
DenseNet show that SDR outperforms standard Dropout in test error by approximately
$17\%$ with DenseNet-BC 250 on CIFAR-100 and by approximately $12$-$14\%$ in smaller
networks. We also show that SDR reaches the accuracy that Dropout attains
in 100 epochs in as few as 35 epochs.
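The abstract describes the SDR mechanics only in outline: each weight is a Gaussian random variable, one realization is drawn per forward pass, and both the mean and the standard deviation are adapted from the prediction error. The sketch below is a minimal, hypothetical illustration of that scheme for a single linear layer in NumPy; the specific update rules (a plain gradient step on the mean, standard-deviation growth with the gradient magnitude followed by multiplicative decay) and the hyperparameter names alpha, beta, zeta are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

# Minimal sketch of the Stochastic Delta Rule (SDR) for one linear layer.
# Assumptions (illustrative, not taken verbatim from the paper): the mean
# follows the usual gradient step, the standard deviation grows with the
# gradient magnitude and decays by a factor zeta < 1, and alpha, beta, zeta
# are made-up hyperparameter names.

rng = np.random.default_rng(0)

# Toy regression data: y = X @ w_true + noise
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 1))
y = X @ w_true + 0.1 * rng.normal(size=(256, 1))

mu = np.zeros((8, 1))            # per-weight mean  mu_{w_ij}
sigma = 0.05 * np.ones((8, 1))   # per-weight std   sigma_{w_ij}
alpha, beta, zeta = 0.01, 0.02, 0.99

for epoch in range(200):
    # Sample one realization of every weight on the forward pass,
    # implicitly selecting one member of an exponential family of
    # weight-shared networks.
    w = mu + sigma * rng.normal(size=mu.shape)
    err = X @ w - y                      # prediction error
    grad = X.T @ err / len(X)            # dE/dw for squared loss

    mu -= alpha * grad                              # mean follows the gradient
    sigma = zeta * (sigma + beta * np.abs(grad))    # noise tracks local error, then anneals

# At prediction time the mean weights are used.
print("test-time weights:\n", mu)
```

In this toy form, standard Dropout corresponds, roughly, to replacing the adaptive per-weight Gaussian sampling with a fixed Bernoulli mask on the hidden units, which is the sense in which the paper treats it as a special case of SDR.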
Description
Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning
%0 Generic
%1 frazierlogue2018dropout
%A Frazier-Logue, Noah
%A Hanson, Stephen José
%D 2018
%K dl dropout
%T Dropout is a special case of the stochastic delta rule: faster and more
accurate deep learning
%U http://arxiv.org/abs/1808.03578
%X Multi-layer neural networks have led to remarkable performance on many kinds
of benchmark tasks in text, speech and image processing. Nonlinear parameter
estimation in hierarchical models is known to be subject to overfitting and
misspecification. One approach to these estimation and related problems (local
minima, collinearity, feature discovery, etc.) is called Dropout (Hinton et al.,
2012; Baldi et al., 2016). The Dropout algorithm removes hidden units according
to a Bernoulli random variable with probability $p$ prior to each update,
creating random "shocks" to the network that are averaged over updates. In this
paper we show that Dropout is a special case of a more general model
published originally in 1990 called the Stochastic Delta Rule, or SDR (Hanson,
1990). SDR redefines each weight in the network as a random variable with mean
$\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. Each weight random
variable is sampled on each forward activation, consequently creating an
exponential number of potential networks with shared weights. Both parameters
are updated according to prediction error, resulting in weight noise
injections that reflect a local history of prediction error and local model
averaging. SDR therefore implements a more sensitive, local, gradient-dependent
simulated annealing per weight, converging in the limit to a Bayes-optimal
network. Tests on standard benchmarks (CIFAR) using a modified version of
DenseNet show that SDR outperforms standard Dropout in test error by approximately
$17\%$ with DenseNet-BC 250 on CIFAR-100 and by approximately $12$-$14\%$ in smaller
networks. We also show that SDR reaches the accuracy that Dropout attains
in 100 epochs in as few as 35 epochs.
@misc{frazierlogue2018dropout,
abstract = {Multi-layer neural networks have led to remarkable performance on many kinds
of benchmark tasks in text, speech and image processing. Nonlinear parameter
estimation in hierarchical models is known to be subject to overfitting and
misspecification. One approach to these estimation and related problems (local
minima, collinearity, feature discovery, etc.) is called Dropout (Hinton et al.,
2012; Baldi et al., 2016). The Dropout algorithm removes hidden units according
to a Bernoulli random variable with probability $p$ prior to each update,
creating random "shocks" to the network that are averaged over updates. In this
paper we show that Dropout is a special case of a more general model
published originally in 1990 called the Stochastic Delta Rule, or SDR (Hanson,
1990). SDR redefines each weight in the network as a random variable with mean
$\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. Each weight random
variable is sampled on each forward activation, consequently creating an
exponential number of potential networks with shared weights. Both parameters
are updated according to prediction error, resulting in weight noise
injections that reflect a local history of prediction error and local model
averaging. SDR therefore implements a more sensitive, local, gradient-dependent
simulated annealing per weight, converging in the limit to a Bayes-optimal
network. Tests on standard benchmarks (CIFAR) using a modified version of
DenseNet show that SDR outperforms standard Dropout in test error by approximately
$17\%$ with DenseNet-BC 250 on CIFAR-100 and by approximately $12$-$14\%$ in smaller
networks. We also show that SDR reaches the accuracy that Dropout attains
in 100 epochs in as few as 35 epochs.},
added-at = {2019-02-11T11:10:32.000+0100},
author = {Frazier-Logue, Noah and Hanson, Stephen José},
biburl = {https://www.bibsonomy.org/bibtex/200853901983ad41bc7e5e371e0c39644/bechr7},
description = {Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning},
interhash = {74431432c91b350a930a384c2aeaab0a},
intrahash = {00853901983ad41bc7e5e371e0c39644},
keywords = {dl dropout},
note = {cite arxiv:1808.03578. Comment: 6 pages, 7 figures; submitted to ICML},
timestamp = {2019-02-11T11:10:32.000+0100},
title = {Dropout is a special case of the stochastic delta rule: faster and more
accurate deep learning},
url = {http://arxiv.org/abs/1808.03578},
year = 2018
}