Abstract
Adjusting the learning rate schedule in stochastic gradient methods is an
important unresolved problem which requires tuning in practice. If certain
parameters of the loss function such as smoothness or strong convexity
constants are known, theoretical learning rate schedules can be applied.
However, in practice, such parameters are not known, and the loss function of
interest is not convex in any case. The recently proposed batch normalization
reparametrization is widely adopted in most neural network architectures today
because, among other advantages, it is robust to the Lipschitz constant of
the gradient of the loss function, allowing one to set a large learning
rate without worry. Inspired by batch normalization, we propose a general
nonlinear update rule for the learning rate in batch and stochastic gradient
descent so that the learning rate can be initialized at a high value, and is
subsequently decreased according to gradient observations along the way. The
proposed method is shown to achieve robustness to the relationship between the
learning rate and the Lipschitz constant, and near-optimal convergence rates in
both the batch and stochastic settings ($O(1/T)$ for smooth loss in the batch
setting, and $O(1/\sqrt{T})$ for convex loss in the stochastic setting). We
also show through numerical evidence that such robustness of the proposed
method extends to highly nonconvex and possibly non-smooth loss functions in
deep learning problems. Our analysis establishes some first theoretical
understanding of the observed robustness of batch normalization and weight
normalization.
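
To make the idea concrete, below is a minimal sketch of gradient descent with a self-adjusting learning rate of the kind the abstract describes. The reciprocal update b <- b + ||g||^2 / b (effective learning rate 1/b) is an assumed illustrative instance of such a nonlinear rule, not a verbatim reproduction of the paper's algorithm; the function name `nonlinear_lr_gd` and all parameter choices are ours.

```python
import numpy as np

def nonlinear_lr_gd(grad, x0, b0=1.0, steps=200):
    """Gradient descent with a self-adjusting learning rate 1/b.

    Illustrative sketch: the learning rate starts at 1/b0 and is
    decreased according to gradient observations along the way, via
    the assumed reciprocal update b <- b + ||g||^2 / b.
    """
    x, b = np.asarray(x0, dtype=float), float(b0)
    for _ in range(steps):
        g = grad(x)
        b += np.dot(g, g) / b  # grow the denominator from observed gradients
        x -= g / b             # effective learning rate 1/b shrinks over time
    return x

# Toy check of the claimed robustness: the smoothness constant L is
# "unknown" to the method, and the initial learning rate 1/b0 = 2 would
# make plain gradient descent with a fixed step diverge for L = 4, yet
# the adaptive run converges without any tuning to L.
L = 4.0
x_star = nonlinear_lr_gd(lambda x: L * x, x0=[1.0, -0.5], b0=0.5, steps=200)
print(x_star)  # close to the minimizer [0, 0]
```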